History log of /linux-master/drivers/cpufreq/intel_pstate.c
Revision Date Author Comments
# 1f4b7fdd 19-Feb-2024 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Update default EPPs for Meteor Lake

Update default balanced_performance EPP to 115 and performance EPP to 16.

Changing the balanced_performance EPP has better performance/watt
compared to default powerup EPP value of 128.

Changing the performance EPP to 0x10 shows reduced power for similar
performance as EPP 0. On small form factor devices this is beneficial
as lower power results in lower CPU and skin temperature. This
results in reduced thermal throttling and higher performance.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 240a8da6 19-Feb-2024 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Allow model specific EPPs

The current implementation allows model specific EPP override for
balanced_performance. Add feature to allow model specific EPP for all
predefined EPP strings. For example for some CPU models, even changing
performance EPP has benefits

Use a mask of EPPs as driver_data instead of just balanced_performance.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 4615ac90 12-Feb-2024 Jiri Slaby (SUSE) <jirislaby@kernel.org>

cpufreq: intel_pstate: remove cpudata::prev_cummulative_iowait

Commit 09c448d3c61f ("cpufreq: intel_pstate: Use IOWAIT flag in Atom
algorithm") removed the last user of cpudata::prev_cummulative_iowait.
Remove the member too.

Found by https://github.com/jirislaby/clang-struct.

Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# f0a0fc10 17-Feb-2024 Doug Smythies <dsmythies@telus.net>

cpufreq: intel_pstate: fix pstate limits enforcement for adjust_perf call back

There is a loophole in pstate limit clamping for the intel_cpufreq CPU
frequency scaling driver (intel_pstate in passive mode), schedutil CPU
frequency scaling governor, HWP (HardWare Pstate) control enabled, when
the adjust_perf call back path is used.

Fix it.

Fixes: a365ab6b9dfb cpufreq: intel_pstate: Implement the ->adjust_perf() callback
Signed-off-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 192cdb1c 22-Jan-2024 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Refine computation of P-state for given frequency

On systems using HWP, if a given frequency is equal to the maximum turbo
frequency or the maximum non-turbo frequency, the HWP performance level
corresponding to it is already known and can be used directly without
any computation.

Accordingly, adjust the code to use the known HWP performance levels in
the cases mentioned above.

This also helps to avoid limiting CPU capacity artificially in some
cases when the BIOS produces the HWP_CAP numbers using a different
E-core-to-P-core performance scaling factor than expected by the kernel.

Fixes: f5c8cf2a4992 ("cpufreq: intel_pstate: hybrid: Use known scaling factor for P-cores")
Cc: 6.1+ <stable@vger.kernel.org> # 6.1+
Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# bde4f5ff 09-Jan-2024 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Update hybrid scaling factor for Meteor Lake

On some Meteor Lake platforms, maximum one core turbo frequency is not
observed. During hybrid performance to frequency conversion, the maximum
frequency is 100 MHz less. This results in requesting maximum frequency
100 MHz less.

For example when the max one core turbo is 4.9 GHz:
MSR HWP_CAPABILITIES shows highest performance ratio for P-core is 0x3E.
With the current scaling factor of 78741 (1.27x for converting frequency
to performance) results in max frequency of 4.8 GHz. This results in
capping the max scaling frequency as 4.8 GHz, which is 100 MHz less than
the desired.

Add capability to define per CPU model specific scaling factor and define
scaling factor of 80000 (1.25x for converting frequency to performance for
P-cores) for Meteor Lake.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Debug message adjustment, subject edit ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# e9501315 13-Dec-2023 Zhenguo Yao <yaozhenguo1@gmail.com>

cpufreq: intel_pstate: Add Emerald Rapids support in no-HWP mode

Users may disable HWP in firmware, in which case intel_pstate will give up
unless the CPU model is explicitly supported.

See also the following past commits:

- commit df51f287b5de ("cpufreq: intel_pstate: Add Sapphire Rapids support
in no-HWP mode")
- commit d8de7a44e11f ("cpufreq: intel_pstate: Add Skylake servers support")
- commit 706c5328851d ("cpufreq: intel_pstate: Add Cometlake support in
no-HWP mode")
- commit fbdc21e9b038 ("cpufreq: intel_pstate: Add Icelake servers support in
no-HWP mode")
- commit 71bb5c82aaae ("cpufreq: intel_pstate: Add Tigerlake support in
no-HWP mode")

Signed-off-by: Zhenguo Yao <yaozhenguo1@gmail.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 2719675f 20-Nov-2023 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Prioritize firmware-provided balance performance EPP

The platform firmware can provide a balance performance EPP value by
enabling HWP and programming the EPP to the desired value.

However, currently this only takes effect for processors listed in
intel_epp_balance_perf[], so in order to enable a new processor model
to utilize this mechanism, that table needs to be updated. It arguably
should not be necessary to modify the kernel to work properly with
every new generation of processors, though, and distributions that don't
always ship the most recent kernels should be able to run reasonably well
on new hardware without code changes.

For this reason, move the check to avoid updating the EPP when the balance
performance EPP is unmodified from the power-up default of 0x80 after the
check that allows the firmware-provided balance performance EPP value to
be retrieved. This will cause the code to always look for the firmware-
provided value before consulting intel_epp_balance_perf[] and the handling
of new hardware will not depend on whether or not that thable has been
updated yet.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 37b6ddba 07-Sep-2023 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Revise global turbo disable check

Setting global turbo flag based on CPU 0 P-state limits is problematic
as it limits max P-state request on every CPU on the system just based
on its P-state limits.

There are two cases in which global.turbo_disabled flag is set:
- When the MSR_IA32_MISC_ENABLE_TURBO_DISABLE bit is set to 1
in the MSR MSR_IA32_MISC_ENABLE. This bit can be only changed by
the system BIOS before power up.
- When the max non turbo P-state is same as max turbo P-state for CPU 0.

The second check is not a valid to decide global turbo state based on
the CPU 0. CPU 0 max turbo P-state can be same as max non turbo P-state,
but for other CPUs this may not be true.

There is no guarantee that max P-state limits are same for every CPU. This
is possible that during fusing max P-state for a CPU is constrained. Also
with the Intel Speed Select performance profile, CPU 0 may not be present
in all profiles. In this case the max non turbo and turbo P-state can be
set to the lowest possible P-state by the hardware when switched to
such profile. Since max non turbo and turbo P-state is same,
global.turbo_disabled flag will be set.

Once global.turbo_disabled is set, any scaling max and min frequency
update for any CPU will result in its max P-state constrained to the max
non turbo P-state.

Hence remove the check of max non turbo P-state equal to max turbo P-state
of CPU 0 to set global turbo disabled flag.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Subject edit ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# d51847ac 20-Aug-2023 Doug Smythies <dsmythies@telus.net>

cpufreq: intel_pstate: set stale CPU frequency to minimum

The intel_pstate CPU frequency scaling driver does not
use policy->cur and it is 0.
When the CPU frequency is outdated arch_freq_get_on_cpu()
will default to the nominal clock frequency when its call to
cpufreq_quick_getpolicy_cur returns the never updated 0.
Thus, the listed frequency might be outside of currently
set limits. Some users are complaining about the high
reported frequency, albeit stale, when their system is
idle and/or it is above the reduced maximum they have set.

This patch will maintain policy_cur for the intel_pstate
driver at the current minimum CPU frequency.

Reported-by: Yang Jie <yang.jie@linux.intel.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217597
Signed-off-by: Doug Smythies <dsmythies@telus.net>
[ rjw: White space damage fixes and comment adjustment ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 0fcfc9e5 29-Jun-2023 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Fix scaling for hybrid-capable systems with disabled E-cores

Some system BIOS configuration may provide option to disable E-cores.
As part of this change, CPUID feature for hybrid (Leaf 7 sub leaf 0,
EDX[15] = 0) may not be set. But HWP performance limits will still be
using a scaling factor like any other hybrid enabled system.

The current check for applying scaling factor will fail when hybrid
CPUID feature is not set and the only way to make sure that scaling
should be applied by checking CPPC nominal frequency and nominal
performance.

First, or systems predating Alder Lake, the CPPC nominal frequency and
nominal performance are 0, which can be used to distinguish those
systems from hybrid systems with disabled E-cores.

Second, if the CPPC nominal frequency and nominal performance are
defined, which indicates the need to use a special scaling factor, and
the nominal performance value multiplied by 100 is not equal to the
nominal frequency one, use hybrid scaling factor.

This can be done for all HWP systems without additional CPU model check.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Subject and changelog edits, removal of unneeded parens, comment
edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 03f44ffb 21-Jun-2023 Tero Kristo <tero.kristo@linux.intel.com>

cpufreq: intel_pstate: Fix energy_performance_preference for passive

If the intel_pstate driver is set to passive mode, then writing the
same value to the energy_performance_preference sysfs twice will fail.
This is caused by the wrong return value used (index of the matched
energy_perf_string), instead of the length of the passed in parameter.
Fix by forcing the internal return value to zero when the same
preference is passed in by user. This same issue is not present when
active mode is used for the driver.

Fixes: f6ebbcf08f37 ("cpufreq: intel_pstate: Implement passive mode with HWP enabled")
Reported-by: Niklas Neronin <niklas.neronin@intel.com>
Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 2744a63c 13-Mar-2023 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

cpufreq: move to use bus_get_dev_root()

Direct access to the struct bus_type dev_root pointer is going away soon
so replace that with a call to bus_get_dev_root() instead, which is what
it is there for.

Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: linux-pm@vger.kernel.org
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20230313182918.1312597-3-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 1f5e62f5 02-Mar-2023 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Enable HWP IO boost for all servers

The HWP IO boost results in slight improvements for IO performance on
both Ice Lake and Sapphire Rapid servers.

Currently there is a CPU model check for Skylake desktop and server along
with the ACPI PM profile for performance and enterprise servers to enable
IO boost.

Remove the CPU model check, so that all current server models enable HWP
IO boost by default.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 5bd289f6 24-Feb-2023 Nick Alcock <nick.alcock@oracle.com>

cpufreq: intel_pstate: remove MODULE_LICENSE in non-modules

Since commit 8b41fc4454e ("kbuild: create modules.builtin without
Makefile.modbuiltin or tristate.conf"), MODULE_LICENSE declarations
are used to identify modules. As a consequence, uses of the macro
in non-modules will cause modprobe to misidentify their containing
object file as a module when it is not (false positives), and modprobe
might succeed rather than failing with a suitable error message.

So remove it in the files in this commit, none of which can be built as
modules.

Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Suggested-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 60675225 21-Feb-2023 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Adjust balance_performance EPP for Sapphire Rapids

While the majority of server OS distributions are deployed with the
"performance" governor as the default, some distributions like Ubuntu use
the "powersave" governor by default.

While using the "powersave" governor in its default configuration on
Sapphire Rapids systems leads to much lower power, the performance is
lower by more than 25% for several workloads relative to the
"performance" governor.

A 37% difference has been reported by www.Phoronix.com [1].

This is a consequence of using a relatively high EPP value in the
default configuration of the "powersave" governor and the performance
can be made much closer to the "performance" governor's level by
adjusting the default EPP value. Based on experiments, with EPP of 0x00,
0x10, 0x20, the performance delta between the "powersave" governor and
the "performance" one is around 12%. However, the EPP of 0x20 reduces
average power by 18% with respect to the lower EPP values.

[Note that raising min_perf_pct in sysfs as high as 50% in addition to
adjusting EPP does not improve the performance any further.]

For this reason, change the EPP value corresponding to the the default
balance_performance setting for Sapphire Rapids to 0x20, which is
straightforward, because analogous default EPP adjustment has been
applied to Alder Lake and there is a way to set the balance_performance
EPP value in intel_pstate based on the processor model already.

The goal here is to limit the mean performance delta between the
"powersave" governor in the default configuration and the "performance"
governor for a wide variety of server workloadsto to around 10-12%. For
some bursty workloads, this delta can be still large, as the frequency
ramp-up will still lag when the "powersave" governor is in use
irrespective of the EPP setting, because the performance governor always
requests the maximum possible frequency.

Link: https://www.phoronix.com/review/centos-clear-spr/6 # [1]
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# e8a0e30b 28-Dec-2022 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Drop ACPI _PSS states table patching

After making acpi_processor_get_platform_limit() use the "no limit"
value for its frequency QoS request when _PPC returns 0, it is not
necessary to replace the frequency corresponding to the first _PSS
return package entry with the maximum turbo frequency of the given
CPU in intel_pstate_init_acpi_perf_limits() any more, so drop the
code doing that along with the comment explaining it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# df51f287 21-Nov-2022 Giovanni Gherdovich <ggherdovich@suse.cz>

cpufreq: intel_pstate: Add Sapphire Rapids support in no-HWP mode

Users may disable HWP in firmware, in which case intel_pstate wouldn't load
unless the CPU model is explicitly supported.

See also the following past commits:

commit d8de7a44e11f ("cpufreq: intel_pstate: Add Skylake servers support")
commit 706c5328851d ("cpufreq: intel_pstate: Add Cometlake support in
no-HWP mode")
commit fbdc21e9b038 ("cpufreq: intel_pstate: Add Icelake servers support in
no-HWP mode")
commit 71bb5c82aaae ("cpufreq: intel_pstate: Add Tigerlake support in
no-HWP mode")

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 21cdb6c1 27-Oct-2022 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Allow EPP 0x80 setting by the firmware

With the
"commit 3d13058ed2a6 ("cpufreq: intel_pstate: Use firmware default EPP")"
the firmware can set an EPP, which driver will not overwrite. But the
driver has a valid range check for:
0x40 > firmware epp < 0x80.
Hence firmware can't specify EPP of 0x80.

If the firmware didn't specify in the valid range, the driver has a
hard coded EPP of 102. But some Chrome hardware vendors don't want
this overwrite and wants to boot with chipset default EPP of 0x80 as
this improves battery life.

In this case they want to have capability to specify EPP of 0x80 via
the firmware. This require the valid range to include 0x80 also.
But here the valid range can't be simply extended to include 0x80 as
this is the chipset default EPP. Even without any firmware specifying
EPP, the chipset will always boot with EPP of 0x80.

To make sure that firmware specified EPP of 0x80 and not by the
chipset default, it will require additional check to make sure HWP
was enabled by the firmware before boot. Only way the firmware can
update EPP, is to enable HWP and update EPP via MSR_HWP_REQUEST.

This driver already checks, if the HWP is enabled by the firmware.
Use the same flag and extend valid range to include 0x80.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# f5c8cf2a 24-Oct-2022 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: hybrid: Use known scaling factor for P-cores

Commit 46573fd6369f ("cpufreq: intel_pstate: hybrid: Rework HWP
calibration") attempted to use the information from CPPC (the nominal
performance in particular) to obtain the scaling factor allowing the
frequency to be computed if the HWP performance level of the given CPU
is known or vice versa.

However, it turns out that on some platforms this doesn't work, because
the CPPC information on them does not align with the contents of the
MSR_HWP_CAPABILITIES registers.

This basically means that the only way to make intel_pstate work on all
of the hybrid platforms to date is to use the observation that on all
of them the scaling factor between the HWP performance levels and
frequency for P-cores is 78741 (approximately 100000/1.27). For
E-cores it is 100000, which is the same as for all of the non-hybrid
"core" platforms and does not require any changes.

Accordingly, make intel_pstate use 78741 as the scaling factor between
HWP performance levels and frequency for P-cores on all hybrid platforms
and drop the dependency of the HWP calibration code on CPPC.

Fixes: 46573fd6369f ("cpufreq: intel_pstate: hybrid: Rework HWP calibration")
Reported-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: 5.15+ <stable@vger.kernel.org> # 5.15+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 8dbab94d 24-Oct-2022 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Read all MSRs on the target CPU

Some of the MSR accesses in intel_pstate are carried out on the CPU
that is running the code, but the values coming from them are used
for the performance scaling of the other CPUs.

This is problematic, for example, on hybrid platforms where
MSR_TURBO_RATIO_LIMIT for P-cores and E-cores is different, so the
values read from it on a P-core are generally not applicable to E-cores
and the other way around.

For this reason, make the driver access all MSRs on the target CPU on
platforms using the "core" pstate_funcs callbacks which is the case for
all of the hybrid platforms released to date. For this purpose, pass
a CPU argument to the ->get_max(), ->get_max_physical(), ->get_min()
and ->get_turbo() pstate_funcs callbacks and from there pass it to
rdmsrl_on_cpu() or rdmsrl_safe_on_cpu() to access the MSR on the target
CPU.

Fixes: 46573fd6369f ("cpufreq: intel_pstate: hybrid: Rework HWP calibration")
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: 5.15+ <stable@vger.kernel.org> # 5.15+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 71bb5c82 06-Sep-2022 Doug Smythies <dsmythies@telus.net>

cpufreq: intel_pstate: Add Tigerlake support in no-HWP mode

Users may disable HWP in firmware, in which case intel_pstate wouldn't load
unless the CPU model is explicitly supported.

Add TIGERLAKE to the list of CPUs that can register intel_pstate while not
advertising the HWP capability. Without this change, an TIGERLAKE in no-HWP
mode could only use the acpi_cpufreq frequency scaling driver.

See also commits:
d8de7a44e11f: cpufreq: intel_pstate: Add Skylake servers support
fbdc21e9b038: cpufreq: intel_pstate: Add Icelake servers support in no-HWP mode
706c5328851d: cpufreq: intel_pstate: Add Cometlake support in no-HWP mode

Reported by: M. Cargi Ari <cagriari@pm.me>
Signed-off-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# bbd67f1b 02-May-2022 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Support Sapphire Rapids OOB mode

Prevent intel_pstate to load when OOB (Out Of Band) P-states mode is
enabled in Sapphire Rapids. The OOB identifying bits are same as the
prior generation CPUs like Ice Lake servers. So, also add Sapphire
Rapids to intel_pstate_cpu_oob_ids list.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# addca285 07-Apr-2022 Chen Yu <yu.c.chen@intel.com>

cpufreq: intel_pstate: Handle no_turbo in frequency invariance

Problem statement:

Once the user has disabled turbo frequency by

# echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

the cfs_rq's util_avg becomes quite small when compared with
CPU capacity.

Step to reproduce:

# echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

# ./x86_cpuload --count 1 --start 3 --timeout 100 --busy 99

would launch 1 thread and bind it to CPU3, lasting for 100 seconds,
with a CPU utilization of 99%. [1]

top result:
%Cpu3 : 98.4 us, 0.0 sy, 0.0 ni, 1.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st

check util_avg:
cat /sys/kernel/debug/sched/debug | grep "cfs_rq\[3\]" -A 20 | grep util_avg
.util_avg : 611

So the util_avg/cpu capacity is 611/1024, which is much smaller than
98.4% shown in the top result.

This might impact some logic in the scheduler. For example,
group_is_overloaded() would compare the group_capacity and group_util
in the sched group, to check if this sched group is overloaded or not.
With this gap, even when there is a nearly 100% workload, the sched
group will not be regarded as overloaded. Besides group_is_overloaded(),
there are also other victims. There is a ongoing work that aims to
optimize the task wakeup in a LLC domain. The main idea is to stop
searching idle CPUs if the sched domain is overloaded[2]. This proposal
also relies on the util_avg/CPU capacity to decide whether the LLC
domain is overloaded.

Analysis:

CPU frequency invariance has caused this difference. In summary,
the util_sum of cfs rq would decay quite fast when the CPU is in
idle, when the CPU frequency invariance is enabled.

The detail is as followed:

As depicted in update_rq_clock_pelt(), when the frequency invariance
is enabled, there would be two clock variables on each rq, clock_task
and clock_pelt:

The clock_pelt scales the time to reflect the effective amount of
computation done during the running delta time but then syncs back to
clock_task when rq is idle.

absolute time | 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16
@ max frequency ------******---------------******---------------
@ half frequency ------************---------************---------
clock pelt | 1| 2| 3| 4| 7| 8| 9| 10| 11|14|15|16

The fast decay of util_sum during idle is due to:

1. rq->clock_pelt is always behind rq->clock_task
2. rq->last_update is updated to rq->clock_pelt' after invoking
___update_load_sum()
3. Then the CPU becomes idle, the rq->clock_pelt' would be suddenly
increased a lot to rq->clock_task
4. Enters ___update_load_sum() again, the idle period is calculated by
rq->clock_task - rq->last_update, AKA, rq->clock_task - rq->clock_pelt'.
The lower the CPU frequency is, the larger the delta =
rq->clock_task - rq->clock_pelt' will be. Since the idle period will be
used to decay the util_sum only, the util_sum drops significantly during
idle period.

Proposal:

This symptom is not only caused by disabling turbo frequency, but it
would also appear if the user limits the max frequency at runtime.

Because, if the frequency is always lower than the max frequency,
CPU frequency invariance would decay the util_sum quite fast during
idle.

As some end users would disable turbo after boot up, this patch aims to
present this symptom and deals with turbo scenarios for now.

It might be ideal if CPU frequency invariance is aware of the max CPU
frequency (user specified) at runtime in the future.

Link: https://github.com/yu-chen-surf/x86_cpuload.git #1
Link: https://lore.kernel.org/lkml/20220310005228.11737-1-yu.c.chen@intel.com/ #2
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 3d13058e 10-Mar-2022 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Use firmware default EPP

For some specific platforms (E.g. AlderLake) the balance performance
EPP is updated from the hard coded value in the driver. This acts as
the default and balance_performance EPP. The purpose of this EPP
update is to reach maximum 1 core turbo frequency (when possible) out
of the box.

Although we can achieve the objective by using hard coded value in the
driver, there can be other EPP which can be better in terms of power.
But that will be very subjective based on platform and use cases.
This is not practical to have a per platform specific default hard coded
in the driver.

If a platform wants to specify default EPP, it can be set in the firmware.
If this EPP is not the chipset default of 0x80 (balance_perf_epp unless
driver changed it) and more performance oriented but not 0, the driver
can use this as the default and balanced_perf EPP. In this case no driver
update is required every time there is some new platform and default EPP.

If the firmware didn't update the EPP from the chipset default then
the hard coded value is used as per existing implementation.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# dfeeedc1 17-Dec-2021 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Update cpuinfo.max_freq on HWP_CAP changes

With HWP enabled, when the turbo range of performance levels is
disabled by the platform firmware, the CPU capacity is given by
the "guaranteed performance" field in MSR_HWP_CAPABILITIES which
is generally dynamic. When it changes, the kernel receives an HWP
notification interrupt handled by notify_hwp_interrupt().

When the "guaranteed performance" value changes in the above
configuration, the CPU performance scaling needs to be adjusted so
as to use the new CPU capacity in computations, which means that
the cpuinfo.max_freq value needs to be updated for that CPU.

Accordingly, modify intel_pstate_notify_work() to read
MSR_HWP_CAPABILITIES and update cpuinfo.max_freq to reflect the
new configuration (this update can be carried out even if the
configuration doesn't actually change, because it simply doesn't
matter then and it takes less time to update it than to do extra
checks to decide whether or not a change has really occurred).

Reported-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# b6e6f8be 16-Dec-2021 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Update EPP for AlderLake mobile

There is an expectation from users that they can get frequency specified
by cpufreq/cpuinfo_max_freq when conditions permit. But with AlderLake
mobile it may not be possible. This is possible that frequency is clipped
based on the system power-up EPP value. In this case users can update
cpufreq/energy_performance_preference to some performance oriented EPP to
limit clipping of frequencies.

To get out of box behavior as the prior generations of CPUs, update EPP
for AlderLake mobile CPUs on boot. On prior generations of CPUs EPP = 128
was enough to get maximum frequency, but with AlderLake mobile the
equivalent EPP is 102. Since EPP is model specific, this is possible that
they have different meaning on each generation of CPU.

The current EPP string "balance_performance" corresponds to EPP = 128.
Change the EPP corresponding to "balance_performance" to 102 for only
AlderLake mobile CPUs and update this on each CPU during boot.

To implement reuse epp_values[] array and update the modified EPP at the
index for BALANCE_PERFORMANCE. Add a dummy EPP_INDEX_DEFAULT to
epp_values[] to match indexes in the energy_perf_strings[].

After HWP PM is enabled also update EPP when "balance_performance" is
redefined for the very first time after the boot on each CPU. On
subsequent suspend/resume or offline/online the old EPP is restored,
so no specific action is needed.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 458b03f8 10-Dec-2021 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Drop redundant intel_pstate_get_hwp_cap() call

It is not necessary to call intel_pstate_get_hwp_cap() from
intel_pstate_update_perf_limits(), because it gets called from
intel_pstate_verify_cpu_policy() which is either invoked directly
right before intel_pstate_update_perf_limits(), in
intel_cpufreq_verify_policy() in the passive mode, or called
from driver callbacks in a sequence that causes it to be followed
by an immediate intel_pstate_update_perf_limits().

Namely, in the active mode intel_cpufreq_verify_policy() is called
by intel_pstate_verify_policy() which is the ->verify() callback
routine of intel_pstate and gets called by the cpufreq core right
before intel_pstate_set_policy(), which is the driver's ->setoplicy()
callback routine, where intel_pstate_update_perf_limits() is called.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 03c83982 18-Nov-2021 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: ITMT support for overclocked system

On systems with overclocking enabled, CPPC Highest Performance can be
hard coded to 0xff. In this case even if we have cores with different
highest performance, ITMT can't be enabled as the current implementation
depends on CPPC Highest Performance.

On such systems we can use MSR_HWP_CAPABILITIES maximum performance field
when CPPC.Highest Performance is 0xff.

Due to legacy reasons, we can't solely depend on MSR_HWP_CAPABILITIES as
in some older systems CPPC Highest Performance is the only way to identify
different performing cores.

Reported-by: Michael Larabel <Michael@MichaelLarabel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: Michael Larabel <Michael@MichaelLarabel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# ed38eb49 17-Nov-2021 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Fix active mode offline/online EPP handling

After commit 4adcf2e5829f ("cpufreq: intel_pstate: Add ->offline and
->online callbacks") the EPP value set by the "performance" scaling
algorithm in the active mode is not restored after an offline/online
cycle which replaces it with the saved EPP value coming from user
space.

Address this issue by forcing intel_pstate_hwp_set() to set a new
EPP value when it runs first time after online.

Fixes: 4adcf2e5829f ("cpufreq: intel_pstate: Add ->offline and ->online callbacks")
Link: https://lore.kernel.org/linux-pm/adc7132c8655bd4d1c8b6129578e931a14fe1db2.camel@linux.intel.com/
Reported-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: 5.9+ <stable@vger.kernel.org> # 5.9+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# cd23f02f 12-Nov-2021 Adamos Ttofari <attofari@amazon.de>

cpufreq: intel_pstate: Add Ice Lake server to out-of-band IDs

Commit fbdc21e9b038 ("cpufreq: intel_pstate: Add Icelake servers
support in no-HWP mode") enabled the use of Intel P-State driver
for Ice Lake servers.

But it doesn't cover the case when OS can't control P-States.

Therefore, for Ice Lake server, if MSR_MISC_PWR_MGMT bits 8 or 18
are enabled, then the Intel P-State driver should exit as OS can't
control P-States.

Fixes: fbdc21e9b038 ("cpufreq: intel_pstate: Add Icelake servers support in no-HWP mode")
Signed-off-by: Adamos Ttofari <attofari@amazon.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 074d0cdf 04-Nov-2021 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Clear HWP Status during HWP Interrupt enable

It is possible that some performance excursions happened before OS boot
or enable HWP interrupts. So clear MSR_HWP_STATUS bits when we enable
HWP interrupt. In this way a next excursion will results in a HWP
interrupt.

The status bits of MSR_HWP_STATUS must be cleared (0) by software so
that a new status condition change will cause the hardware to set the
bit again and issue the notification.

Fixes: 57577c996d73 ("cpufreq: intel_pstate: Process HWP Guaranteed change notification")
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 55210556 03-Nov-2021 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Fix unchecked MSR 0x773 access

It is possible that on some platforms HWP interrupts are disabled. In
that case accessing MSR 0x773 will result in warning.

So check X86_FEATURE_HWP_NOTIFY feature to access MSR 0x773. The other
places in code where this MSR is accessed, already checks this feature
except during disable path called during cpufreq offline and suspend
callbacks.

Fixes: 57577c996d73 ("cpufreq: intel_pstate: Process HWP Guaranteed change notification")
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# dbea75fe 03-Nov-2021 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Clear HWP desired on suspend/shutdown and offline

Commit a365ab6b9dfb ("cpufreq: intel_pstate: Implement the
->adjust_perf() callback") caused intel_pstate to use nonzero HWP
desired values in certain usage scenarios, but it did not prevent
them from being leaked into the confugirations in which HWP desired
is expected to be 0.

The failing scenarios are switching the driver from the passive
mode to the active mode and starting a new kernel via kexec() while
intel_pstate is running in the passive mode.

To address this issue, ensure that HWP desired will be cleared on
offline and suspend/shutdown.

Fixes: a365ab6b9dfb ("cpufreq: intel_pstate: Implement the ->adjust_perf() callback")
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Tested-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# c72bcf0a 26-Oct-2021 Zhang Rui <rui.zhang@intel.com>

cpufreq: intel_pstate: Fix cpu->pstate.turbo_freq initialization

Fix a problem in active mode that cpu->pstate.turbo_freq is initialized
only if HWP-to-frequency scaling factor is refined.

In passive mode, this problem is not exposed, because
cpu->pstate.turbo_freq is set again, later in
intel_cpufreq_cpu_init()->intel_pstate_get_hwp_cap().

Fixes: eb3693f0521e ("cpufreq: intel_pstate: hybrid: CPU-specific scaling factor")
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 57577c99 28-Sep-2021 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Process HWP Guaranteed change notification

It is possible that HWP guaranteed ratio is changed in response to
change in power and thermal limits. For example when Intel Speed Select
performance profile is changed or there is change in TDP, hardware can
send notifications. It is possible that the guaranteed ratio is
increased. This creates an issue when turbo is disabled, as the old
limits set in MSR_HWP_REQUEST are still lower and hardware will clip
to older limits.

This change enables HWP interrupt and process HWP interrupts. When
guaranteed is changed, calls cpufreq_update_policy() so that driver
callbacks are called to update to new HWP limits. This callback
is called from a delayed workqueue of 10ms to avoid frequent updates.

Although the scope of IA32_HWP_INTERRUPT is per logical cpu, on some
plaforms interrupt is generated on all CPUs. This is particularly a
problem during initialization, when the driver didn't allocated
data for other CPUs. So this change uses a cpumask of enabled CPUs and
process interrupts on those CPUs only.

When the cpufreq offline() or suspend() callback is called, HWP interrupt
is disabled on those CPUs and also cancels any pending work item.

Spin lock is used to protect data and processing shared with interrupt
handler. Here READ_ONCE(), WRITE_ONCE() macros are used to designate
shared data, even though spin lock act as an optimization barrier here.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: pablomh@gmail.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# d9a7e9df 12-Sep-2021 Doug Smythies <doug.smythies@gmail.com>

cpufreq: intel_pstate: Override parameters if HWP forced by BIOS

If HWP has been already been enabled by BIOS, it may be
necessary to override some kernel command line parameters.
Once it has been enabled it requires a reset to be disabled.

Suggested-by: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 46573fd6 04-Sep-2021 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: hybrid: Rework HWP calibration

The current HWP calibration for hybrid processors in intel_pstate is
fragile, because it depends too much on the information provided by
the platform firmware via CPPC which may not be reliable enough. It
also need not be so complicated.

In order to improve that mechanism and make it more resistant to
platform firmware issues, make it only use the CPPC nominal_perf
values to compute the HWP-to-frequency scaling factors for all
CPUs and possibly use the HWP_CAP highest_perf values to recompute
them if the ones derived from the CPPC nominal_perf values alone
appear to be too high.

Namely, fetch CPC.nominal_perf for all CPUs present in the system,
find the minimum one and use it as a reference for computing all of
the CPUs' scaling factors (using the observation that for the CPUs
having the minimum CPC.nominal_perf the HWP range of available
performance levels should be the same as the range of available
"legacy" P-states and so the HWP-to-frequency scaling factor for
them should be the same as the corresponding scaling factor used
for representing the P-state values in kHz).

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>


# dd7c46d6 07-Sep-2021 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Revert "cpufreq: intel_pstate: Process HWP Guaranteed change notification"

Revert commit d0e936adbd22 ("cpufreq: intel_pstate: Process HWP
Guaranteed change notification"), because it causes a NULL pointer
dereference to occur on Lenovo X1 gen9 laptops due to an HWP
guaranteed performance change interrupt arriving prematurely.

This feature will be revisited in the next cycle.

Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# d0e936ad 19-Aug-2021 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Process HWP Guaranteed change notification

It is possible that HWP guaranteed ratio is changed in response to
change in power and thermal limits. For example when Intel Speed Select
performance profile is changed or there is change in TDP, hardware can
send notifications. It is possible that the guaranteed ratio is
increased. This creates an issue when turbo is disabled, as the old
limits set in MSR_HWP_REQUEST are still lower and hardware will clip
to older limits.

This change enables HWP interrupt and process HWP interrupts. When
guaranteed is changed, calls cpufreq_update_policy() so that driver
callbacks are called to update to new HWP limits. This callback
is called from a delayed workqueue of 10ms to avoid frequent updates.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 09681a07 03-Aug-2021 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

cpufreq: Replace deprecated CPU-hotplug functions

The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().

Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 49d6feef 30-Jun-2021 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Combine ->stop_cpu() and ->offline()

Combine the ->stop_cpu() and ->offline() callback routines for
intel_pstate in the active mode so as to avoid setting the
->stop_cpu callback pointer which is going to be dropped from
the framework.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 8df71a7d 26-May-2021 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: hybrid: Fix build with CONFIG_ACPI unset

One of the previous commits introducing hybrid processor support to
intel_pstate broke build with CONFIG_ACPI unset.

Fix that and while at it make empty stubs of two functions related
to ACPI CPPC static inline and fix a spelling mistake in the name of
one of them.

Fixes: eb3693f0521e ("cpufreq: intel_pstate: hybrid: CPU-specific scaling factor")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested


# 706c5328 18-May-2021 Giovanni Gherdovich <ggherdovich@suse.cz>

cpufreq: intel_pstate: Add Cometlake support in no-HWP mode

Users may disable HWP in firmware, in which case intel_pstate wouldn't load
unless the CPU model is explicitly supported.

See also commit d8de7a44e11f ("cpufreq: intel_pstate: Add Skylake servers
support").

Suggested-by: Doug Smythies <dsmythies@telus.net>
Tested-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# fbdc21e9 18-May-2021 Giovanni Gherdovich <ggherdovich@suse.cz>

cpufreq: intel_pstate: Add Icelake servers support in no-HWP mode

Users may disable HWP in firmware, in which case intel_pstate wouldn't load
unless the CPU model is explicitly supported.

Add ICELAKE_X to the list of CPUs that can register intel_pstate while not
advertising the HWP capability. Without this change, an ICELAKE_X in no-HWP
mode could only use the acpi_cpufreq frequency scaling driver.

See also commit d8de7a44e11f ("cpufreq: intel_pstate: Add Skylake servers
support").

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# eb3693f0 12-May-2021 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: hybrid: CPU-specific scaling factor

The scaling factor between HWP performance levels and CPU frequency
may be different for different types of CPUs in a hybrid processor
and in general the HWP performance levels need not correspond to
"P-states" representing values that would be written to
MSR_IA32_PERF_CTL if HWP was disabled.

However, the policy limits control in cpufreq is defined in terms
of CPU frequency, so it is necessary to map the frequency limits set
through that interface to HWP performance levels with reasonable
accuracy and the behavior of that interface on hybrid processors
has to be compatible with its behavior on non-hybrid ones.

To address this problem, use the observations that (1) on hybrid
processors the sysfs interface can operate by mapping frequency
to "P-states" and translating those "P-states" to specific HWP
performance levels of the given CPU and (2) the scaling factor
between the MSR_IA32_PERF_CTL "P-states" and CPU frequency can be
regarded as a known value. Moreover, the mapping between the
HWP performance levels and CPU frequency can be assumed to be
linear and such that HWP performance level 0 correspond to the
frequency value of 0, so it is only necessary to know the
frequency corresponding to one specific HWP performance level
to compute the scaling factor applicable to all of them.

One possibility is to take the nominal performance value from CPPC,
if available, and use cpu_khz as the corresponding frequency. If
the CPPC capabilities interface is not there or the nominal
performance value provided by it is out of range, though, something
else needs to be done.

Namely, the guaranteed performance level either from CPPC or from
MSR_HWP_CAPABILITIES can be used instead, but the corresponding
frequency needs to be determined. That can be done by computing the
product of the (known) scaling factor between the MSR_IA32_PERF_CTL
P-states and CPU frequency (the PERF_CTL scaling factor) and the
P-state value referred to as the "TDP ratio".

If the HWP-to-frequency scaling factor value obtained in one of the
ways above turns out to be euqal to the PERF_CTL scaling factor, it
can be assumed that the number of HWP performance levels is equal to
the number of P-states and the given CPU can be handled as though
this was not a hybrid processor.

Otherwise, one more adjustment may still need to be made, because the
HWP-to-frequency scaling factor computed so far may not be accurate
enough (e.g. because the CPPC information does not match the exact
behavior of the processor). Specifically, in that case the frequency
corresponding to the highest HWP performance value from
MSR_HWP_CAPABILITIES (computed as the product of that value and the
HWP-to-frequency scaling factor) cannot exceed the frequency that
corresponds to the maximum 1-core turbo P-state value from
MSR_TURBO_RATIO_LIMIT (computed as the procuct of that value and the
PERF_CTL scaling factor) and the HWP-to-frequency scaling factor may
need to be adjusted accordingly.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# c3d175e4 12-May-2021 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: hybrid: Avoid exposing two global attributes

The turbo_pct and num_pstates sysfs attributes represent CPU
properties that may be different for differenty types of CPUs in
a hybrid processor, so avoid exposing them in that case.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# e5af36b2 21-Apr-2021 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Use HWP if enabled by platform firmware

It turns out that there are systems where HWP is enabled during
initialization by the platform firmware (BIOS), but HWP EPP support
is not advertised.

After commit 7aa1031223bc ("cpufreq: intel_pstate: Avoid enabling HWP
if EPP is not supported") intel_pstate refuses to use HWP on those
systems, but the fallback PERF_CTL interface does not work on them
either because of enabled HWP, and once enabled, HWP cannot be
disabled. Consequently, the users of those systems cannot control
CPU performance scaling.

Address this issue by making intel_pstate use HWP unconditionally if
it is enabled already when the driver starts.

Fixes: 7aa1031223bc ("cpufreq: intel_pstate: Avoid enabling HWP if EPP is not supported")
Reported-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: 5.9+ <stable@vger.kernel.org> # 5.9+


# b989bc0f 07-Apr-2021 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Simplify intel_pstate_update_perf_limits()

Because pstate.max_freq is always equal to the product of
pstate.max_pstate and pstate.scaling and, analogously,
pstate.turbo_freq is always equal to the product of
pstate.turbo_pstate and pstate.scaling, the result of the
max_policy_perf computation in intel_pstate_update_perf_limits() is
always equal to the quotient of policy_max and pstate.scaling,
regardless of whether or not turbo is disabled. Analogously, the
result of min_policy_perf in intel_pstate_update_perf_limits() is
always equal to the quotient of policy_min and pstate.scaling.

Accordingly, intel_pstate_update_perf_limits() need not check
whether or not turbo is enabled at all and in order to compute
max_policy_perf and min_policy_perf it can always divide policy_max
and policy_min, respectively, by pstate.scaling. Make it do so.

While at it, move the definition and initialization of the
turbo_max local variable to the code branch using it.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>


# de5bcf40 16-Mar-2021 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Clean up frequency computations

Notice that some computations related to frequency in intel_pstate
can be simplified if (a) intel_pstate_get_hwp_max() updates the
relevant members of struct cpudata by itself and (b) the "turbo
disabled" check is moved from it to its callers, so modify the code
accordingly and while at it rename intel_pstate_get_hwp_max() to
intel_pstate_get_hwp_cap() which better reflects its purpose and
provide a simplified variat of it, __intel_pstate_get_hwp_cap(),
suitable for the initialization path.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>


# 75a8d877 16-Jan-2021 Nigel Christian <nigel.l.christian@gmail.com>

cpufreq: intel_pstate: Remove repeated word

In the comment for trace in passive mode there is an
unnecessary "the". Eradicate it.

Signed-off-by: Nigel Christian <nigel.l.christian@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 6f67e060 11-Jan-2021 Chen Yu <yu.c.chen@intel.com>

cpufreq: intel_pstate: Get per-CPU max freq via MSR_HWP_CAPABILITIES if available

Currently, when turbo is disabled (either by BIOS or by the user),
the intel_pstate driver reads the max non-turbo frequency from the
package-wide MSR_PLATFORM_INFO(0xce) register.

However, on asymmetric platforms it is possible in theory that small
and big core with HWP enabled might have different max non-turbo CPU
frequency, because MSR_HWP_CAPABILITIES is per-CPU scope according
to Intel Software Developer Manual.

The turbo max freq is already per-CPU in current code, so make
similar change to the max non-turbo frequency as well.

Reported-by: Wendy Wang <wendy.wang@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
[ rjw: Subject and changelog edits ]
Cc: 4.18+ <stable@vger.kernel.org> # 4.18+: a45ee4d4e13b: cpufreq: intel_pstate: Change intel_pstate_get_hwp_max() argument
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 597ffbc8 07-Jan-2021 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Rename two functions

Rename intel_cpufreq_adjust_hwp() and intel_cpufreq_adjust_perf_ctl()
to intel_cpufreq_hwp_update() and intel_cpufreq_perf_ctl_update(),
respectively, to avoid possible confusion with the ->adjist_perf()
callback function, intel_cpufreq_adjust_perf().

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>


# a45ee4d4 07-Jan-2021 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Change intel_pstate_get_hwp_max() argument

All of the callers of intel_pstate_get_hwp_max() access the struct
cpudata object that corresponds to the given CPU already and the
function itself needs to access that object (in order to update
hwp_cap_cached), so modify the code to pass a struct cpudata pointer
to it instead of the CPU number.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>


# 9dd04ec6 07-Jan-2021 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Always read hwp_cap_cached with READ_ONCE()

Because intel_pstate_get_hwp_max() which updates hwp_cap_cached
may run in parallel with the readers of it, annotate all of the
read accesses to it with READ_ONCE().

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>


# c4151604 20-Dec-2020 Lukas Bulwahn <lukas.bulwahn@gmail.com>

cpufreq: intel_pstate: remove obsolete functions

percent_fp() was used in intel_pstate_pid_reset(), which was removed in
commit 9d0ef7af1f2d ("cpufreq: intel_pstate: Do not use PID-based P-state
selection") and hence, percent_fp() is unused since then.

percent_ext_fp() was last used in intel_pstate_update_perf_limits(), which
was refactored in commit 1a4fe38add8b ("cpufreq: intel_pstate: Remove
max/min fractions to limit performance"), and hence, percent_ext_fp() is
unused since then.

make CC=clang W=1 points us those unused functions:

drivers/cpufreq/intel_pstate.c:79:23: warning: unused function 'percent_fp' [-Wunused-function]
static inline int32_t percent_fp(int percent)
^

drivers/cpufreq/intel_pstate.c:94:23: warning: unused function 'percent_ext_fp' [-Wunused-function]
static inline int32_t percent_ext_fp(int percent)
^

Remove those obsolete functions.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 17ffd358 05-Jan-2021 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Use HWP capabilities in intel_cpufreq_adjust_perf()

If turbo P-states cannot be used, either due to the configuration of
the processor, or because intel_pstate is not allowed to used them,
the maximum available P-state with HWP enabled corresponds to the
HWP_CAP.GUARANTEED value which is not static. It can be adjusted by
an out-of-band agent or during an Intel Speed Select performance
level change, so long as it remains less than or equal to
HWP_CAP.MAX.

However, if turbo P-states cannot be used, intel_cpufreq_adjust_perf()
always uses pstate.max_pstate (set during the initialization of the
driver only) as the maximum available P-state, so it may miss a change
of the HWP_CAP.GUARANTEED value.

Prevent that from happening by modifyig intel_cpufreq_adjust_perf()
to always read the "guaranteed" and "maximum turbo" performance
levels from the cached HWP_CAP value.

Fixes: a365ab6b9dfb ("cpufreq: intel_pstate: Implement the ->adjust_perf() callback")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>


# be128345 29-Dec-2020 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Fix fast-switch fallback path

When sugov_update_single_perf() falls back to the "frequency"
path due to the missing scale-invariance, it will call
cpufreq_driver_fast_switch() via sugov_fast_switch()
and the driver's ->fast_switch() callback will be invoked,
so it must not be NULL.

However, after commit a365ab6b9dfb ("cpufreq: intel_pstate: Implement
the ->adjust_perf() callback") intel_pstate sets ->fast_switch() to
NULL when it is going to use intel_cpufreq_adjust_perf(), which is a
mistake, because on x86 the scale-invariance may be turned off
dynamically, so modify it to retain the original ->adjust_perf()
callback pointer.

Fixes: a365ab6b9dfb ("cpufreq: intel_pstate: Implement the ->adjust_perf() callback")
Reported-by: Kenneth R. Crudup <kenny@panix.com>
Tested-by: Kenneth R. Crudup <kenny@panix.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# e40ad84c 17-Dec-2020 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Use most recent guaranteed performance values

When turbo has been disabled by the BIOS, but HWP_CAP.GUARANTEED is
changed later, user space may want to take advantage of this increased
guaranteed performance.

HWP_CAP.GUARANTEED is not a static value. It can be adjusted by an
out-of-band agent or during an Intel Speed Select performance level
change. The HWP_CAP.MAX is still the maximum achievable performance
with turbo disabled by the BIOS, so HWP_CAP.GUARANTEED can still
change as long as it remains less than or equal to HWP_CAP.MAX.

When HWP_CAP.GUARANTEED is changed, the sysfs base_frequency
attribute shows the most recent guaranteed frequency value. This
attribute can be used by user space software to update the scaling
min/max limits of the CPU.

Currently, the ->setpolicy() callback already uses the latest
HWP_CAP values when setting HWP_REQ, but the ->verify() callback will
restrict the user settings to the to old guaranteed performance value
which prevents user space from making use of the extra CPU capacity
theoretically available to it after increasing HWP_CAP.GUARANTEED.

To address this, read HWP_CAP in intel_pstate_verify_cpu_policy()
to obtain the maximum P-state that can be used and use that to
confine the policy max limit instead of using the cached and
possibly stale pstate.max_freq value for this purpose.

For consistency, update intel_pstate_update_perf_limits() to use the
maximum available P-state returned by intel_pstate_get_hwp_max() to
compute the maximum frequency instead of using the return value of
intel_pstate_get_max_freq() which, again, may be stale.

This issue is a side-effect of fixing the scaling frequency limits in
commit eacc9c5a927e ("cpufreq: intel_pstate: Fix intel_pstate_get_hwp_max()
for turbo disabled") which corrected the setting of the reduced scaling
frequency values, but caused stale HWP_CAP.GUARANTEED to be used in
the case at hand.

Fixes: eacc9c5a927e ("cpufreq: intel_pstate: Fix intel_pstate_get_hwp_max() for turbo disabled")
Reported-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: 5.8+ <stable@vger.kernel.org> # 5.8+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# a365ab6b 14-Dec-2020 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Implement the ->adjust_perf() callback

Make intel_pstate expose the ->adjust_perf() callback when it
operates in the passive mode with HWP enabled which causes the
schedutil governor to use that callback instead of ->fast_switch().

The minimum and target performance-level values passed by the
governor to ->adjust_perf() are converted to HWP.REQ.MIN and
HWP.REQ.DESIRED, respectively, which allows the processor to
adjust its configuration to maximize energy-efficiency while
providing sufficient capacity.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 2554c32f 12-Nov-2020 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Simplify intel_cpufreq_update_pstate()

Avoid doing the same assignment in both branches of a conditional,
do it after the whole conditional instead.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# fcb3a1ab 10-Nov-2020 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Take CPUFREQ_GOV_STRICT_TARGET into account

Make intel_pstate take the new CPUFREQ_GOV_STRICT_TARGET governor
flag into account when it operates in the passive mode with HWP
enabled, so as to fix the "powersave" governor behavior in that
case (currently, HWP is allowed to scale the performance all the
way up to the policy max limit when the "powersave" governor is
used, but it should be constrained to the policy min limit then).

Fixes: f6ebbcf08f37 ("cpufreq: intel_pstate: Implement passive mode with HWP enabled")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 5.9+ <stable@vger.kernel.org> # 5.9+: 9a2a9ebc0a75 cpufreq: Introduce governor flags
Cc: 5.9+ <stable@vger.kernel.org> # 5.9+: 218f66870181 cpufreq: Introduce CPUFREQ_GOV_STRICT_TARGET
Cc: 5.9+ <stable@vger.kernel.org> # 5.9+: ea9364bbadf1 cpufreq: Add strict_target to struct cpufreq_policy


# e0be38ed 23-Oct-2020 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Avoid missing HWP max updates in passive mode

If the cpufreq policy max limit is changed when intel_pstate operates
in the passive mode with HWP enabled and the "powersave" governor is
used on top of it, the HWP max limit is not updated as appropriate.

Namely, in the "powersave" governor case, the target P-state
is always equal to the policy min limit, so if the latter does
not change, intel_cpufreq_adjust_hwp() is not invoked to update
the HWP Request MSR due to the "target_pstate != old_pstate" check
in intel_cpufreq_update_pstate(), so the HWP max limit is not
updated as a result.

Also, if the CPUFREQ_NEED_UPDATE_LIMITS flag is not set for the
driver and the target frequency does not change along with the
policy max limit, the "target_freq == policy->cur" check in
__cpufreq_driver_target() prevents the driver's ->target() callback
from being invoked at all, so the HWP max limit is not updated.

To prevent that occurring, set the CPUFREQ_NEED_UPDATE_LIMITS flag
in the intel_cpufreq driver structure if HWP is enabled and modify
intel_cpufreq_update_pstate() to do the "target_pstate != old_pstate"
check only in the non-HWP case and let intel_cpufreq_adjust_hwp()
always run in the HWP case (it will update HWP Request only if the
cached value of the register is different from the new one including
the limits, so if neither the target P-state value nor the max limit
changes, the register write will still be avoided).

Fixes: f6ebbcf08f37 ("cpufreq: intel_pstate: Implement passive mode with HWP enabled")
Reported-by: Zhang Rui <rui.zhang@intel.com>
Cc: 5.9+ <stable@vger.kernel.org> # 5.9+: 1c534352f47f cpufreq: Introduce CPUFREQ_NEED_UPDATE_LIMITS ...
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>


# cdc1719c 08-Oct-2020 Chen Yu <yu.c.chen@intel.com>

cpufreq: intel_pstate: Delete intel_pstate sysfs if failed to register the driver

There is a corner case that if the intel_pstate driver fails to be
registered (might be due to invalid MSR access) and acpi_cpufreq
takse over, the intel_pstate sysfs interface is still populated
which may confuse user space (turbostat for example):

grep . /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
acpi-cpufreq

grep . /sys/devices/system/cpu/intel_pstate/*
/sys/devices/system/cpu/intel_pstate/max_perf_pct:0
/sys/devices/system/cpu/intel_pstate/min_perf_pct:0
grep: /sys/devices/system/cpu/intel_pstate/no_turbo: Resource temporarily unavailable
grep: /sys/devices/system/cpu/intel_pstate/num_pstates: Resource temporarily unavailable
/sys/devices/system/cpu/intel_pstate/status:off
grep: /sys/devices/system/cpu/intel_pstate/turbo_pct: Resource temporarily unavailable

The mere presence of the intel_pstate sysfs interface does not mean
that intel_pstate is in use (for example, echo "off" to "status"),
but it should not be created in the failing case.

Fix this issue by deleting the intel_pstate sysfs if the driver
registration fails.

Reported-by: Wendy Wang <wendy.wang@intel.com>
Suggested-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com
[ rjw: Refactor code to avoid jumps, change function name, changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# fc7d1755 27-Sep-2020 Zhang Rui <rui.zhang@intel.com>

cpufreq: intel_pstate: Fix missing return statement

Fix missing return statement when writing "off" to intel_pstate status
sysfs I/F.

Fixes: 55671ea3257a ("cpufreq: intel_pstate: Free memory only when turning off")
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# eacc9c5a 31-Aug-2020 Francisco Jerez <currojerez@riseup.net>

cpufreq: intel_pstate: Fix intel_pstate_get_hwp_max() for turbo disabled

This fixes the behavior of the scaling_max_freq and scaling_min_freq
sysfs files in systems which had turbo disabled by the BIOS.

Caleb noticed that the HWP is programmed to operate in the wrong
P-state range on his system when the CPUFREQ policy min/max frequency
is set via sysfs. This seems to be because in his system
intel_pstate_get_hwp_max() is returning the maximum turbo P-state even
though turbo was disabled by the BIOS, which causes intel_pstate to
scale kHz frequencies incorrectly e.g. setting the maximum turbo
frequency whenever the maximum guaranteed frequency is requested via
sysfs.

Tested-by: Caleb Callaway <caleb.callaway@intel.com>
Signed-off-by: Francisco Jerez <currojerez@riseup.net>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Minor subject edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 55671ea3 01-Sep-2020 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Free memory only when turning off

When intel_pstate switches the operation mode from "active" to
"passive" or the other way around, freeing its data structures
representing CPUs and allocating them again from scratch is not
necessary and wasteful. Moreover, if these data structures are
preserved, the cached HWP Request MSR value from there may be
written to the MSR to start with to reinitialize it and help to
restore the EPP value set previously (it is set to 0xFF when CPUs
go offline to allow their SMT siblings to use the full range of
EPP values and that also happens when the driver gets unregistered).

Accordingly, modify the driver to only do a full cleanup on driver
object registration errors and when its status is changed to "off"
via sysfs and to write the cached HWP Request MSR value back to
the MSR on CPU init if the data structure representing the given
CPU is still there.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>


# 4adcf2e5 01-Sep-2020 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Add ->offline and ->online callbacks

Add ->offline and ->online driver callbacks to prepare for taking a
CPU offline and to restore its working configuration when it goes
back online, respectively, to avoid invoking the ->init callback on
every CPU online which is quite a bit of unnecessary overhead.

Define ->offline and ->online so that they can be used in the
passive mode as well as in the active mode and because ->offline
will do the majority of ->stop_cpu work, the passive mode does
not need that callback any more, so drop it from there.

Also modify the active mode ->suspend and ->resume callbacks to
prevent them from interfering with the new ->offline and ->online
ones in case the latter are invoked withing the system-wide suspend
and resume code flow and make the passive mode use them too.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>


# b388eb58 27-Aug-2020 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Tweak the EPP sysfs interface

Modify the EPP sysfs interface to reject attempts to change the EPP
to values different from 0 ("performance") in the active mode with
the "performance" policy (ie. scaling_governor set to "performance"),
to avoid situations in which the kernel appears to discard data
passed to it via the EPP sysfs attribute.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>


# c27a0ccc 27-Aug-2020 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Update cached EPP in the active mode

Make intel_pstate update the cached EPP value when setting the EPP
via sysfs in the active mode just like it is the case in the passive
mode, for consistency, but also for the benefit of subsequent
changes.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>


# 43298db3 20-Aug-2020 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Refuse to turn off with HWP enabled

After commit f6ebbcf08f37 ("cpufreq: intel_pstate: Implement passive
mode with HWP enabled") it is possible to change the driver status
to "off" via sysfs with HWP enabled, which effectively causes the
driver to unregister itself, but HWP remains active and it forces the
minimum performance, so even if another cpufreq driver is loaded,
it will not be able to control the CPU frequency.

For this reason, make the driver refuse to change the status to
"off" with HWP enabled.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>


# f6ebbcf0 06-Aug-2020 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Implement passive mode with HWP enabled

Allow intel_pstate to work in the passive mode with HWP enabled and
make it set the HWP minimum performance limit (HWP floor) to the
P-state value given by the target frequency supplied by the cpufreq
governor, so as to prevent the HWP algorithm and the CPU scheduler
from working against each other, at least when the schedutil governor
is in use, and update the intel_pstate documentation accordingly.

Among other things, this allows utilization clamps to be taken
into account, at least to a certain extent, when intel_pstate is
in use and makes it more likely that sufficient capacity for
deadline tasks will be provided.

After this change, the resulting behavior of an HWP system with
intel_pstate in the passive mode should be close to the behavior
of the analogous non-HWP system with intel_pstate in the passive
mode, except that the HWP algorithm is generally allowed to make the
CPU run at a frequency above the floor P-state set by intel_pstate in
the entire available range of P-states, while without HWP a CPU can
run in a P-state above the requested one if the latter falls into the
range of turbo P-states (referred to as the turbo range) or if the
P-states of all CPUs in one package are coordinated with each other
at the hardware level.

[Note that in principle the HWP floor may not be taken into account
by the processor if it falls into the turbo range, in which case the
processor has a license to choose any P-state, either below or above
the HWP floor, just like a non-HWP processor in the case when the
target P-state falls into the turbo range.]

With this change applied, intel_pstate in the passive mode assumes
complete control over the HWP request MSR and concurrent changes of
that MSR (eg. via the direct MSR access interface) are overridden by
it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>


# 4daca379 03-Aug-2020 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Fix cpuinfo_max_freq when MSR_TURBO_RATIO_LIMIT is 0

The MSR_TURBO_RATIO_LIMIT can be 0. This is not an error. User can update
this MSR via BIOS settings on some systems or can use msr tools to update.
Also some systems boot with value = 0.

This results in display of cpufreq/cpuinfo_max_freq wrong. This value
will be equal to cpufreq/base_frequency, even though turbo is enabled.

But platform will still function normally in HWP mode as we get max
1-core frequency from the MSR_HWP_CAPABILITIES. This MSR is already used
to calculate cpu->pstate.turbo_freq, which is used for to set
policy->cpuinfo.max_freq. But some other places cpu->pstate.turbo_pstate
is used. For example to set policy->max.

To fix this, also update cpu->pstate.turbo_pstate when updating
cpu->pstate.turbo_freq.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# de002c55 28-Jul-2020 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Fix EPP setting via sysfs in active mode

Because intel_pstate_set_energy_pref_index() reads and writes the
MSR_HWP_REQUEST register without using the cached value of it used by
intel_pstate_hwp_boost_up() and intel_pstate_hwp_boost_down(), those
functions may overwrite the value written by it and so the EPP value
set via sysfs may be lost.

To avoid that, make intel_pstate_set_energy_pref_index() take the
cached value of MSR_HWP_REQUEST just like the other two routines
mentioned above and update it with the new EPP value coming from
user space in addition to updating the MSR.

Note that the MSR itself still needs to be updated too in case
hwp_boost is unset or the boosting mechanism is not active at the
EPP change time.

Fixes: e0efd5be63e8 ("cpufreq: intel_pstate: Add HWP boost utility and sched util hooks")
Reported-by: Francisco Jerez <currojerez@riseup.net>
Cc: 4.18+ <stable@vger.kernel.org> # 4.18+: 3da97d4db8ee cpufreq: intel_pstate: Rearrange ...
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>


# 3a957176 27-Jul-2020 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Rearrange the storing of new EPP values

Move the locking away from intel_pstate_set_energy_pref_index()
into its only caller and drop the (now redundant) return_pref label
from it.

Also move the "raw" EPP value check into the caller of that function,
so as to do it before acquiring the mutex, and reduce code duplication
related to the "raw" EPP values processing somewhat.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>


# 7aa10312 14-Jul-2020 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Avoid enabling HWP if EPP is not supported

Although there are processors supporting hardware-managed P-states
(HWP) without the energy-performance preference (EPP) feature, they
are not expected to be run with HWP enabled (the BIOS should disable
HWP on those systems). Missing EPP support generally indicates an
incomplete HWP implementation and so it is better to avoid using
HWP on those systems in production.

However, intel_pstate currently enables HWP on such systems, which
is questionable, so prevent it from doing that by making it check
EPP support before enabling HWP and avoid enabling it if EPP is not
supported by the processor at hand.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 23a522e3 15-Jul-2020 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Clean up aperf_mperf_shift description

The kerneldoc description of the aperf_mperf_shift field in
struct global_params is unclear and there is a typo in it, so
simplify it and clean it up.

Reported-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Lee Jones <lee.jones@linaro.org>


# 8f23d1f1 15-Jul-2020 Lee Jones <lee.jones@linaro.org>

cpufreq: intel_pstate: Supply struct attribute description for get_aperf_mperf_shift()

Fixes the following W=1 kernel build warning(s):

drivers/cpufreq/intel_pstate.c:293: warning: Function parameter or member 'get_aperf_mperf_shift' not described in 'pstate_funcs'

Suggested-by: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Remove line break ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 39a188b8 13-Jul-2020 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Fix active mode setting from command line

If intel_pstate starts in the passive mode by default (that happens
when the processor in the system doesn't support HWP), passing
intel_pstate=active in the kernel command line doesn't work, so
fix that.

Fixes: 33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by default without HWP")
Reported-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Doug Smythies <dsmythies@telus.net>


# 3ff79754 09-Jul-2020 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Fix static checker warning for epp variable

Fix warning for:
drivers/cpufreq/intel_pstate.c:731 store_energy_performance_preference()
error: uninitialized symbol 'epp'.

This warning is for a case, when energy_performance_preference attribute
matches pre defined strings. In this case the value of raw epp will not
be used to set EPP bits in MSR_HWP_REQUEST. So initializing with any
value is fine.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# f473bf39 26-Jun-2020 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Allow raw energy performance preference value

Currently using attribute "energy_performance_preference", user space can
write one of the four per-defined preference string. These preference
strings gets mapped to a hard-coded Energy-Performance Preference (EPP) or
Energy-Performance Bias (EPB) knob.

These four values are supposed to cover broad spectrum of use cases, but
are not uniformly distributed in the range. There are number of cases,
where this is not enough. For example:

Suppose user wants more performance when connected to AC. Instead of using
default "balance performance", the "performance" setting can be used. This
changes EPP value from 0x80 to 0x00. But setting EPP to 0, results in
electrical and thermal issues on some platforms. This results in
aggressive throttling, which causes a drop in performance. But some value
between 0x80 and 0x00 results in better performance. But that value can't
be fixed as the power curve is not linear. In some cases just changing EPP
from 0x80 to 0x75 is enough to get significant performance gain.

Similarly on battery the default "balance_performance" mode can be
aggressive in power consumption. But picking up the next choice
"balance power" results in too much loss of performance, which results in
bad user experience in use cases like "Google Hangout". It was observed
that some value between these two EPP is optimal.

This change allows fine grain EPP tuning for platform like Chromebook or
for users who wants to fine tune power and performance.
Here based on the product and use cases, different EPP values can be set.
This change is similar to the change done for:
/sys/devices/system/cpu/cpu*/power/energy_perf_bias
where user has choice to write a predefined string or raw value.

The change itself is trivial. When user preference doesn't match
predefined string preferences and value is an unsigned integer and in
range, use that value for EPP. When the EPP feature is not present
writing raw value is not supported.

Suggested-by: Len Brown <lenb@kernel.org>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# ed7bde7a 26-Jun-2020 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Allow enable/disable energy efficiency

By default intel_pstate the driver disables energy efficiency by setting
MSR_IA32_POWER_CTL bit 19 for Kaby Lake desktop CPU model in HWP mode.
This CPU model is also shared by Coffee Lake desktop CPUs. This allows
these systems to reach maximum possible frequency. But this adds power
penalty, which some customers don't want. They want some way to enable/
disable dynamically.

So, add an additional attribute "energy_efficiency" under
/sys/devices/system/cpu/intel_pstate/ for these CPU models. This allows
to read and write bit 19 ("Disable Energy Efficiency Optimization") in
the MSR IA32_POWER_CTL.

This attribute is present in both HWP and non-HWP mode as this has an
effect in both modes. Refer to Intel Software Developer's manual for
details.

The scope of this bit is package wide. Also these systems are single
package systems. So read/write MSR on the current CPU is enough.

The energy efficiency (EE) bit setting needs to be preserved during
suspend/resume and CPU offline/online operation. To do this:
- Restoring the EE setting from the cpufreq resume() callback, if there
is change from the system default.
- By default, don't disable EE from cpufreq init() callback for matching
CPU models. Since the scope is package wide and is a single package
system, move the disable EE calls from init() callback to
intel_pstate_init() function, which is called only once.

Suggested-by: Len Brown <lenb@kernel.org>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 589bab6b 12-Jun-2020 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Add one more OOB control bit

Add one more bit for OOB (Out Of Band) enabling of P-states.

If OOB handling of P-states is enabled, intel_pstate shouldn't load.
Currently, only "BIT(8) == 1" of the MSR MSR_MISC_PWR_MGMT is
considered as OOB, but "BIT(18) == 1" needs to be taken into
consideration as OOB condition too.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Add an empty code line, edit subject and changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 8c539776 10-Apr-2020 Chris Wilson <chris@chris-wilson.co.uk>

cpufreq: intel_pstate: Only mention the BIOS disabling turbo mode once

Make a note of the first time we discover the turbo mode has been
disabled by the BIOS, as otherwise we complain every time we try to
update the mode.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 33aa46f2 25-Mar-2020 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Use passive mode by default without HWP

After recent changes allowing scale-invariant utilization to be
used on x86, the schedutil governor on top of intel_pstate in the
passive mode should be on par with (or better than) the active mode
"powersave" algorithm of intel_pstate on systems in which
hardware-managed P-states (HWP) are not used, so it should not be
necessary to use the internal scaling algorithm in those cases.

Accordingly, modify intel_pstate to start in the passive mode by
default if the processor at hand does not support HWP of if the driver
is requested to avoid using HWP through the kernel command line.

Among other things, that will allow utilization clamps and the
support for RT/DL tasks in the schedutil governor to be utilized on
systems in which intel_pstate is used.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 5ac54113 25-Mar-2020 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Simplify intel_pstate_cpu_init()

The initial policy value set by intel_pstate_cpu_init() depends on
whether or not CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is set, but
that is not necessary, because the core will set the policy to
"performance" in cpufreq_init_policy() if the default governor is
"performance" anyway.

Accordingly, change intel_pstate_cpu_init() to always set policy
to CPUFREQ_POLICY_POWERSAVE initially to provide a valid fallback
value to cpufreq_init_policy() in case the default cpufreq governor
is neither "powersave" nor "performance".

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# d9782807 25-Mar-2020 Thomas Gleixner <tglx@linutronix.de>

cpufreq/intel_pstate: Fix wrong macro conversion

The feature flag hwp_support_ids are supposed to match on is
X86_FEATURE_HWP, not X86_FEATURE_APERFMPERF. Fix it.

[ bp: Write commit message. ]

Fixes: b11d77fa300d ("cpufreq: Convert to new X86 CPU match macros")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200324060124.GC11705@shao2-debian


# b11d77fa 24-Mar-2020 Thomas Gleixner <tglx@linutronix.de>

cpufreq: Convert to new X86 CPU match macros

The new macro set has a consistent namespace and uses C99 initializers
instead of the grufty C89 ones.

Get rid the of most local macro wrappers for consistency. The ones which
make sense for readability are renamed to X86_MATCH*.

In the centrino driver this also removes the two extra duplicates of family
6 model 13 which have no value at all.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/87eetheu88.fsf@nanos.tec.linutronix.de


# d5a2a6bb 05-Mar-2020 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Consolidate policy verification

There is still some code duplication between intel_pstate_verify_policy()
and intel_cpufreq_verify_policy(), so avoid it by moving the common
code into a separate function and calling it from both these places.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 918229cd 22-Jan-2020 Giovanni Gherdovich <ggherdovich@suse.cz>

x86/intel_pstate: Handle runtime turbo disablement/enablement in frequency invariance

On some platforms such as the Dell XPS 13 laptop the firmware disables turbo
when the machine is disconnected from AC, and viceversa it enables it again
when it's reconnected. In these cases a _PPC ACPI notification is issued.

The scheduler needs to know freq_max for frequency-invariant calculations.
To account for turbo availability to come and go, record freq_max at boot as
if turbo was available and store it in a helper variable. Use a setter
function to swap between freq_base and freq_max every time turbo goes off or on.

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200122151617.531-7-ggherdovich@suse.cz


# 1e4f63ae 26-Jan-2020 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: Avoid creating excessively large stack frames

In the process of modifying a cpufreq policy, the cpufreq core makes
a copy of it including all of the internals which is stored on the
CPU stack. Because struct cpufreq_policy is relatively large, this
may cause the size of the stack frame to exceed the 2 KB limit and
so the GCC complains when -Wframe-larger-than= is used.

In fact, it is not necessary to copy the entire policy structure
in order to modify it, however.

First, because cpufreq_set_policy() obtains the min and max policy
limits from frequency QoS now, it is not necessary to pass the limits
to it from the callers. The only things that need to be passed to it
from there are the new governor pointer or (if there is a built-in
governor in the driver) the "policy" value representing the governor
choice. They both can be passed as individual arguments, though, so
make cpufreq_set_policy() take them this way and rework its callers
accordingly. This avoids making copies of cpufreq policies in the
callers of cpufreq_set_policy().

Second, cpufreq_set_policy() still needs to pass the new policy
data to the ->verify() callback of the cpufreq driver whose task
is to sanitize the min and max policy limits. It still does not
need to make a full copy of struct cpufreq_policy for this purpose,
but it needs to pass a few items from it to the driver in case they
are needed (different drivers have different needs in that respect
and all of them have to be covered). For this reason, introduce
struct cpufreq_policy_data to hold copies of the members of
struct cpufreq_policy used by the existing ->verify() driver
callbacks and pass a pointer to a temporary structure of that
type to ->verify() (instead of passing a pointer to full struct
cpufreq_policy to it).

While at it, notice that intel_pstate and longrun don't really need
to verify the "policy" value in struct cpufreq_policy, so drop those
check from them to avoid copying "policy" into struct
cpufreq_policy_data (which allows it to be slightly smaller).

Also while at it fix up white space in a couple of places and make
cpufreq_set_policy() static (as it can be so).

Fixes: 3000ce3c52f8 ("cpufreq: Use per-policy frequency QoS")
Link: https://lore.kernel.org/linux-pm/CAMuHMdX6-jb1W8uC2_237m8ctCpsnGp=JCxqt8pCWVqNXHmkVg@mail.gmail.com
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: 5.4+ <stable@vger.kernel.org> # 5.4+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 731e6b97 13-Jan-2020 Harry Pan <harry.pan@intel.com>

cpufreq: intel_pstate: fix spelling mistake: "Whethet" -> "Whether"

Fix a spelling typo in the comment, no function change.

Signed-off-by: Harry Pan <harry.pan@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# c31432fa 31-Oct-2019 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Fix invalid EPB setting

The max value of EPB can only be 0x0F. Attempting to set more than that
triggers an "unchecked MSR access error" warning which happens in
intel_pstate_hwp_force_min_perf() called via cpufreq stop_cpu().

However, it is not even necessary to touch the EPB from intel_pstate,
because it is restored on every CPU online by the intel_epb.c code,
so let that code do the right thing and drop the redundant (and
incorrect) EPB update from intel_pstate.

Fixes: af3b7379e2d70 ("cpufreq: intel_pstate: Force HWP min perf before offline")
Reported-by: Qian Cai <cai@lca.pw>
Cc: 5.2+ <stable@vger.kernel.org> # 5.2+
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 8d2eecea 04-Nov-2019 Jamal Shareef <jamal.k.shareef@gmail.com>

cpufreq: intel_pstate: Fix plain int as pointer warning from sparse

Fix sparse warning: Using plain integer as NULL pointer.

Replace assignment of 0 to pointers with NULL assignment.

Signed-off-by: Jamal Shareef <jamal.k.shareef@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 3000ce3c 15-Oct-2019 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: Use per-policy frequency QoS

Replace the CPU device PM QoS used for the management of min and max
frequency constraints in cpufreq (and its users) with per-policy
frequency QoS to avoid problems with cpufreq policies covering
more then one CPU.

Namely, a cpufreq driver is registered with the subsys interface
which calls cpufreq_add_dev() for each CPU, starting from CPU0, so
currently the PM QoS notifiers are added to the first CPU in the
policy (i.e. CPU0 in the majority of cases).

In turn, when the cpufreq driver is unregistered, the subsys interface
doing that calls cpufreq_remove_dev() for each CPU, starting from CPU0,
and the PM QoS notifiers are only removed when cpufreq_remove_dev() is
called for the last CPU in the policy, say CPUx, which as a rule is
not CPU0 if the policy covers more than one CPU. Then, the PM QoS
notifiers cannot be removed, because CPUx does not have them, and
they are still there in the device PM QoS notifiers list of CPU0,
which prevents new PM QoS notifiers from being registered for CPU0
on the next attempt to register the cpufreq driver.

The same issue occurs when the first CPU in the policy goes offline
before unregistering the driver.

After this change it does not matter which CPU is the policy CPU at
the driver registration time and whether or not it is online all the
time, because the frequency QoS is per policy and not per CPU.

Fixes: 67d874c3b2c6 ("cpufreq: Register notifiers with the PM QoS framework")
Reported-by: Dmitry Osipenko <digetx@gmail.com>
Tested-by: Dmitry Osipenko <digetx@gmail.com>
Reported-by: Sudeep Holla <sudeep.holla@arm.com>
Tested-by: Sudeep Holla <sudeep.holla@arm.com>
Diagnosed-by: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lore.kernel.org/linux-pm/5ad2624194baa2f53acc1f1e627eb7684c577a19.1562210705.git.viresh.kumar@linaro.org/T/#md2d89e95906b8c91c15f582146173dce2e86e99f
Link: https://lore.kernel.org/linux-pm/20191017094612.6tbkwoq4harsjcqv@vireshk-i7/T/#m30d48cc23b9a80467fbaa16e30f90b3828a5a29b
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 5ebb34ed 27-Aug-2019 Peter Zijlstra <peterz@infradead.org>

x86/intel: Aggregate microserver naming

Currently big microservers have _XEON_D while small microservers have
_X, Make it uniformly: _D.

for i in `git grep -l "\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*_\(X\|XEON_D\)"`
do
sed -i -e 's/\(\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*ATOM.*\)_X/\1_D/g' \
-e 's/\(\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*\)_XEON_D/\1_D/g' ${i}
done

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: x86@kernel.org
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/20190827195122.677152989@infradead.org


# 5e741407 27-Aug-2019 Peter Zijlstra <peterz@infradead.org>

x86/intel: Aggregate big core graphics naming

Currently big core clients with extra graphics on have:

- _G
- _GT3E

Make it uniformly: _G

for i in `git grep -l "\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*_GT3E"`
do
sed -i -e 's/\(\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*\)_GT3E/\1_G/g' ${i}
done

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: x86@kernel.org
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/20190827195122.622802314@infradead.org


# af239c44 27-Aug-2019 Peter Zijlstra <peterz@infradead.org>

x86/intel: Aggregate big core mobile naming

Currently big core mobile chips have either:

- _L
- _ULT
- _MOBILE

Make it uniformly: _L.

for i in `git grep -l "\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*_\(MOBILE\|ULT\)"`
do
sed -i -e 's/\(\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*\)_\(MOBILE\|ULT\)/\1_L/g' ${i}
done

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: x86@kernel.org
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190827195122.568978530@infradead.org


# c66f78a6 27-Aug-2019 Peter Zijlstra <peterz@infradead.org>

x86/intel: Aggregate big core client naming

Currently the big core client models either have:

- no OPTDIFF
- _CORE
- _DESKTOP

Make it uniformly: 'no OPTDIFF'.

for i in `git grep -l "\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*_\(CORE\|DESKTOP\)"`
do
sed -i -e 's/\(\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*\)_\(CORE\|DESKTOP\)/\1/g' ${i}
done

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: x86@kernel.org
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190827195122.513945586@infradead.org


# da5c504c 08-Aug-2019 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: intel_pstate: Implement QoS supported freq constraints

Intel pstate driver exposes min_perf_pct and max_perf_pct sysfs files,
which can be used to force a limit on the min/max P state of the driver.
Though these files eventually control the min/max frequencies that the
CPUs will run at, they don't make a change to policy->min/max values.

When the values of these files are changed (in passive mode of the
driver), it leads to calling ->limits() callback of the cpufreq
governors, like schedutil. On a call to it the governors shall
forcefully update the frequency to come within the limits. Since the
limits, i.e. policy->min/max, aren't updated by the driver, the
governors fails to get the target freq within limit and sometimes aborts
the update believing that the frequency is already set to the target
value.

This patch implements the QoS supported frequency constraints to update
policy->min/max values whenever min_perf_pct or max_perf_pct files are
updated. This is only done for the passive mode as of now, as the driver
is already working fine in active mode.

Fixes: ecd288429126 ("cpufreq: schedutil: Don't set next_freq to UINT_MAX")
Reported-by: Doug Smythies <dsmythies@telus.net>
Tested-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# c57b25bd 04-Jul-2019 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: intel_pstate: Reuse refresh_frequency_limits()

The implementation of intel_pstate_update_max_freq() is quite similar to
refresh_frequency_limits(), lets reuse it.

Finding minimum of policy->user_policy.max and policy->cpuinfo.max_freq
in intel_pstate_update_max_freq() is redundant as cpufreq_set_policy()
will call the ->verify() callback of intel-pstate driver, which will do
this comparison anyway and so dropping it from
intel_pstate_update_max_freq() doesn't harm.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# b886d83c 01-Jun-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation version 2 of the license

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-only

has been chosen to replace the boilerplate/reference in 315 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190115.503150771@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 108ec36b 29-Mar-2019 Borislav Petkov <bp@suse.de>

drivers/cpufreq: Convert some slow-path static_cpu_has() callers to boot_cpu_has()

Using static_cpu_has() is pointless on those paths, convert them to the
boot_cpu_has() variant.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 9083e498 25-Mar-2019 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Update max frequency on global turbo changes

While the cpuinfo.max_freq value doesn't really matter for
intel_pstate in the active mode, in the passive mode it is used by
governors as the maximum physical frequency of the CPU and the
results of governor computations generally depend on it. Also it
is made available to user space via sysfs and it should match the
current HW configuration.

For this reason, make intel_pstate update cpuinfo.max_freq for all
CPUs if it detects a global change of turbo frequency settings from
"disable" to "enable" or the other way associated with a _PPC change
notification from the platform firmware.

Note that policy_is_inactive(), cpufreq_cpu_acquire(),
cpufreq_cpu_release(), and cpufreq_set_policy() need to be made
available to it for this purpose.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=200759
Reported-by: Gabriele Mazzotta <gabriele.mzt@gmail.com>
Tested-by: Gabriele Mazzotta <gabriele.mzt@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 5a25e3f7 25-Mar-2019 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Driver-specific handling of _PPC updates

In some cases, the platform firmware disables or enables turbo
frequencies for all CPUs globally before triggering a _PPC change
notification for one of them. Obviously, that global change affects
all CPUs, not just the notified one, and it needs to be acted upon by
cpufreq.

The intel_pstate driver is able to detect such global changes of
the settings, but it also needs to update policy limits for all
CPUs if that happens, in particular if turbo frequencies are
enabled globally - to allow them to be used.

For this reason, introduce a new cpufreq driver callback to be
invoked on _PPC notifications, if present, instead of simply
calling cpufreq_update_policy() for the notified CPU and make
intel_pstate use it to trigger policy updates for all CPUs
in the system if global settings change.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=200759
Reported-by: Gabriele Mazzotta <gabriele.mzt@gmail.com>
Tested-by: Gabriele Mazzotta <gabriele.mzt@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 4ab52646 01-Apr-2019 Borislav Petkov <bp@suse.de>

cpufreq/intel_pstate: Load only on Intel hardware

This driver is Intel-only so loading on anything which is not Intel is
pointless. Prevent it from doing so.

While at it, correct the "not supported" print statement to say CPU
"model" which is what that test does.

Fixes: 076b862c7e44 (cpufreq: intel_pstate: Add reasons for failure and debug messages)
Suggested-by: Erwan Velu <e.velu@criteo.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 92a3e426 25-Mar-2019 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Also use CPPC nominal_perf for base_frequency

The ACPI specification states that if the "Guaranteed Performance
Register" is not implemented, the OSPM assumes guaranteed performance
to always be equal to nominal performance.

So for invalid or unimplemented guaranteed performance register, use
nominal performance as guaranteed performance.

This change will fall back to nominal_perf when guranteed_perf is
invalid. If nominal_perf is also invalid or not present, fall back
to the existing implementation, which is to read from HWP Capabilities
MSR.

Fixes: 86d333a8cc7f ("cpufreq: intel_pstate: Add base_frequency attribute")
Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: 4.20+ <stable@vger.kernel.org> # 4.20+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 8e3b4039 10-Mar-2019 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Fix up iowait_boost computation

After commit b8bd1581aa61 ("cpufreq: intel_pstate: Rework iowait
boosting to be less aggressive") the handling of the case when
the SCHED_CPUFREQ_IOWAIT flag is set again after a few iterations of
intel_pstate_update_util() is a bit inconsistent, because the
new value of cpu->iowait_boost may be lower than ONE_EIGHTH_FP
if it was set before, but has not dropped down to zero just yet.

Fix that up by ensuring that the new value of cpu->iowait_boost
will always be at least ONE_EIGHTH_FP then.

Fixes: b8bd1581aa61 ("cpufreq: intel_pstate: Rework iowait boosting to be less aggressive")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# b8bd1581 06-Feb-2019 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Rework iowait boosting to be less aggressive

The current iowait boosting mechanism in intel_pstate_update_util()
is quite aggressive, as it goes to the maximum P-state right away,
and may cause excessive amounts of energy to be used, which is not
desirable and arguably isn't necessary too.

Follow commit a5a0809bc58e ("cpufreq: schedutil: Make iowait boost
more energy efficient") that reworked the analogous iowait boost
mechanism in the schedutil governor and make the iowait boosting
in intel_pstate_update_util() work along the same lines.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# a8e1942d 15-Feb-2019 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Eliminate intel_pstate_get_base_pstate()

There is only one caller of intel_pstate_get_base_pstate() and it is
more straightforward to carry out the computation directly in the
caller, so do that and drop intel_pstate_get_base_pstate().

No intentional changes of behavior.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# fa93b51c 15-Feb-2019 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Avoid redundant initialization of local vars

After commit 1a4fe38add8b ("cpufreq: intel_pstate: Remove max/min
fractions to limit performance") the initial value of the pstate local
variable in intel_pstate_max_within_limits() and the initial value of
the max_pstate local variable in intel_pstate_prepare_request() are
both immediately discarded, so initialize both these variables to
their target values upfront.

No intentional changes of behavior.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 076b862c 13-Feb-2019 Erwan Velu <e.velu@criteo.com>

cpufreq: intel_pstate: Add reasons for failure and debug messages

The init code path has several exceptions where the driver can
decide not to load.

As CONFIG_X86_INTEL_PSTATE is generally set to Y, the return code is
not reachable. The initialization code is neither verbose of the
reason why it did choose to prematurely exit, so it is difficult for
a user to determine, on a given platform, why the driver didn't load
properly.

This patch is about reporting to the user the reason/context of why
the driver failed to load. That is a precious hint when debugging
a platform.

Signed-off-by: Erwan Velu <e.velu@criteo.com>
[ rjw: Subject & changelog, minor fixups ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 625c85a6 24-Jan-2019 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: Use struct kobj_attribute instead of struct global_attr

The cpufreq_global_kobject is created using kobject_create_and_add()
helper, which assigns the kobj_type as dynamic_kobj_ktype and show/store
routines are set to kobj_attr_show() and kobj_attr_store().

These routines pass struct kobj_attribute as an argument to the
show/store callbacks. But all the cpufreq files created using the
cpufreq_global_kobject expect the argument to be of type struct
attribute. Things work fine currently as no one accesses the "attr"
argument. We may not see issues even if the argument is used, as struct
kobj_attribute has struct attribute as its first element and so they
will both get same address.

But this is logically incorrect and we should rather use struct
kobj_attribute instead of struct global_attr in the cpufreq core and
drivers and the show/store callbacks should take struct kobj_attribute
as argument instead.

This bug is caught using CFI CLANG builds in android kernel which
catches mismatch in function prototypes for such callbacks.

Reported-by: Donghee Han <dh.han@samsung.com>
Reported-by: Sangkyu Kim <skwith.kim@samsung.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# af3b7379 16-Nov-2018 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Force HWP min perf before offline

Force HWP Request MAX = HWP Request MIN = HWP Capability MIN and EPP to
0xFF. In this way the performance limits on the offlined CPU will not
influence performance limits on its sibling CPU, which is still online.

If the sibling CPU is calling for higher performance, it will impact the
max core performance. Here core performance will follow higher of the
performance requests from each sibling.

Reported-and-tested-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 09659af3 05-Nov-2018 Paul E. McKenney <paulmck@kernel.org>

cpufreq/intel_pstate: Replace synchronize_sched() with synchronize_rcu()

Now that synchronize_rcu() waits for preempt-disable regions of code
as well as RCU read-side critical sections, synchronize_sched() can be
replaced by synchronize_rcu(). This commit therefore makes this change.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: <linux-pm@vger.kernel.org>


# 5906056e 23-Oct-2018 Dominik Brodowski <linux@dominikbrodowski.net>

cpufreq: intel_pstate: Fix compilation for !CONFIG_ACPI

While at it, add a few comments which config options #ifdef
and #else statements refer to.

Fixes: 86d333a8cc7f (cpufreq: intel_pstate: Add base_frequency attribute)
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 86d333a8 15-Oct-2018 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Add base_frequency attribute

Expose base_frequency to user space via cpufreq sysfs when HWP is in
use.

This HWP base frequency is read from the ACPI _CPC object if present,
or from the HWP Capabilities MSR otherwise.

On the majority of the HWP platforms the _CPC object will point to
the HWP Capabilities MSR using the "Functional Fixed Hardware"
address space type. The address space type also can simply be
ACPI_TYPE_INTEGER, however, in which case the platform firmware
can set its value at the initialization time based on the system
constraints.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# f2c4db1b 07-Aug-2018 Peter Zijlstra <peterz@infradead.org>

x86/cpu: Sanitize FAM6_ATOM naming

Going primarily by:

https://en.wikipedia.org/wiki/List_of_Intel_Atom_microprocessors

with additional information gleaned from other related pages; notably:

- Bonnell shrink was called Saltwell
- Moorefield is the Merriefield refresh which makes it Airmont

The general naming scheme is: FAM6_ATOM_UARCH_SOCTYPE

for i in `git grep -l FAM6_ATOM` ; do
sed -i -e 's/ATOM_PINEVIEW/ATOM_BONNELL/g' \
-e 's/ATOM_LINCROFT/ATOM_BONNELL_MID/' \
-e 's/ATOM_PENWELL/ATOM_SALTWELL_MID/g' \
-e 's/ATOM_CLOVERVIEW/ATOM_SALTWELL_TABLET/g' \
-e 's/ATOM_CEDARVIEW/ATOM_SALTWELL/g' \
-e 's/ATOM_SILVERMONT1/ATOM_SILVERMONT/g' \
-e 's/ATOM_SILVERMONT2/ATOM_SILVERMONT_X/g' \
-e 's/ATOM_MERRIFIELD/ATOM_SILVERMONT_MID/g' \
-e 's/ATOM_MOOREFIELD/ATOM_AIRMONT_MID/g' \
-e 's/ATOM_DENVERTON/ATOM_GOLDMONT_X/g' \
-e 's/ATOM_GEMINI_LAKE/ATOM_GOLDMONT_PLUS/g' ${i}
done

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: dave.hansen@linux.intel.com
Cc: len.brown@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# d3264f75 01-Aug-2018 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Ignore turbo active ratio in HWP

When HWP is active turbo active ratio is not used, so we should allow
policy max frequency above turbo activation ratio to be set. When HWP is
not active, then any policy max frequency above turbo activation ratio
can result upto max one-core turbo frequency.

This fix helps better thermal control in turbo region when other methods
like "Running Average Power Limit" is not available to use.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 01e61a42 30-Jul-2018 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Limit the scope of HWP dynamic boost platforms

Dynamic boosting of HWP performance on IO wake showed significant
improvement to IO workloads. This series was intended for Skylake Xeon
platforms only and feature was enabled by default based on CPU model
number.

But some Xeon platforms reused the Skylake desktop CPU model number. This
caused some undesirable side effects to some graphics workloads. Since
they are heavily IO bound, the increase in CPU performance decreased the
power available for GPU to do its computing and hence decrease in graphics
benchmark performance.

For example on a Skylake desktop, GpuTest benchmark showed average FPS
reduction from 529 to 506.

This change makes sure that HWP boost feature is only enabled for Skylake
server platforms by using ACPI FADT preferred PM Profile. If some desktop
users wants to get benefit of boost, they can still enable boost from
intel_pstate sysfs attribute "hwp_dynamic_boost".

Fixes: 41ab43c9c89e (cpufreq: intel_pstate: enable boost for Skylake Xeon)
Link: https://bugs.freedesktop.org/show_bug.cgi?id=107410
Reported-by: Eero Tamminen <eero.t.tamminen@intel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# eea033d0 18-Jul-2018 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Show different max frequency with turbo 3 and HWP

On HWP platforms with Turbo 3.0, the HWP capability max ratio shows the
maximum ratio of that core, which can be different than other cores. If
we show the correct maximum frequency in cpufreq sysfs via
cpuinfo_max_freq and scaling_max_freq then, user can know which cores
can run faster for pinning some high priority tasks.

Currently the max turbo frequency is shown as max frequency, which is
the max of all cores, even if some cores can't reach that frequency
even for single threaded workload.

But it is possible that max ratio in HWP capabilities is set as 0xFF or
some high invalid value (E.g. One KBL NUC). Since the actual performance
can never exceed 1 core turbo frequency from MSR TURBO_RATIO_LIMIT, we
use this as a bound check.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 95d6c085 18-Jul-2018 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Register when ACPI PCCH is present

Currently, intel_pstate doesn't register if _PSS is not present on
HP Proliant systems, because it expects the firmware to take over
CPU performance scaling in that case. However, if ACPI PCCH is
present, the firmware expects the kernel to use it for CPU
performance scaling and the pcc-cpufreq driver is loaded for that.

Unfortunately, the firmware interface used by that driver is not
scalable for fundamental reasons, so pcc-cpufreq is way suboptimal
on systems with more than just a few CPUs. In fact, it is better to
avoid using it at all.

For this reason, modify intel_pstate to look for ACPI PCCH if _PSS
is not present and register if it is there. Also prevent the
pcc-cpufreq driver from trying to initialize itself if intel_pstate
has been registered already.

Fixes: fbbcdc0744da (intel_pstate: skip the driver if ACPI has power mgmt option)
Reported-by: Andreas Herrmann <aherrmann@suse.com>
Reviewed-by: Andreas Herrmann <aherrmann@suse.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: Andreas Herrmann <aherrmann@suse.com>
Cc: 4.16+ <stable@vger.kernel.org> # 4.16+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 1111b783 31-May-2018 Xie Yisheng <xieyisheng1@huawei.com>

cpufreq: intel_pstate: use match_string() helper

match_string() returns the index of an array for a matching string,
which can be used instead of open coded variant.

Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# ff7c9917 18-Jun-2018 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Fix scaling max/min limits with Turbo 3.0

When scaling max/min settings are changed, internally they are converted
to a ratio using the max turbo 1 core turbo frequency. This works fine
when 1 core max is same irrespective of the core. But under Turbo 3.0,
this will not be the case. For example:
Core 0: max turbo pstate: 43 (4.3GHz)
Core 1: max turbo pstate: 45 (4.5GHz)
In this case 1 core turbo ratio will be maximum of all, so it will be
45 (4.5GHz). Suppose scaling max is set to 4GHz (ratio 40) for all cores
,then on core one it will be
= max_state * policy->max / max_freq;
= 43 * (4000000/4500000) = 38 (3.8GHz)
= 38
which is 200MHz less than the desired.
On core2, it will be correctly set to ratio 40 (4GHz). Same holds true
for scaling min frequency limit. So this requires usage of correct turbo
max frequency for core one, which in this case is 4.3GHz. So we need to
adjust per CPU cpu->pstate.turbo_freq using the maximum HWP ratio of that
core.

This change uses the HWP capability of a core to adjust max turbo
frequency. But since Broadwell HWP doesn't use ratios in the HWP
capabilities, we have to use legacy max 1 core turbo ratio. This is not
a problem as the HWP capabilities don't differ among cores in Broadwell.
We need to check for non Broadwell CPU model for applying this change,
though.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: 4.6+ <stable@vger.kernel.org> # 4.6+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# fad953ce 12-Jun-2018 Kees Cook <keescook@chromium.org>

treewide: Use array_size() in vzalloc()

The vzalloc() function has no 2-factor argument form, so multiplication
factors need to be wrapped in array_size(). This patch replaces cases of:

vzalloc(a * b)

with:
vzalloc(array_size(a, b))

as well as handling cases of:

vzalloc(a * b * c)

with:

vzalloc(array3_size(a, b, c))

This does, however, attempt to ignore constant size factors like:

vzalloc(4 * 1024)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
vzalloc(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
vzalloc(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
vzalloc(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
vzalloc(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
vzalloc(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
vzalloc(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
vzalloc(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
vzalloc(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
vzalloc(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
vzalloc(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
vzalloc(
- sizeof(TYPE) * (COUNT_ID)
+ array_size(COUNT_ID, sizeof(TYPE))
, ...)
|
vzalloc(
- sizeof(TYPE) * COUNT_ID
+ array_size(COUNT_ID, sizeof(TYPE))
, ...)
|
vzalloc(
- sizeof(TYPE) * (COUNT_CONST)
+ array_size(COUNT_CONST, sizeof(TYPE))
, ...)
|
vzalloc(
- sizeof(TYPE) * COUNT_CONST
+ array_size(COUNT_CONST, sizeof(TYPE))
, ...)
|
vzalloc(
- sizeof(THING) * (COUNT_ID)
+ array_size(COUNT_ID, sizeof(THING))
, ...)
|
vzalloc(
- sizeof(THING) * COUNT_ID
+ array_size(COUNT_ID, sizeof(THING))
, ...)
|
vzalloc(
- sizeof(THING) * (COUNT_CONST)
+ array_size(COUNT_CONST, sizeof(THING))
, ...)
|
vzalloc(
- sizeof(THING) * COUNT_CONST
+ array_size(COUNT_CONST, sizeof(THING))
, ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

vzalloc(
- SIZE * COUNT
+ array_size(COUNT, SIZE)
, ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
vzalloc(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
vzalloc(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
vzalloc(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
vzalloc(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
vzalloc(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
vzalloc(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
vzalloc(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
vzalloc(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
vzalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
vzalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
vzalloc(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
vzalloc(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
vzalloc(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
vzalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
vzalloc(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
vzalloc(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
vzalloc(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
vzalloc(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
vzalloc(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
vzalloc(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
vzalloc(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
vzalloc(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)

// Any remaining multi-factor products, first at least 3-factor products
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
vzalloc(C1 * C2 * C3, ...)
|
vzalloc(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)

// And then all remaining 2 factors products when they're not all constants.
@@
expression E1, E2;
constant C1, C2;
@@

(
vzalloc(C1 * C2, ...)
|
vzalloc(
- E1 * E2
+ array_size(E1, E2)
, ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>


# 41ab43c9 05-Jun-2018 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: enable boost for Skylake Xeon

Enable HWP boost on Skylake server and workstations.

Reported-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# aaaece3d 05-Jun-2018 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: New sysfs entry to control HWP boost

A new attribute is added to intel_pstate sysfs to enable/disable
HWP dynamic performance boost.

Reported-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 52ccc431 05-Jun-2018 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: HWP boost performance on IO wakeup

This change uses SCHED_CPUFREQ_IOWAIT flag to boost HWP performance.
Since SCHED_CPUFREQ_IOWAIT flag is set frequently, we don't start
boosting steps unless we see two consecutive flags in two ticks. This
avoids boosting due to IO because of regular system activities.

To avoid synchronization issues, the actual processing of the flag is
done on the local CPU callback.

Reported-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# e0efd5be 05-Jun-2018 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Add HWP boost utility and sched util hooks

Added two utility functions to HWP boost up gradually and boost down to
the default cached HWP request values.

Boost up:
Boost up updates HWP request minimum value in steps. This minimum value
can reach upto at HWP request maximum values depends on how frequently,
this boost up function is called. At max, boost up will take three steps
to reach the maximum, depending on the current HWP request levels and HWP
capabilities. For example, if the current settings are:
If P0 (Turbo max) = P1 (Guaranteed max) = min
No boost at all.
If P0 (Turbo max) > P1 (Guaranteed max) = min
Should result in one level boost only for P0.
If P0 (Turbo max) = P1 (Guaranteed max) > min
Should result in two level boost:
(min + p1)/2 and P1.
If P0 (Turbo max) > P1 (Guaranteed max) > min
Should result in three level boost:
(min + p1)/2, P1 and P0.
We don't set any level between P0 and P1 as there is no guarantee that
they will be honored.

Boost down:
After the system is idle for hold time of 3ms, the HWP request is reset
to the default value from HWP init or user modified one via sysfs.

Caching of HWP Request and Capabilities
Store the HWP request value last set using MSR_HWP_REQUEST and read
MSR_HWP_CAPABILITIES. This avoid reading of MSRs in the boost utility
functions.

These boost utility functions calculated limits are based on the latest
HWP request value, which can be modified by setpolicy() callback. So if
user space modifies the minimum perf value, that will be accounted for
every time the boost up is called. There will be case when there can be
contention with the user modified minimum perf, in that case user value
will gain precedence. For example just before HWP_REQUEST MSR is updated
from setpolicy() callback, the boost up function is called via scheduler
tick callback. Here the cached MSR value is already the latest and limits
are updated based on the latest user limits, but on return the MSR write
callback called from setpolicy() callback will update the HWP_REQUEST
value. This will be used till next time the boost up function is called.

In addition add a variable to control HWP dynamic boosting. When HWP
dynamic boost is active then set the HWP specific update util hook. The
contents in the utility hooks will be filled in the subsequent patches.

Reported-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 50e9ffab 14-May-2018 Doug Smythies <doug.smythies@gmail.com>

cpufreq: intel_pstate: allow trace in passive mode

Allow use of the trace_pstate_sample trace function
when the intel_pstate driver is in passive mode.
Since the core_busy and scaled_busy fields are not
used, and it might be desirable to know which path
through the driver was used, either intel_cpufreq_target
or intel_cpufreq_fast_switch, re-task the core_busy
field as a flag indicator.

The user can then use the intel_pstate_tracer.py utility
to summarize and plot the trace.

Note: The core_busy feild still goes by that name
in include/trace/events/power.h and within the
intel_pstate_tracer.py script and csv file headers,
but it is graphed as "performance", and called
core_avg_perf now in the intel_pstate driver.

Sometimes, in passive mode, the driver is not called for
many tens or even hundreds of seconds. The user
needs to understand, and not be confused by, this limitation.

Signed-off-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# b258dfea 30-Mar-2018 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Do not include debugfs.h

The intel_pstate driver doesn't use debugfs any more, so drop
linux/debugfs.h from the list of included headers in it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 70f6bf2a 28-Jan-2018 Chen Yu <yu.c.chen@intel.com>

cpufreq: intel_pstate: Enable HWP during system resume on CPU0

When maxcpus=1 is in the kernel command line, the BP is responsible
for re-enabling the HWP - because currently only the APs invoke
intel_pstate_hwp_enable() during their online process - which might
put the system into unstable state after resume.

Fix this by enabling the HWP explicitly on BP during resume.

Reported-by: Doug Smythies <dsmythies@telus.net>
Suggested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Yu Chen <yu.c.chen@intel.com>
[ rjw: Subject/changelog, minor modifications ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# d8de7a44 10-Jan-2018 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Add Skylake servers support

Currently intel_pstate can function only in HWP mode on Skylake servers.
When HWP feature is not enabled on the processor then acpi-cpufreq is
driver is used.

Based on the power and performance tests using intel_pstate scaling
algorithm the results are comparable. But intel_pstate brings in
additional features:
- Display of turbo frequency range, which many users like to see
- Place limits in the turbo frequency range when platform allows

Since these tests are done only using non PID algorithm introduced in
kernel version 4.14, this patch is not a backport candidate. So each user
has to carefully weigh the benefits before he backports.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# dbd49b85 10-Jan-2018 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Replace bxt_funcs with core_funcs

Since core_funcs and bxt_funcs have same set of callbacks, replace
bxt_funcs with core_funcs.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 5e932321 23-Aug-2017 Toshi Kani <toshi.kani@hpe.com>

intel_pstate: convert to use acpi_match_platform_list()

Convert to use acpi_match_platform_list() for the platform check.
There is no change in functionality.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# b20a3f3d 10-Aug-2017 Sudeep Holla <Sudeep.Holla@arm.com>

cpufreq: remove setting of policy->cpu in policy->cpus during init

policy->cpu is copied into policy->cpus in cpufreq_online() before
calling into cpufreq_driver->init(). So there's no need to set the
same in the individual driver init() functions again.

This patch removes the redundant setting of policy->cpu in policy->cpus
in intel_pstate and cppc drivers.

Reported-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# c587c79f 08-Aug-2017 Doug Smythies <doug.smythies@gmail.com>

cpufreq: intel_pstate: report correct CPU frequencies during trace

The intel_pstate CPU frequency scaling driver has always
calculated CPU frequency incorrectly. Recent changes have
eliminted most of the issues, however the frequency reported
in the trace buffer, if used, is incorrect.

It remains desireable that cpu->pstate.scaling still be a nice
round number for things such as when setting max and min frequencies.
So the proposal is to just fix the reported frequency in the trace data.

Fixes what remains of [1].

Link: https://bugzilla.kernel.org/show_bug.cgi?id=96521 # [1]
Signed-off-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# d77d4888 09-Aug-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Shorten a couple of long names

The names of the INTEL_PSTATE_DEFAULT_SAMPLING_INTERVAL symbol and
the get_target_pstate_use_cpu_load() function don't need to be so
long any more, so make them shorter.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# a891283e 09-Aug-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Simplify intel_pstate_adjust_pstate()

Since there is only one P-state selection routine in intel_pstate
now, make intel_pstate_adjust_pstate() call it directly and drop
the target_pstate argument from that function.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 7bde2d50 03-Aug-2017 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Improve IO performance with per-core P-states

In the current implementation, the response latency between seeing
SCHED_CPUFREQ_IOWAIT set and the actual P-state adjustment can be up
to 10ms. It can be reduced by bumping up the P-state to the max at
the time SCHED_CPUFREQ_IOWAIT is passed to intel_pstate_update_util().
With this change, the IO performance improves significantly.

For a simple "grep -r . linux" (Here linux is the kernel source
folder) with caches dropped every time on a Broadwell Xeon workstation
with per-core P-states, the user and system time is shorter by as much
as 30% - 40%.

The same performance difference was not observed on clients that don't
support per-core P-state.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 674e7541 27-Jul-2017 Viresh Kumar <viresh.kumar@linaro.org>

sched: cpufreq: Allow remote cpufreq callbacks

With Android UI and benchmarks the latency of cpufreq response to
certain scheduling events can become very critical. Currently, callbacks
into cpufreq governors are only made from the scheduler if the target
CPU of the event is the same as the current CPU. This means there are
certain situations where a target CPU may not run the cpufreq governor
for some time.

One testcase to show this behavior is where a task starts running on
CPU0, then a new task is also spawned on CPU0 by a task on CPU1. If the
system is configured such that the new tasks should receive maximum
demand initially, this should result in CPU0 increasing frequency
immediately. But because of the above mentioned limitation though, this
does not occur.

This patch updates the scheduler core to call the cpufreq callbacks for
remote CPUs as well.

The schedutil, ondemand and conservative governors are updated to
process cpufreq utilization update hooks called for remote CPUs where
the remote CPU is managed by the cpufreq policy of the local CPU.

The intel_pstate driver is updated to always reject remote callbacks.

This is tested with couple of usecases (Android: hackbench, recentfling,
galleryfling, vellamo, Ubuntu: hackbench) on ARM hikey board (64 bit
octa-core, single policy). Only galleryfling showed minor improvements,
while others didn't had much deviation.

The reason being that this patch only targets a corner case, where
following are required to be true to improve performance and that
doesn't happen too often with these tests:

- Task is migrated to another CPU.
- The task has high demand, and should take the target CPU to higher
OPPs.
- And the target CPU doesn't call into the cpufreq governor until the
next tick.

Based on initial work from Steve Muckle.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Saravana Kannan <skannan@codeaurora.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# f5c13f44 31-Jul-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Drop INTEL_PSTATE_HWP_SAMPLING_INTERVAL

After commit 62611cb912f7 (intel_pstate: delete scheduler hook in HWP
mode) the INTEL_PSTATE_HWP_SAMPLING_INTERVAL is not used anywhere in
the code, so drop it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 22baebd4 25-Jul-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Drop ->get from intel_pstate structure

The ->get callback in the intel_pstate structure was mostly there
for the scaling_cur_freq sysfs attribute to work, but after commit
f8475cef9008 (x86: use common aperfmperf_khz_on_cpu() to calculate
KHz using APERF/MPERF) that attribute uses arch_freq_get_on_cpu()
provided by the x86 arch code on all processors supported by
intel_pstate, so it doesn't need the ->get callback from the
driver any more.

Moreover, the very presence of the ->get callback in the intel_pstate
structure causes the cpuinfo_cur_freq attribute to be present when
intel_pstate operates in the active mode, which is bogus, because
the role of that attribute is to return the current CPU frequency
as seen by the hardware. For intel_pstate, though, this is just an
average frequency and not really current, but computed for the
previous sampling interval (the actual current frequency may be
way different at the point this value is obtained by reading from
cpuinfo_cur_freq), and after commit 82b4e03e01bc (intel_pstate: skip
scheduler hook when in "performance" mode) the value in
cpuinfo_cur_freq may be stale or just 0, depending on the driver's
operation mode. In fact, however, on the hardware supported by
intel_pstate there is no way to read the current CPU frequency
from it, so the cpuinfo_cur_freq attribute should not be present
at all when this driver is in use.

For this reason, drop intel_pstate_get() and clear the ->get
callback pointer pointing to it, so that the cpuinfo_cur_freq is
not present for intel_pstate in the active mode any more.

Fixes: 82b4e03e01bc (intel_pstate: skip scheduler hook when in "performance" mode)
Reported-by: Huaisheng Ye <yehs1@lenovo.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# c4f3f70c 24-Jul-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Drop ->update_util from pstate_funcs

All systems use the same P-state selection "powersave" algorithm
in the active mode if HWP is not used, so there's no need to provide
a pointer for it in struct pstate_funcs any more.

Drop ->update_util from struct pstate_funcs and make
intel_pstate_set_update_util_hook() use intel_pstate_update_util()
directly.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 9d0ef7af 24-Jul-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Do not use PID-based P-state selection

All systems with a defined ACPI preferred profile that are not
"servers" have been using the load-based P-state selection algorithm
in intel_pstate since 4.12-rc1 (mobile systems and laptops have been
using it since 4.10-rc1) and no problems with it have been reported
to date. In particular, no regressions with respect to the PID-based
P-state selection have been reported. Also testing indicates that
the P-state selection algorithm based on CPU load is generally on par
with the PID-based algorithm performance-wise, and for some workloads
it turns out to be better than the other one, while being more
straightforward and easier to understand at the same time.

Moreover, the PID-based P-state selection algorithm in intel_pstate
is known to be unstable in some situation and generally problematic,
the issues with it are hard to address and it has become a
significant maintenance burden.

For these reasons, make intel_pstate use the "powersave" P-state
selection algorithm based on CPU load in the active mode on all
systems and drop the PID-based P-state selection code along with
all things related to it from the driver. Also update the
documentation accordingly.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# b8b78825 19-Jul-2017 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: Don't set transition_latency for setpolicy drivers

The transition_latency field isn't used for drivers with ->setpolicy()
callback present and there is no point setting it from the drivers.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 6e34e1f2 13-Jul-2017 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Correct the busy calculation for KNL

The busy percent calculated for the Knights Landing (KNL) platform
is 1024 times smaller than the correct busy value. This causes
performance to get stuck at the lowest ratio.

The scaling algorithm used for KNL is performance-based, but it still
looks at the CPU load to set the scaled busy factor to 0 when the
load is less than 1 percent. In this case, since the computed load
is 1024x smaller than it should be, the scaled busy factor will
always be 0, irrespective of CPU business.

This needs a fix similar to the turbostat one in commit b2b34dfe4d9a
(tools/power turbostat: KNL workaround for %Busy and Avg_MHz).

For this reason, add one more callback to processor-specific
callbacks to specify an MPERF multiplier represented by a number of
bit positions to shift the value of that register to the left to
copmensate for its rate difference with respect to the TSC. This
shift value is used during CPU busy calculations.

Fixes: ffb810563c (intel_pstate: Avoid getting stuck in high P-states when idle)
Reported-and-tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: 4.6+ <stable@vger.kernel.org> # 4.6+
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# d4436c0d 10-Jul-2017 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Fix ratio setting for min_perf_pct

When the minimum performance limit percentage is set to the power-up
default, it is possible that minimum performance ratio is off by one.

In the set_policy() callback the minimum ratio is calculated by
applying global.min_perf_pct to turbo_ratio and rounding up, but the
power-up default global.min_perf_pct is already rounded up to the
next percent in min_perf_pct_min(). That results in two round up
operations, so for the default min_perf_pct one of them is not
required.

It is better to remove rounding up in min_perf_pct_min() as this
matches the displayed min_perf_pct prior to commit c5a2ee7dde89
(cpufreq: intel_pstate: Active mode P-state limits rework) in 4.12.

For example on a platform with max turbo ratio of 37 and minimum
ratio of 10, the min_perf_pct resulted in 28 with the above commit.
Before this commit it was 27 and it will be the same after this
change.

Fixes: 1a4fe38add8b (cpufreq: intel_pstate: Remove max/min fractions to limit performance)
Reported-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 106c9c77 03-Jul-2017 Arvind Yadav <arvind.yadav.cs@gmail.com>

cpufreq: intel_pstate: constify attribute_group structures

attribute_groups are not supposed to change at runtime. All functions
working with attribute_groups provided by <linux/sysfs.h> work with const
attribute_group. So mark the non-const structs as const.

File size before:
text data bss dec hex filename
15197 2552 40 17789 457d drivers/cpufreq/intel_pstate.o

File size After adding 'const':
text data bss dec hex filename
15261 2488 40 17789 457d drivers/cpufreq/intel_pstate.o

Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# fab24dcc 28-Jun-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Clean up after performance governor changes

After commit 82b4e03e01bc (intel_pstate: skip scheduler hook when in
"performance" mode) get_target_pstate_use_performance() and
get_target_pstate_use_cpu_load() are never called if scaling_governor
is "performance", so drop the CPUFREQ_POLICY_PERFORMANCE checks from
them as they will never trigger anyway.

Moreover, the documentation needs to be updated to reflect the change
made by the above commit, so do that too.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>


# 82b4e03e 23-Jun-2017 Len Brown <len.brown@intel.com>

intel_pstate: skip scheduler hook when in "performance" mode

When the governor is set to "performance", intel_pstate does not
need the scheduler hook for doing any calculations. Under these
conditions, its only purpose is to continue to maintain
cpufreq/scaling_cur_freq.

The cpufreq/scaling_cur_freq sysfs attribute is now provided by
shared x86 cpufreq code on modern x86 systems, including
all systems supported by the intel_pstate driver.

So in "performance" governor mode, the scheduler hook can be skipped.
This applies to both in Software and Hardware P-state control modes.

Suggested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 62611cb9 23-Jun-2017 Len Brown <len.brown@intel.com>

intel_pstate: delete scheduler hook in HWP mode

The cpufreq/scaling_cur_freq sysfs attribute is now provided by
shared x86 cpufreq code on modern x86 systems, including
all systems supported by the intel_pstate driver.

In HWP mode, maintaining that value was the sole purpose of
the scheduler hook, intel_pstate_update_util_hwp(),
so it can now be removed.

Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 1a4fe38a 12-Jun-2017 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Remove max/min fractions to limit performance

In the current model the max/min perf limits are a fraction of current
user space limits to the allowed max_freq or 100% for global limits.
This results in wrong ratio limits calculation because of rounding
issues for some user space limits.

Initially we tried to solve this issue by issue by having more shift
bits to increase precision. Still there are isolated cases where we still
have error.

This can be avoided by using ratios all together. Since the way we get
cpuinfo.max_freq is by multiplying scaling factor to max ratio, we can
easily keep the max/min ratios in terms of ratios and not fractions.

For example:
if the max ratio = 36
cpuinfo.max_freq = 36 * 100000 = 3600000

Suppose user space sets a limit of 1200000, then we can calculate
max ratio limit as
= 36 * 1200000 / 3600000
= 12
This will be correct for any user limits.

The other advantage is that, we don't need to do any calculation in the
fast path as ratio limit is already calculated via set_policy() callback.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 57caf4ec 05-Jun-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Avoid division by 0 in min_perf_pct_min()

Commit c5a2ee7dde89 (cpufreq: intel_pstate: Active mode P-state
limits rework) incorrectly assumed that pstate.turbo_pstate would
always be nonzero for CPU0 in min_perf_pct_min() if
cpufreq_register_driver() had succeeded which may not be the case
in virtualized environments.

If that assumption doesn't hold, it leads to an early crash on boot
in intel_pstate_register_driver(), so add a sanity check to
min_perf_pct_min() to prevent the crash from happening.

Fixes: c5a2ee7dde89 (cpufreq: intel_pstate: Active mode P-state limits rework)
Reported-and-tested-by: Jongman Heo <jongman.heo@samsung.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 3cedbc5a 01-May-2017 Len Brown <len.brown@intel.com>

intel_pstate: use updated msr-index.h HWP.EPP values

intel_pstate exports sysfs attributes for setting and observing HWP.EPP.
These attributes use strings to describe 4 operating states, and
inside the driver, these strings are mapped to numerical register
values.

The authorative mapping between the strings and numerical HWP.EPP values
are now globally defined in msr-index.h, replacing the out-dated
mapping that were open-coded into intel_pstate.c

new old string
--- --- ------
0 0 performance
128 64 balance_performance
192 128 balance_power
255 192 power

Note that the HW and BIOS default value on most system is 128,
which intel_pstate will now call "balance_performance"
while it used to call it "balance_power".

Signed-off-by: Len Brown <len.brown@intel.com>


# 1b72e7fd 10-Apr-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: schedutil: Use policy-dependent transition delays

Make the schedutil governor take the initial (default) value of the
rate_limit_us sysfs attribute from the (new) transition_delay_us
policy parameter (to be set by the scaling driver).

That will allow scaling drivers to make schedutil use smaller default
values of rate_limit_us and reduce the default average time interval
between consecutive frequency changes.

Make intel_pstate set transition_delay_us to 500.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 630e5757 29-Mar-2017 Box, David E <david.e.box@intel.com>

cpufreq: intel_pstate: Add support for Gemini Lake

Use same parameters as INTEL_FAM6_ATOM_GOLDMONT to enable
Gemini Lake.

Signed-off-by: Box, David E <david.e.box@intel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# b02aabe8 27-Mar-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Eliminate intel_pstate_get_min_max()

Some computations in intel_pstate_get_min_max() are not necessary
and one of its two callers doesn't even use the full result.

First off, the fixed-point value of cpu->max_perf represents a
non-negative number between 0 and 1 inclusive and cpu->min_perf
cannot be greater than cpu->max_perf. It is not necessary to check
those conditions every time the numbers in question are used.

Moreover, since intel_pstate_max_within_limits() only needs the
upper boundary, it doesn't make sense to compute the lower one in
there and returning min and max from intel_pstate_get_min_max()
via pointers doesn't look particularly nice.

For the above reasons, drop intel_pstate_get_min_max(), add a helper
to get the base P-state for min/max computations and carry out them
directly in the previous callers of intel_pstate_get_min_max().

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 2bfc4cbb 27-Mar-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Do not walk policy->cpus

intel_pstate_hwp_set() is the only function walking policy->cpus
in intel_pstate. The rest of the code simply assumes one CPU per
policy, including the initialization code.

Therefore it doesn't make sense for intel_pstate_hwp_set() to
walk policy->cpus as it is guaranteed to have only one bit set
for policy->cpu.

For this reason, rearrange intel_pstate_hwp_set() to take the CPU
number as the argument and drop the loop over policy->cpus from it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 8ca6ce37 27-Mar-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Introduce pid_in_use()

Add a new function pid_in_use() to return the information on whether
or not the PID-based P-state selection algorithm is in use.

That allows a couple of complicated conditions in the code to be
reduced to simple checks against the new function's return value.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 2f49afc2 27-Mar-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Drop struct cpu_defaults

The cpu_defaults structure is redundant, because it only contains
one member of type struct pstate_funcs which can be used directly
instead of struct cpu_defaults.

For this reason, drop struct cpu_defaults, use struct pstate_funcs
directly instead of it where applicable and rename all of the
variables of that type accordingly.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# de4a76cb 27-Mar-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Move cpu_defaults definitions

Move the definitions of the cpu_defaults structures after the
definitions of utilization update callback routines to avoid
extra declarations of the latter.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 67dd9bf4 27-Mar-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Add update_util callback to pstate_funcs

Avoid using extra function pointers during P-state selection by
dropping the get_target_pstate member from struct pstate_funcs,
adding a new update_util callback to it (to be registered with
the CPU scheduler as the utilization update callback in the active
mode) and reworking the utilization update callback routines to
invoke specific P-state selection functions directly.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# eabd22c6 27-Mar-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Use different utilization update callbacks

Notice that some overhead in the utilization update callbacks
registered by intel_pstate in the active mode can be avoided if
those callbacks are tailored to specific configurations of the
driver. For example, the utilization update callback for the HWP
enabled case only needs to update the average CPU performance
periodically whereas the utilization update callback for the
PID-based algorithm does not need to take IO-wait boosting into
account and so on.

With that in mind, define three utilization update callbacks for
three different use cases: HWP enabled, the CPU load "powersave"
P-state selection algorithm and the PID-based "powersave" P-state
selection algorithm and modify the driver initialization to
choose the callback matching its current configuration.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 0042b2c0 27-Mar-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Modify check in intel_pstate_update_status()

One of the checks in intel_pstate_update_status() implicitly relies
on the information that there are only two struct cpufreq_driver
objects available, but it is better to do it directly against the
value it really is about (to make the code easier to follow if
nothing else).

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# ee8df89a 27-Mar-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Drop driver_registered variable

The driver_registered variable in intel_pstate is used for checking
whether or not the driver has been registered, but intel_pstate_driver
can be used for that too (with the rule that the driver is not
registered as long as it is NULL).

That is a bit more straightforward and the code may be simplified
a bit this way, so modify the driver accordingly.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 694cb173 27-Mar-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Skip unnecessary PID resets on init

PID controller parameters only need to be initialized if the
get_target_pstate_use_performance() P-state selection routine
is going to be used. It is not necessary to initialize them
otherwise, so don't do that.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 7aec5b50 27-Mar-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Set HWP sampling interval once

In the HWP enabled case pid_params.sample_rate_ns only needs to be
updated once, because it is global, so do that when setting hwp_active
instead of doing it during the initialization of every CPU.

Moreover, pid_params.sample_rate_ms is never used if HWP is enabled,
so do not update it at all then.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# ff35f02e 27-Mar-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Clean up intel_pstate_busy_pid_reset()

intel_pstate_busy_pid_reset() is the only caller of pid_reset(),
pid_p_gain_set(), pid_i_gain_set(), and pid_d_gain_set(). Moreover,
it passes constants as two parameters of pid_reset() and all of
the other routines above essentially contain the same code, so
fold all of them into the caller and drop unnecessary computations.

Introduce percent_fp() for converting integer values in percent
to fixed-point fractions and use it in the above code cleanup.

Finally, rename intel_pstate_busy_pid_reset() to
intel_pstate_pid_reset() as it also is used for the
initialization of PID parameters for every CPU and the
meaning of the "busy" part of the name is not particularly
clear.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 4ddd0146 27-Mar-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Fold intel_pstate_reset_all_pid() into the caller

There is only one caller of intel_pstate_reset_all_pid(), which is
pid_param_set() used in the debugfs interface only, and having that
code split does not make it particularly convenient to follow.

For this reason, move the body of intel_pstate_reset_all_pid() into
its caller and drop that function.

Also change the loop from for_each_online_cpu() (which is obviously
racy with respect to CPU offline/online) to for_each_possible_cpu(),
so that all PID parameters are reset for all CPUs regardless of their
online/offline status (to prevent, for example, a previously offline
CPU from going online with a stale set of PID parameters).

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 5c439053 27-Mar-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Initialize pid_params statically

Notice that both the existing struct cpu_defaults instances in which
PID parameters are actually initialized use the same values of those
parameters, so it is not really necessary to copy them over to
pid_params dynamically.

Instead, initialize pid_params statically with those values and
drop the unused pid_policy member from struct cpu_defaults along
with copy_pid_params() used for initializing it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 64043678 27-Mar-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Drop pointless initialization of PID parameters

The P-state selection algorithm used by intel_pstate for Atom
processors is not based on the PID controller and the initialization
of PID parametrs for those processors is pointless and confusing, so
drop it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# e14cf885 27-Mar-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Eliminate struct perf_limits

After recent changes the purpose of struct perf_limits is not
particularly clear any more and the code may be made somewhat
easier to follow by eliminating it, so go for that.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 80b120ca 22-Mar-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Avoid transient updates of cpuinfo.max_freq

Both intel_pstate_verify_policy() and intel_cpufreq_verify_policy()
set policy->cpuinfo.max_freq depending on the turbo status, but the
updates made by them are discarded by the core, because the policy
object passed to them by the core is temporary and cpuinfo.max_freq
from that object is not copied to the final policy object in
cpufreq_set_policy().

However, cpufreq_set_policy() passes the temporary policy object
to the ->setpolicy callback of the driver, so intel_pstate_set_policy()
actually sees the policy->cpuinfo.max_freq value updated by
intel_pstate_verify_policy() and not the final one. It also
updates policy->max sometimes which basically has no effect after
it returns, because the core discards that update.

To avoid confusion, eliminate policy->cpuinfo.max_freq updates from
intel_pstate_verify_policy() and intel_cpufreq_verify_policy()
entirely and check the maximum frequency explicitly in
intel_pstate_update_perf_limits() instead of relying on the
transiently updated policy->cpuinfo.max_freq value.

Moreover, move the max->policy adjustment carried out in
intel_pstate_set_policy() to a separate function and call that
function from the ->verify driver callbacks to ensure that it will
actually be effective.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# c5a2ee7d 22-Mar-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Active mode P-state limits rework

The coordination of P-state limits used by intel_pstate in the active
mode (ie. by default) is problematic, because it synchronizes all of
the limits (ie. the global ones and the per-policy ones) so as to use
one common pair of P-state limits (min and max) across all CPUs in
the system. The drawbacks of that are as follows:

- If P-states are coordinated in hardware, it is not necessary
to coordinate them in software on top of that, so in that case
all of the above activity is in vain.

- If P-states are not coordinated in hardware, then the processor
is actually capable of setting different P-states for different
CPUs and coordinating them at the software level simply doesn't
allow that capability to be utilized.

- The coordination works in such a way that setting a per-policy
limit (eg. scaling_max_freq) for one CPU causes the common
effective limit to change (and it will affect all of the other
CPUs too), but subsequent reads from the corresponding sysfs
attributes for the other CPUs will return stale values (which
is confusing).

- Reads from the global P-state limit attributes, min_perf_pct and
max_perf_pct, return the effective common values and not the last
values set through these attributes. However, the last values
set through these attributes become hard limits that cannot be
exceeded by writes to scaling_min_freq and scaling_max_freq,
respectively, and they are not exposed, so essentially users
have to remember what they are.

All of that is painful enough to warrant a change of the management
of P-state limits in the active mode.

To that end, redesign the active mode P-state limits management in
intel_pstate in accordance with the following rules:

(1) All CPUs are affected by the global limits (that is, none of
them can be requested to run faster than the global max and
none of them can be requested to run slower than the global
min).

(2) Each individual CPU is affected by its own per-policy limits
(that is, it cannot be requested to run faster than its own
per-policy max and it cannot be requested to run slower than
its own per-policy min).

(3) The global and per-policy limits can be set independently.

Also, the global maximum and minimum P-state limits will be always
expressed as percentages of the maximum supported turbo P-state.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 55395345 22-Mar-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Use load-based P-state selection more widely

Extend the set of systems for which intel_pstate will use the
"powersave" P-state selection algorithm based on CPU load in the
active mode by systems with ACPI preferred profile set to "tablet",
"appliance PC", "desktop", or "workstation" (ie. everything with a
specified preferred profile that is not a "server").

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# eb5139d1 22-Mar-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Support HWP processors in all operation modes

Currently, some processors supporting HWP are only supported by
intel_pstate if HWP is actually going to be used and not supported
otherwise which is confusing.

Specifically, they are not supported if "intel_pstate=no_hwp" is
passed to the kernel in the command line or if the driver is started
in the passive mode ("intel_pstate=passive").

There is no real reason for that, because everything about those
processor is known anyway and the driver can work with them in all
modes, so make that happen, but use the load-based P-state selection
algorithm for the active mode "powersave" policy with them.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 64897b20 21-Mar-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Fix policy data management in passive mode

The policy->cpuinfo.max_freq and policy->max updates in
intel_cpufreq_turbo_update() are excessive as they are done for no
good reason and may lead to problems in principle, so they should be
dropped. However, after dropping them intel_cpufreq_turbo_update()
becomes almost entirely pointless, because the check made by it is
made again down the road in intel_pstate_prepare_request(). The
only thing in it that still needs to be done is the call to
update_turbo_state(), so drop intel_cpufreq_turbo_update() altogether
and make its callers invoke update_turbo_state() directly instead of
it.

In addition to that, fix intel_cpufreq_verify_policy() so that it
checks global.no_turbo in addition to global.turbo_disabled when
updating policy->cpuinfo.max_freq to make it consistent with
intel_pstate_verify_policy().

Fixes: 001c76f05b01 (cpufreq: intel_pstate: Generic governors support)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 7de32556 17-Mar-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: One set of global limits in active mode

In the active mode intel_pstate currently uses two sets of global
limits, each associated with one of the possible scaling_governor
settings in that mode: "powersave" or "performance".

The driver switches over from one of those sets to the other
depending on the scaling_governor setting for the last CPU whose
per-policy cpufreq interface in sysfs was last used to change
parameters exposed in there. That obviously leads to no end of
issues when the scaling_governor settings differ between CPUs.

The most recent issue was introduced by commit a240c4aa5d0f (cpufreq:
intel_pstate: Do not reinit performance limits in ->setpolicy)
that eliminated the reinitialization of "performance" limits in
intel_pstate_set_policy() preventing the max limit from being set
to anything below 100, among other things.

Namely, an undesirable side effect of commit a240c4aa5d0f is that
now, after setting scaling_governor to "performance" in the active
mode, the per-policy limits for the CPU in question go to the highest
level and stay there even when it is switched back to "powersave"
later.

As it turns out, some distributions set scaling_governor to
"performance" temporarily for all CPUs to speed-up system
initialization, so that change causes them to misbehave later.

To fix that, get rid of the performance/powersave global limits
split and use just one set of global limits for everything.

From the user's persepctive, after this modification, when
scaling_governor is switched from "performance" to "powersave"
or the other way around on one CPU, the limits settings (ie. the
global max/min_perf_pct and per-policy scaling_max/min_freq for
any CPUs) will not change. Still, switching from "performance"
to "powersave" or the other way around changes the way in which
P-states are selected and in particular "performance" causes the
driver to always request the highest P-state it is allowed to ask
for for the given CPU.

Fixes: a240c4aa5d0f (cpufreq: intel_pstate: Do not reinit performance limits in ->setpolicy)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# e4c204ce 14-Mar-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Avoid percentages in limits-related computations

Currently, intel_pstate_update_perf_limits() first converts the
policy minimum and maximum limits into percentages of the maximum
turbo frequency (rounding up to an integer) and then converts these
percentages to fractions (by using fixed-point arithmetic to divide
them by 100).

That introduces a rounding error unnecessarily, because the fractions
can be obtained by carrying out fixed-point divisions directly on the
input numbers.

Rework the computations in intel_pstate_hwp_set() to use fractions
instead of percentages (and drop redundant local variables from
there) and modify intel_pstate_update_perf_limits() to compute the
fractions directly and percentages out of them.

While at it, introduce percent_ext_fp() for converting percentages
to fractions (with extended number of fraction bits) and use it in
the computations.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 3f8ed54a 13-Mar-2017 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Correct frequency setting in the HWP mode

In the functions intel_pstate_hwp_set(), min/max range from HWP capability
MSR along with max_perf_pct and min_perf_pct, is used to set the HWP
request MSR. In some cases this doesn't result in the correct HWP max/min
in HWP request.

For example: In the following case:

HWP capabilities from MSR 0x771
0x70a1220

Here cpufreq min/max frequencies from above MSR dump are 700MHz and 3.2GHz
respectively.

This will result in
hwp_min = 0x07
hwp_max = 0x20

To limit max frequency to 2GHz:

perf_limits->max_perf_pct = 63 (2GHz as a percent of 3.2GHz rounded up)

With the current calculation:
adj_range = max_perf_pct * range / 100;
adj_range = 63 * (32 - 7) / 100
adj_range = 15

max = hw_min + adj_range;
max = 7 + 15 = 22

This will result in HWP request of 0x160f, which will result in a
frequency cap of 2.2GHz not 2GHz.

The problem with the above calculation is that hwp_min of 7 is treated
as 0% in the range. But max_perf_pct is calculated with respect to minimum
as 0 and max as 3.2GHz or hwp_max, so adding hwp_min to it will result in
more than the desired.

Since the min_perf_pct and max_perf_pct is already a percent of max
frequency or hwp_max, this min/max HWP request value can be calculated
directly applying these percentage to hwp_max.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 6e7408ac 12-Mar-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Update pid_params.sample_rate_ns in pid_param_set()

Fix the debugfs interface for PID tuning to actually update
pid_params.sample_rate_ns on PID parameters updates, as changing
pid_params.sample_rate_ms via debugfs has no effect now.

Fixes: a4675fbc4a7a (cpufreq: intel_pstate: Replace timers with utilization update callbacks)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 5f98ced1 09-Mar-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Drop redundant wrapper function

intel_pstate_hwp_set_policy() is a wrapper around
intel_pstate_hwp_set(), but the only value it adds is to check
hwp_active before calling the latter and one of its two callers
has already checked hwp_active before that happens, so in that
code path the additional check is redundant and using the wrapper
is rather pointless.

For this reason, drop intel_pstate_hwp_set_policy() and make its
callers invoke intel_pstate_hwp_set() directly (after checking
hwp_active).

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# a240c4aa 02-Mar-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Do not reinit performance limits in ->setpolicy

If the current P-state selection algorithm is set to "performance"
in intel_pstate_set_policy(), the limits may be initialized from
scratch, but only if no_turbo is not set and the maximum frequency
allowed for the given CPU (i.e. the policy object representing it)
is at least equal to the max frequency supported by the CPU. In all
of the other cases, the limits will not be updated.

For example, the following can happen:

# cat intel_pstate/status
active
# echo performance > cpufreq/policy0/scaling_governor
# cat intel_pstate/min_perf_pct
100
# echo 94 > intel_pstate/min_perf_pct
# cat intel_pstate/min_perf_pct
100
# cat cpufreq/policy0/scaling_max_freq
3100000
echo 3000000 > cpufreq/policy0/scaling_max_freq
# cat intel_pstate/min_perf_pct
94
# echo 95 > intel_pstate/min_perf_pct
# cat intel_pstate/min_perf_pct
95

That is confusing for two reasons. First, the initial attempt to
change min_perf_pct to 94 seems to have no effect, even though
setting the global limits should always work. Second, after
changing scaling_max_freq for policy0 the global min_perf_pct
attribute shows 94, even though it should have not been affected
by that operation in principle.

Moreover, the final attempt to change min_perf_pct to 95 worked
as expected, because scaling_max_freq for the only policy with
scaling_governor equal to "performance" was different from the
maximum at that time.

To make all that confusion go away, modify intel_pstate_set_policy()
so that it doesn't reinitialize the limits at all.

At the same time, change intel_pstate_set_performance_limits() to
set min_sysfs_pct to 100 in the "performance" limits set so that
switching the P-state selection algorithm to "performance" causes
intel_pstate/min_perf_pct in sysfs to go to 100 (or whatever value
min_sysfs_pct in the "performance" limits is set to later).

That requires per-CPU limits to be initialized explicitly rather
than by copying the global limits to avoid setting min_sysfs_pct
in the per-CPU limits to 100.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# d74b1992 28-Feb-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Fix intel_pstate_verify_policy()

The code added to intel_pstate_verify_policy() by commit 1443ebbacfd7
(cpufreq: intel_pstate: Fix sysfs limits enforcement for performance
policy) should use perf_limits instead of limits, because otherwise
setting global limits via sysfs may affect policies inconsistently.

For example, in the sequence of shell commands below, the
scaling_min_freq attribute for policy1 and policy2 should be
affected in the same way, because scaling_governor is set in
the same way for both of them:

# cat cpufreq/policy1/scaling_governor
powersave
# cat cpufreq/policy2/scaling_governor
powersave
# echo performance > cpufreq/policy0/scaling_governor
# echo 94 > intel_pstate/min_perf_pct
# cat cpufreq/policy0/scaling_min_freq
2914000
# cat cpufreq/policy1/scaling_min_freq
2914000
# cat cpufreq/policy2/scaling_min_freq
800000

The are affected differently, because intel_pstate_verify_policy()
is invoked with limits set to &performance_limits (left behind by
policy0) for policy1 and with limits set to &powersave_limits (left
behind by policy1) for policy2. Since perf_limits is set to the
set of limits matching the policy being updated, using it instead
of limits fixes the inconsistency.

Fixes: 1443ebbacfd7 (cpufreq: intel_pstate: Fix sysfs limits enforcement for performance policy)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# cd59b4be 28-Feb-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Fix global settings in active mode

Commit 111b8b3fe4fa (cpufreq: intel_pstate: Always keep all
limits settings in sync) changed intel_pstate to invoke
cpufreq_update_policy() for every registered CPU on global sysfs
attributes updates, but that led to undesirable effects in the
active mode if the "performance" P-state selection algorithm is
configufred for one CPU and the "powersave" one is chosen for
all of the other CPUs.

Namely, in that case, the following is possible:

# cd /sys/devices/system/cpu/
# cat intel_pstate/max_perf_pct
100
# cat intel_pstate/min_perf_pct
26
# echo performance > cpufreq/policy0/scaling_governor
# cat intel_pstate/max_perf_pct
100
# cat intel_pstate/min_perf_pct
100
# echo 94 > intel_pstate/min_perf_pct
# cat intel_pstate/min_perf_pct
26

The reason why this happens is because intel_pstate attempts to
maintain two sets of global limits in the active mode, one for
the "performance" P-state selection algorithm and one for the
"powersave" P-state selection algorithm, but the P-state selection
algorithms are set per policy, so the global limits cannot reflect
all of them at the same time if they are different for different
policies.

In the particular situation above, the attempt to change
min_perf_pct to 94 caused cpufreq_update_policy() to be run
for a CPU with the "powersave" P-state selection algorithm
and intel_pstate_set_policy() called by it silently switched the
global limits to the "powersave" set which finally was reflected
by the sysfs interface.

To prevent that from happening, modify intel_pstate_update_policies()
to always switch back to the set of limits that was used right before
it has been invoked.

Fixes: 111b8b3fe4fa (cpufreq: intel_pstate: Always keep all limits settings in sync)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 64078299 03-Mar-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Avoid triggering cpu_frequency tracepoint unnecessarily

In the passive mode the cpu_frequency trace event is already
triggered by the cpufreq core or by scaling governors, so
intel_pstate should not trigger it once again for the same
P-state updates.

In addition to that, the frequency returned by
intel_cpufreq_fast_switch() and passed via freqs.new from
intel_cpufreq_target() to cpufreq_freq_transition_end() should
reflect the P-state actually set, so make that happen.

Fixes: 001c76f05b01 (cpufreq: intel_pstate: Generic governors support)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 7f17326f 02-Mar-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Fix intel_cpufreq_verify_policy()

The intel_pstate_update_perf_limits() called from
intel_cpufreq_verify_policy() may cause global P-state limits
to change which is generally confusing and unnecessary.

In the passive mode the global limits are only applied to the
frequency selected by the scaling governor (they are not taken
into account by governors when making decisions anyway), so making
them follow the per-policy limits serves no purpose and may go
against user expectations (as it generally causes the global
attributes in sysfs to change even though they have not been
written to in some cases).

Fix that by dropping the intel_pstate_update_perf_limits()
invocation from intel_cpufreq_verify_policy() (which also
reduces the code size by a few lines).

This change does not affect the per-CPU limits case, because those
limits allow any P-state to be set by default in the passive mode
and it removes the only piece of code updating them in that mode,
so the per-policy settings will be the only ones taken into account
in that case as expected.

Fixes: 001c76f05b01 (cpufreq: intel_pstate: Generic governors support)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 2bc756e7 02-Mar-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Do not use performance_limits in passive mode

Using performance_limits in the passive mode doesn't make
sense, because in that mode the global limits are applied to the
frequency selected by the scaling governor.

The maximum and minimum P-state limits in performance_limits are both
set to 100 percent which will put all CPUs into the turbo range
regardless of what governor is used and what frequencies are
selected by it (that is particularly undesirable on CPUs with the
generic powersave governor attached).

For this reason, make intel_pstate_register_driver() always point
limits to powersave_limits in the passive mode.

Fixes: 001c76f05b01 (cpufreq: intel_pstate: Generic governors support)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 55687da1 08-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/cpufreq.h>

We are going to split <linux/sched/cpufreq.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/cpufreq.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 92134bdb 25-Feb-2017 Len Brown <len.brown@intel.com>

intel_pstate: use MSR_ATOM_RATIOS definitions from msr-index.h

Originally, these MSRs were locally defined in this driver.
Now the definitions are in msr-index.h -- use them.

Signed-off-by: Len Brown <len.brown@intel.com>


# c3a49c89 27-Feb-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Fix limits issue with operation mode switching

There is a problem with intel_pstate operation mode switching
introduced by commit fb1fe1041c04 (cpufreq: intel_pstate: Operation
mode control from sysfs), because the global sysfs limits are
preserved across operation modes while per-policy limits are
reinitialized from scratch on a mode switch and both sets of limits
may get out of sync this way.

Fix that by always reinitializing the global limits upon the
registration of the driver.

Fixes: fb1fe1041c04 (cpufreq: intel_pstate: Operation mode control from sysfs)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>


# 6e978b22 03-Feb-2017 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Disable energy efficiency optimization

Some Kabylake desktop processors may not reach max turbo when running in
HWP mode, even if running under sustained 100% utilization.

This occurs when the HWP.EPP (Energy Performance Preference) is set to
"balance_power" (0x80) -- the default on most systems.

It occurs because the platform BIOS may erroneously enable an
energy-efficiency setting -- MSR_IA32_POWER_CTL BIT-EE, which is not
recommended to be enabled on this SKU.

On the failing systems, this BIOS issue was not discovered when the
desktop motherboard was tested with Windows, because the BIOS also
neglects to provide the ACPI/CPPC table, that Windows requires to enable
HWP, and so Windows runs in legacy P-state mode, where this setting has
no effect.

Linux' intel_pstate driver does not require ACPI/CPPC to enable HWP, and
so it runs in HWP mode, exposing this incorrect BIOS configuration.

There are several ways to address this problem.

First, Linux can also run in legacy P-state mode on this system.
As intel_pstate is how Linux enables HWP, booting with
"intel_pstate=disable"
will run in acpi-cpufreq/ondemand legacy p-state mode.

Or second, the "performance" governor can be used with intel_pstate,
which will modify HWP.EPP to 0.

Or third, starting in 4.10, the
/sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference
attribute in can be updated from "balance_power" to "performance".

Or fourth, apply this patch, which fixes the erroneous setting of
MSR_IA32_POWER_CTL BIT_EE on this model, allowing the default
configuration to function as designed.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: 4.6+ <stable@vger.kernel.org> # 4.6+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 8fc7554a 19-Jan-2017 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Calculate guaranteed performance for HWP

When HWP is active, turbo activation ratio is not used to calculate max
non turbo ratio. But on these systems the max non turbo ratio is decided
by config TDP settings.

This change removes usage of MSR_TURBO_ACTIVATION_RATIO for HWP systems,
instead directly use TDP ratios, when more than one TDPs are available.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 4e5d3f71 18-Jan-2017 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Make HWP limits compatible with legacy

Under HWP the performance limits are calculated using max_perf_pct
and min_perf_pct using possible performance, not available performance.
The available performance can be reduced by no_turbo setting. To make
compatible with legacy mode, use max/min performance percentage with
respect to available performance.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 7d9a8a9f 18-Jan-2017 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Lower frequency than expected under no_turbo

When turbo is not disabled by BIOS, but user disabled from intel P-State
sysfs and changes max/min using cpufreq sysfs, the resultant frequency
is lower than what user requested.

The reason for this, when the perf limits are calculated in set_policy()
callback, they are with reference to max cpu frequency (turbo frequency
), but when enforced in the intel_pstate_get_min_max() they are with
reference to max available performance as documented in the intel_pstate
documentation (in this case max non turbo P-State).

This needs similar change as done in intel_cpufreq_verify_policy() for
passive mode. Set policy->cpuinfo.max_freq based on the turbo status.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# fb1fe104 04-Jan-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Operation mode control from sysfs

Make it possible to change the operation mode of intel_pstate with
the help of a new sysfs attribute called "status".

There are three possible configurations that can be selected using
this attribute:

"off" - The driver is not in use at this time.
"active" - The driver works as a P-state governor (default).
"passive" - The driver works as a regular cpufreq one and collaborates
with the generic cpufreq governors (it sets P-states as
requested by those governors). [This is the same mode
the driver can be started in by passing intel_pstate=passive
in the kernel command line.]

The current setting is returned by reads from this attribute. Writing
one of the above strings to it changes the operation mode as indicated
by that string, if possible.

If HW-managed P-states (HWP) feature is enabled, it is not possible
to change the driver's operation mode and attempts to write to this
attribute will fail.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 0c30b65b 10-Jan-2017 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Expose global sysfs attributes upfront

Expose the intel_pstate's global sysfs attributes before registering
the driver to prepare for the addition of an attribute that also will
have to work if the driver is not registered.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 1443ebba 18-Jan-2017 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Fix sysfs limits enforcement for performance policy

A side effect of keeping intel_pstate sysfs limits in sync with cpufreq
is that the now sysfs limits can't enforced under performance policy.

For example, if the max_perf_pct is changed from 100 to 80, this will call
intel_pstate_set_policy(), which will change the max_perf to 100 again for
performance policy. Same issue happens, when no_turbo is set.

This change calculates max and min frequency using sysfs performance
limits in intel_pstate_verify_policy() and adjusts policy limits by
calling cpufreq_verify_within_limits().

Also, it causes the setting of performance limits to be skipped if
no_turbo is set.

Fixes: 111b8b3fe4fa (cpufreq: intel_pstate: Always keep all limits settings in sync)
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 111b8b3f 30-Dec-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Always keep all limits settings in sync

Make intel_pstate update per-logical-CPU limits when the global
settings are changed to ensure that they are always in sync and
users will not see confusing values in per-logical-CPU sysfs
attributes.

This also fixes the problem that setting the "no_turbo" global
attribute to 1 in the "passive" mode (ie. when intel_pstate acts
as a regular cpufreq driver) when scaling_governor is set to
"performance" has no effect.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>


# cad30467 30-Dec-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Use locking in intel_cpufreq_verify_policy()

Race conditions are possible if intel_cpufreq_verify_policy()
is executed in parallel with global limits updates from sysfs,
so the invocation of intel_pstate_update_perf_limits() in it
should be carried out under intel_pstate_limits_lock.

Make that happen.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>


# aa439248 30-Dec-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Use locking in intel_pstate_resume()

Theoretically, intel_pstate_resume() may be executed in parallel
with intel_pstate_set_policy(), if the latter is invoked via
cpufreq_update_policy() as a result of a notification, so use
intel_pstate_limits_lock in there too to avoid race conditions.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>


# 366430b5 23-Dec-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Do not expose PID parameters in passive mode

If intel_pstate works in the passive mode in which it acts as
a regular cpufreq driver and collaborates with generic cpufreq
governors, the PID parameters are not used, so do not expose
them via debugfs in that case.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 984edbdc 06-Dec-2016 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Support for energy performance hints with HWP

It is possible to provide hints to the HWP algorithms in the processor
to be more performance centric to more energy centric. These hints are
provided by using HWP energy performance preference (EPP) or energy
performance bias (EPB) settings.

The scope of these settings is per logical processor, which means that
each of the logical processors in the package can be programmed with a
different value.

This change provides cpufreq sysfs interface to provide hint. For each
policy, two additional attributes will be available to check and provide
hint. These attributes will only be present when the intel_pstate driver
is using HWP mode.

These attributes are:
- energy_performance_available_preferences
- energy_performance_preference

To get list of supported hints:
$ cat energy_performance_available_preferences
default performance balance_performance balance_power power

The current preference can be read or changed via cpufreq sysfs
attribute "energy_performance_preference". Reading from this attribute
will display current effective setting changed via any method. User can
write any of the valid preference string to this attribute. User can
always restore to power-on default by writing "default".

Implementation
Since these hints can be provided by direct MSR write or using some tools
like x86_energy_perf_policy, the driver internally doesn't maintain any
state. The user operation will result in direct read/write of MSR: 0x774
(HWP_REQUEST_MSR). Also driver use read modify write to update other
fields in this MSR.

Summary of changes:
- struct cpudata field epp_saved is renamed to epp_powersave, as this
stores the value to restore once policy is switched from performance
to powersave to restore original powersave EPP value.
- A new struct cpudata field epp_saved is used to store the raw MSR
EPP/EPB value when a CPU goes offline or on suspend and restore on
online/resume. This ensures that EPP value is restored to correct
value irrespective of the means used to set.
- EPP/EPB value ranges are fixed for each preference, which can be
set for the cpufreq sysfs, so user request is mapped to/from this
range.
- New attributes are only added when HWP is present.
- Since EPP value of 0 is valid the fields are initialized to
-EINVAL when not valid. The field epp_default is read only once
after powerup to avoid reading on subsequent CPU online operation
- New suspend callback to store epp on suspend operation
- Don't invalidate old epp_saved field on resume and online as now
we can restore last epp value on suspend and this field can still
have old EPP value sampled during switch to performance from
powersave.
- While here optimized setting of cpu_data->epp_powersave = epp in
intel_pstate_hwp_set() as this was done in both true and false
paths.
- epp/epb set function returns error to caller on failure to pass
on to user space for display.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# b59fe540 06-Dec-2016 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Add locking around HWP requests

To avoid race conditions from multiple threads, increase the scope
of intel_pstate_limits_lock to include HWP requests also.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Subject ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 58bf4542 13-Oct-2016 Piotr Luc <piotr.luc@intel.com>

cpufreq: intel_pstate: Add Knights Mill CPUID

Add Knights Mill (KNM) to the list of CPUIDs supported by intel_pstate.

Signed-off-by: Piotr Luc <piotr.luc@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 7a3ba767 25-Nov-2016 Arnd Bergmann <arnd@arndb.de>

cpufreq: intel_pstate: fix intel_pstate_exit_perf_limits() prototype

The addition of the generic governor support marked the
intel_pstate_exit_perf_limits as inline(), which fixed a warning,
but it introduced another warning:

drivers/cpufreq/intel_pstate.c: In function ‘intel_pstate_exit_perf_limits’:
drivers/cpufreq/intel_pstate.c:483:1: error: no return statement in function returning non-void [-Werror=return-type]

This changes it back to a 'void' return type, and changes the
corresponding intel_pstate_init_acpi_perf_limits() function to
be inline as well for consistency.

Fixes: 001c76f05b01 (cpufreq: intel_pstate: Generic governors support)
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 8442885f 24-Nov-2016 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Set EPP/EPB to 0 in performance mode

When user has selected performance policy, then set the EPP (Energy
Performance Preference) or EPB (Energy Performance Bias) to maximum
performance mode.

Also when user switch back to powersave, then restore EPP/EPB to last
EPP/EPB value before entering performance mode. If user has not changed
EPP/EPB manually then it will be power on default value.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 17669006 22-Nov-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq/intel_pstate: Use CPPC to get max performance

Use the acpi cppc_lib interface to get CPPC performance limits and update
the per cpu priority for the ITMT scheduler. If the highest performance of
CPUs differs the ITMT feature is enabled.

Co-developed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: linux-pm@vger.kernel.org
Cc: peterz@infradead.org
Cc: jolsa@redhat.com
Cc: rjw@rjwysocki.net
Cc: linux-acpi@vger.kernel.org
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: bp@suse.de
Link: http://lkml.kernel.org/r/0998b98943bcdec7d1ddd4ff27358da555ea8e92.1479844244.git.tim.c.chen@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# d5dd33d9 21-Nov-2016 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: increase precision of performance limits

Even with round up of limits->min_perf and limits->max_perf, in some
cases resultant performance is 100 MHz less than the desired.

For example when the maximum frequency is 3.50 GHz, setting
scaling_min_frequency to 2.3 GHz always results in 2.2 GHz minimum.

Currently the fixed floating point operation uses 8 bit precision for
calculating limits->min_perf and limits->max_perf. For some operations
in this driver the 14 bit precision is used. Using the 14 bit precision
also for calculating limits->min_perf and limits->max_perf, addresses
this issue.

Introduced fp_ext_toint() equivalent to fp_toint() and int_ext_tofp()
equivalent to int_tofp() with 14 bit precision.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 46992d6b 21-Nov-2016 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: round up min_perf limits

In some use cases, user wants to enforce a minimum performance limit on
CPUs. But because of simple division the resultant performance is 100 MHz
less than the desired in some cases.

For example when the maximum frequency is 3.50 GHz, setting
scaling_min_frequency to 1.6 GHz always results in 1.5 GHz minimum. With
simple round up, the frequency can be set to 1.6 GHz to minimum in this
case. This round up is already done to max_policy_pct and max_perf, so do
the same for min_policy_pct and min_perf.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 001c76f0 17-Nov-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Generic governors support

There may be reasons to use generic cpufreq governors (eg. schedutil)
on Intel platforms instead of the intel_pstate driver's internal
governor. However, that currently can only be done by disabling
intel_pstate altogether and using the acpi-cpufreq driver instead
of it, which is subject to limitations.

First of all, acpi-cpufreq only works on systems where the _PSS
object is present in the ACPI tables for all logical CPUs. Second,
on those systems acpi-cpufreq will only use frequencies listed by
_PSS which may be suboptimal. In particular, by convention, the
whole turbo range is represented in _PSS as a single P-state and
the frequency assigned to it is greater by 1 MHz than the greatest
non-turbo frequency listed by _PSS. That may confuse governors to
use turbo frequencies less frequently which may lead to suboptimal
performance.

For this reason, make it possible to use the intel_pstate driver
with generic cpufreq governors as a "normal" cpufreq driver. That
mode is enforced by adding intel_pstate=passive to the kernel
command line and cannot be disabled at run time. In that mode,
intel_pstate provides a cpufreq driver interface including
the ->target() and ->fast_switch() callbacks and is listed in
scaling_driver as "intel_cpufreq".

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Doug Smythies <dsmythies@telus.net>


# d0ea59e1 17-Nov-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Request P-states control from SMM if needed

Currently, intel_pstate is unable to control P-states on my
IvyBridge-based Acer Aspire S5, because they are controlled by SMM
on that machine by default and it is necessary to request OS control
of P-states from it via the SMI Command register exposed in the ACPI
FADT. intel_pstate doesn't do that now, but acpi-cpufreq and other
cpufreq drivers for x86 platforms do.

Address this problem by making intel_pstate use the ACPI-defined
mechanism as well. However, intel_pstate is not modular and it
doesn't need the module refcount tricks played by
acpi_processor_notify_smm(), so export the core of this function
to it as acpi_processor_pstate_control() and make it call that.
[The changes in processor_perflib.c related to this should not
make any functional difference for the acpi_processor_notify_smm()
users].

To be safe, only call acpi_processor_notify_smm() from intel_pstate
if ACPI _PPC support is enabled in it.

Suggested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>


# 7f7a516e 14-Nov-2016 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Use CPU load based algorithm for PM_MOBILE

Use get_target_pstate_use_cpu_load() to calculate target P-State for
devices, with the preferred power management profile in ACPI FADT
set to PM_MOBILE.

This may help in resolving some thermal issues caused by low sustained
cpu bound workloads. The current algorithm tend to over provision in this
case as it doesn't look at the CPU busyness.

Also included the fix from Arnd Bergmann <arnd@arndb.de> to solve compile
issue, when CONFIG_ACPI is not defined.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# a410c03d 28-Oct-2016 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: protect limits variable

The limits variable gets modified from intel_pstate sysfs and also gets
modified from cpufreq sysfs. So protect with a mutex to keep data
integrity, when they are getting modified from multiple threads.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 5879f877 25-Oct-2016 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Reduce impact due to rounding error

When policy->max and policy->min are same, in some cases they don't
result in the same frequency cap. The max_policy_pct is rounded up but
not min_perf_pct. So even when they are same, results in different
percentage or maximum and minimum.
Since minimum is a conservative value for power, a lower value without
rounding is better in most of the cases, unless user wants
policy->max = policy->min.
This change uses use the same policy percentage when policy->max and
policy->min are same.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# eae48f04 25-Oct-2016 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Per CPU P-State limits

Intel P-State offers two interface to set performance limits:
- Intel P-State sysfs
/sys/devices/system/cpu/intel_pstate/max_perf_pct
/sys/devices/system/cpu/intel_pstate/min_perf_pct
- cpufreq
/sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
/sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq

In the current implementation both of the above methods, change limits
to every CPU in the system. Moreover the limits placed using cpufreq
policy interface also presented in the Intel P-State sysfs via modified
max_perf_pct and min_per_pct during sysfs reads. This allows to check
percent of reduced/increased performance, irrespective of method used to
limit.

There are some new generations of processors, where it is possible to
have limits placed on individual CPU cores. Using cpufreq interface it
is possible to set limits on each CPU. But the current processing will
use last limits placed on all CPUs. So the per core limit feature of
CPUs can't be used.

This change brings in capability to set P-States limits for each CPU,
with some limitations. In this case what should be the read of
max_perf_pct and min_perf_pct? It can be most restrictive limits placed
on any CPU or max possible performance on any given CPU on which no
limits are placed. In either case someone will have issue.

So the consensus is, we can't have both sysfs controls present when user
wants to use limit per core limits.
- By default per-core-control feature is not enabled. So no one will
notice any difference.
- The way to enable is by kernel command line
intel_pstate=per_cpu_perf_limits
- When the per-core-controls are enabled there is no display of for both
read and write on
/sys/devices/system/cpu/intel_pstate/max_perf_pct
/sys/devices/system/cpu/intel_pstate/min_perf_pct
- User can change limits using
/sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
/sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq
/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
- User can still observe turbo percent and number of P-States from
/sys/devices/system/cpu/intel_pstate/turbo_pct
/sys/devices/system/cpu/intel_pstate/num_pstates
- User can read write system wide turbo status
/sys/devices/system/cpu/no_turbo

While changing this BUG_ON is changed to WARN_ON, as they are not fatal
errors for the system.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 2f1d407a 24-Oct-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Always set max P-state in performance mode

The only times at which intel_pstate checks the policy set for
a given CPU is the initialization of that CPU and updates of its
policy settings from cpufreq when intel_pstate_set_policy() is
invoked.

That is insufficient, however, because intel_pstate uses the same
P-state selection function for all CPUs regardless of the policy
setting for each of them and the P-state limits are shared between
them. Thus if the policy is set to "performance" for a particular
CPU, it may not behave as expected if the cpufreq settings are
changed subsequently for another CPU.

That can be easily demonstrated by writing "performance" to
scaling_governor for all CPUs and then switching it to "powersave"
for one of them in which case all of the CPUs will behave as though
their scaling_governor were all "powersave" (even though the policy
still appears to be "performance" for the remaining CPUs).

Fix this problem by modifying intel_pstate_adjust_busy_pstate() to
always set the P-state to the maximum allowed by the current limits
for all CPUs whose policy is set to "performance".

Note that it still is recommended to always change the policy setting
in the same way for all CPUs even with this fix applied to avoid
confusion.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# a6c6ead1 18-Oct-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Set P-state upfront in performance mode

After commit a4675fbc4a7a (cpufreq: intel_pstate: Replace timers with
utilization update callbacks) the cpufreq governor callbacks may not
be invoked on NOHZ_FULL CPUs and, in particular, switching to the
"performance" policy via sysfs may not have any effect on them. That
is a problem, because it usually is desirable to squeeze the last
bit of performance out of those CPUs, so work around it by setting
the maximum P-state (within the limits) in intel_pstate_set_policy()
upfront when the policy is CPUFREQ_POLICY_PERFORMANCE.

Fixes: a4675fbc4a7a (cpufreq: intel_pstate: Replace timers with utilization update callbacks)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>


# 185d8245 21-Oct-2016 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Remove PID debugfs when not used

When target state is calculated using get_target_pstate_use_cpu_load(),
PID controller is not used, hence it has no effect on performance.
So don't present debugfs entries to tune PID controller.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 1d29815e 18-Oct-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Drop boost_iowait flag

The "IOwait boosting" mechanism is only used by the
get_target_pstate_use_cpu_load() governor function and the
boost_iowait flag in pid_params is always set when that function
is in use (and it is never set otherwise). This means that the
boost_iowait flag is in fact redundant and may be dropped.

For this reason, replace the boost_iowait flag check in
intel_pstate_update_util() with an equivalent check against
pstate_funcs.get_target_pstate and drop that flag.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>


# 3954517e 11-Oct-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Fix struct pstate_adjust_policy kerneldoc

It looks like the name of struct pstate_adjust_policy was updated
without updating its kerneldoc comment accordingly, so fix that
mistake.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 0843e83c 06-Oct-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Proportional algorithm for Atom

The PID algorithm used by the intel_pstate driver tends to drive
performance to the minimum for workloads with utilization below the
setpoint, which is undesirable, so replace it with a modified
"proportional" algorithm on Atom.

The new algorithm will set the new P-state to be 1.25 times the
available maximum times the (frequency-invariant) utilization during
the previous sampling period except when the target P-state computed
this way is lower than the average P-state during the previous
sampling period. In the latter case, it will increase the target by
50% of the difference between it and the average P-state to prevent
performance from dropping down too fast in some cases.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>


# f00593a4 29-Sep-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Clarify comment in get_target_pstate_use_performance()

Make the comment explaining the meaning of the perf_scaled variable
in get_target_pstate_use_performance() more straightforward.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# f9f4872d 08-Oct-2016 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Fix unsafe HWP MSR access

This is a requirement that MSR MSR_PM_ENABLE must be set to 0x01 before
reading MSR_HWP_CAPABILITIES on a given CPU. If cpufreq init() is
scheduled on a CPU which is not same as policy->cpu or migrates to a
different CPU before calling msr read for MSR_HWP_CAPABILITIES, it
is possible that MSR_PM_ENABLE was not to set to 0x01 on that CPU.
This will cause GP fault. So like other places in this path
rdmsrl_on_cpu should be used instead of rdmsrl.

Moreover the scope of MSR_HWP_CAPABILITIES is on per thread basis, so it
should be read from the same CPU, for which MSR MSR_HWP_REQUEST is
getting set.

dmesg dump or warning:

[ 22.014488] WARNING: CPU: 139 PID: 1 at arch/x86/mm/extable.c:50 ex_handler_rdmsr_unsafe+0x68/0x70
[ 22.014492] unchecked MSR access error: RDMSR from 0x771
[ 22.014493] Modules linked in:
[ 22.014507] CPU: 139 PID: 1 Comm: swapper/0 Not tainted 4.7.5+ #1
...
...
[ 22.014516] Call Trace:
[ 22.014542] [<ffffffff813d7dd1>] dump_stack+0x63/0x82
[ 22.014558] [<ffffffff8107bc8b>] __warn+0xcb/0xf0
[ 22.014561] [<ffffffff8107bcff>] warn_slowpath_fmt+0x4f/0x60
[ 22.014563] [<ffffffff810676f8>] ex_handler_rdmsr_unsafe+0x68/0x70
[ 22.014564] [<ffffffff810677d9>] fixup_exception+0x39/0x50
[ 22.014604] [<ffffffff8102e400>] do_general_protection+0x80/0x150
[ 22.014610] [<ffffffff817f9ec8>] general_protection+0x28/0x30
[ 22.014635] [<ffffffff81687940>] ? get_target_pstate_use_performance+0xb0/0xb0
[ 22.014642] [<ffffffff810600c7>] ? native_read_msr+0x7/0x40
[ 22.014657] [<ffffffff81688123>] intel_pstate_hwp_set+0x23/0x130
[ 22.014660] [<ffffffff81688406>] intel_pstate_set_policy+0x1b6/0x340
[ 22.014662] [<ffffffff816829bb>] cpufreq_set_policy+0xeb/0x2c0
[ 22.014664] [<ffffffff81682f39>] cpufreq_init_policy+0x79/0xe0
[ 22.014666] [<ffffffff81682cb0>] ? cpufreq_update_policy+0x120/0x120
[ 22.014669] [<ffffffff816833a6>] cpufreq_online+0x406/0x820
[ 22.014671] [<ffffffff8168381f>] cpufreq_add_dev+0x5f/0x90
[ 22.014717] [<ffffffff81530ac8>] subsys_interface_register+0xb8/0x100
[ 22.014719] [<ffffffff816821bc>] cpufreq_register_driver+0x14c/0x210
[ 22.014749] [<ffffffff81fe1d90>] intel_pstate_init+0x39d/0x4d5
[ 22.014751] [<ffffffff81fe13f2>] ? cpufreq_gov_dbs_init+0x12/0x12

Cc: 4.3+ <stable@vger.kernel.org> # 4.3+
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 3ba7bcaa 13-Sep-2016 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Add io_boost trace

Add io_boost percent to current pstate_sample tracepoint.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 09c448d3 13-Sep-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Use IOWAIT flag in Atom algorithm

Modify the P-state selection algorithm for Atom processors to use
the new SCHED_CPUFREQ_IOWAIT flag instead of the questionable
get_cpu_iowait_time_us() function.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 42ce8921 11-Sep-2016 Julia Lawall <Julia.Lawall@lip6.fr>

intel_pstate: constify local structures

For structure types defined in the same file or local header files, find
top-level static structure declarations that have the following
properties:
1. Never reassigned.
2. Address never taken
3. Not passed to a top-level macro call
4. No pointer or array-typed field passed to a function or stored in a
variable.
Declare structures having all of these properties as const.

Done using Coccinelle.
Based on a suggestion by Joe Perches <joe@perches.com>.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 58919e83 16-Aug-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq / sched: Pass flags to cpufreq_update_util()

It is useful to know the reason why cpufreq_update_util() has just
been called and that can be passed as flags to cpufreq_update_util()
and to the ->func() callback in struct update_util_data. However,
doing that in addition to passing the util and max arguments they
already take would be clumsy, so avoid it.

Instead, use the observation that the schedutil governor is part
of the scheduler proper, so it can access scheduler data directly.
This allows the util and max arguments of cpufreq_update_util()
and the ->func() callback in struct update_util_data to be replaced
with a flags one, but schedutil has to be modified to follow.

Thus make the schedutil governor obtain the CFS utilization
information from the scheduler and use the "RT" and "DL" flags
instead of the special utilization value of ULONG_MAX to track
updates from the RT and DL sched classes. Make it non-modular
too to avoid having to export scheduler variables to modules at
large.

Next, update all of the other users of cpufreq_update_util()
and the ->func() callback in struct update_util_data accordingly.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 65c1262f 23-Jul-2016 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Add more out-of-band IDs

Add Skylake-X and Broadwell-X IDs for out-of-band (OBB) control of
P-States.

For these processors, if MSR_MISC_PWR_MGMT BIT(8) == 1, then the
Intel P-State driver should exit as OS can't control P-States.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw : Subject/changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# da7de91c 19-Jul-2016 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Check cpuid for MSR_HWP_INTERRUPT

The MSR MSR_HWP_INTERRUPT is valid only when CPUID.06H:EAX[8] = 1, so
check for feature before accessing this MSR.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# bc95a454 19-Jul-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

intel_pstate: Update cpu_frequency tracepoint every time

Currently, intel_pstate only updates the cpu_frequency tracepoint
if the new P-state to set is different from the current one, but
that causes powertop to report 100% idle on an 100% loaded system
sometimes.

Prevent that from happening by updating the cpu_frequency tracepoint
every time intel_pstate_update_pstate() is called.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>-


# 2630abc2 18-Jul-2016 Carsten Emde <C.Emde@osadl.org>

cpufreq: intel_pstate: clean remnant struct element

When I was working with the Intel P state driver I came across a
remnant struct element that is no longer needed after the function
intel_pstate_calc_freq() was retired.

Signed-off-by: Carsten Emde <C.Emde@osadl.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 5fc8f707 08-Jul-2016 Jan Kiszka <jan.kiszka@siemens.com>

intel_pstate: Fix MSR_CONFIG_TDP_x addressing in core_get_max_pstate()

If MSR_CONFIG_TDP_CONTROL is locked, we currently try to address some
MSR 0x80000648 or so. Mask out the relevant level bits 0 and 1.

Found while running over the Jailhouse hypervisor which became upset
about this strange MSR index.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: 4.4+ <stable@vger.kernel.org> # 4.4+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 100cf6f2 06-Jul-2016 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Replace MSR_NHM_TURBO_RATIO_LIMIT

Replace MSR_NHM_TURBO_RATIO_LIMIT with MSR_TURBO_RATIO_LIMIT.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 4a7cb7a9 27-Jun-2016 Jisheng Zhang <jszhang@marvell.com>

intel_pstate: Declare pid_params/pstate_funcs/hwp_active __read_mostly

pid_params is written once by copy_pid_params() during initialization,
and thereafter is mostly read by hot path intel_pstate_update_util().
The read of pid_params gets more after commit a4675fbc4a7a ("cpufreq:
intel_pstate: Replace timers with utilization update callbacks")

pstate_funcs is written once by copy_cpu_funcs() during initialization,
and thereafter is mostly read by hot path intel_pstate_update_util()

hwp_active is written to once during initialization and thereafter is
mostly read by hot path intel_pstate_update_util().

The fact that they are mostly read and not written to makes them
candidates for __read_mostly declarations.

Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 29327c84 27-Jun-2016 Jisheng Zhang <jszhang@marvell.com>

intel_pstate: add __init/__initdata marker to some functions/variables

These functions/variables are not needed after booting, so mark them
as __init or __initdata.

Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# eed43609 27-Jun-2016 Jisheng Zhang <jszhang@marvell.com>

intel_pstate: Fix incorrect placement of __initdata

__initdata should be placed between the variable name and equal sign
(if there is) for the variable to be placed in the intended section.

Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 5ab666e0 27-Jun-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

intel_pstate: Do not clear utilization update hooks on policy changes

intel_pstate_set_policy() is invoked by the cpufreq core during
driver initialization, on changes of policy attributes (minimim and
maximum frequency, for example) via sysfs and via CPU notifications
from the platform firmware. On some platforms the latter may occur
relatively often.

Commit bb6ab52f2bef (intel_pstate: Do not set utilization update hook
too early) made intel_pstate_set_policy() clear the CPU's utilization
update hook before updating the policy attributes for it (and set the
hook again after doind that), but that involves invoking
synchronize_sched() and adds overhead to the CPU notifications
mentioned above and to the sched-RCU handling in general.

That extra overhead is arguably not necessary, because updating
policy attributes when the CPU's utilization update hook is active
should not lead to any adverse effects, so drop the clearing of
the hook from intel_pstate_set_policy() and make it check if
the hook has been set already when attempting to set it.

Fixes: bb6ab52f2bef (intel_pstate: Do not set utilization update hook too early)
Reported-by: Jisheng Zhang <jszhang@marvell.com>
Tested-by: Jisheng Zhang <jszhang@marvell.com>
Tested-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# b00345d1 15-Jun-2016 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Adjust _PSS[0] freqeuency if needed

The maximum turbo P-State used by the intel_pstate driver may be
limited by ACPI _PSS table entry 0. After commit 9522a2ff9cde
(cpufreq: intel_pstate: Enforce _PPC limits), the maximum performance
on servers will be capped by the _PSS table entry 0 by default.

Even though that is formally correct, it may lead to preformance
regressions in some cases. Namely, if the _PSS table entry 0 is
not the maximum turbo P-State, performance measured after commit
9522a2ff9cde will not match the performance measured before that
commit on the same system.

For this reason, modify the code to always use the maximum turbo
frequency as the one that corresponds to _PSS table entry 0 if turbo
is enabled in the BIOS. This way, the performance levels from
before commit 9522a2ff9cde will be restored on the affected systems.

Fixes: 9522a2ff9cde (cpufreq: intel_pstate: Enforce _PPC limits)
Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw : Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 41bad47f 09-Jun-2016 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Broxton support

Add Broxton CPU model number.

Broxton requires core_params to get performance limits via MSRs, but
it is an Atom platform, which requires more power optimized algorithm.

So the P state selection will use similar algorithm as other Atom
platforms.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 5b20c944 02-Jun-2016 Dave Hansen <dave.hansen@linux.intel.com>

x86/cpufreq: Use Intel family name macros for the intel_pstate cpufreq driver

Another straightforward replacement of magic numbers.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: jacob.jun.pan@intel.com
Cc: linux-pm@vger.kernel.org
Link: http://lkml.kernel.org/r/20160603001945.0F5D02AA@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 983e600e 07-Jun-2016 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Fix ->set_policy() interface for no_turbo

When turbo is disabled, the ->set_policy() interface is broken.

For example, when turbo is disabled and cpuinfo.max = 2900000 (full
max turbo frequency), setting the limits results in frequency less
than the requested one:
Set 1000000 KHz results in 0700000 KHz
Set 1500000 KHz results in 1100000 KHz
Set 2000000 KHz results in 1500000 KHz

This is because the limits->max_perf fraction is calculated using
the max turbo frequency as the reference, but when the max P-State is
capped in intel_pstate_get_min_max(), the reference is not the max
turbo P-State. This results in reducing max P-State.

One option is to always use max turbo as reference for calculating
limits. But this will not be correct. By definition the intel_pstate
sysfs limits, shows percentage of available performance. So when
BIOS has disabled turbo, the available performance is max non turbo.
So the max_perf_pct should still show 100%.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw : Subject & changelog, rewrite in fewer lines of code ]
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 2c2c1af4 07-Jun-2016 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Fix code ordering in intel_pstate_set_policy()

The limits->max_perf is rounded_up but immediately overwritten by
another assignment to limits->max_perf.

Move that operation to the correct location.

While here also added a pr_debug() call in ->set_policy to aid in
debugging.

Fixes: 785ee2788141 (cpufreq: intel_pstate: Fix limits->max_perf rounding error)
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw : Subject & changelog ]
Cc: 4.4+ <stable@vger.kernel.org> # 4.4+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 6cacd115 30-May-2016 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Downgrade print level for _PPC

Downgrade pr_info to pr_debug for the "_PPC limits will be enforced"
message.

In server systems with many cores this message is annoying.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# c749c64f 11-May-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

intel_pstate: Simplify conditional in intel_pstate_set_policy()

One of the if () statements in intel_pstate_set_policy() causes
another if () to be evaluated if the condition is true and it
doesn't do anything else, so merge the two if () statements into
one.

No functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>


# 1aa7a6e2 11-May-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

intel_pstate: Clean up get_target_pstate_use_performance()

The comments and the core_busy variable name in
get_target_pstate_use_performance() are totally confusing,
so modify them to reflect what's going on.

The results of the computations should be the same as before.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 8edb0a6e 11-May-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

intel_pstate: Use sample.core_avg_perf in get_avg_pstate()

Notice that get_avg_pstate() can use sample.core_avg_perf instead of
carrying the same division again, so make it do that.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# a1c9787d 11-May-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

intel_pstate: Clarify average performance computation

The core_pct_busy field of struct sample actually contains the
average performace during the last sampling period (in percent)
and not the utilization of the core as suggested by its name
which is confusing.

For this reason, change the name of that field to core_avg_perf
and rename the function that computes its value accordingly.

Also notice that storing this value as percentage requires a costly
integer multiplication to be carried out in a hot path, so instead
store it as an "extended fixed point" value with more fraction bits
and update the code using it accordingly (it is better to change the
name of the field along with its meaning in one go than to make those
two changes separately, as that would likely lead to more
confusion).

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 4578ee7e 11-May-2016 Chen Yu <yu.c.chen@intel.com>

intel_pstate: Avoid unnecessary synchronize_sched() during initialization

Currently, in intel_pstate_clear_update_util_hook(), after
clearing the utilization update hook, we leverage
synchronize_sched() to deal with synchronization, which
is a little bit time-costly because synchronize_sched()
has to wait for all the CPUs to go through a grace period.

Actually, the synchronize_sched() is not necessary if the utilization
update hook has not been set for the given CPU yet, so make the driver
check if that's the case and avoid the synchronize_sched() call then.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=116371
Tested-by: Tian Ye <yex.tian@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
[ rjw : Rebase ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# f96fd0c8 06-May-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

intel_pstate: Clean up intel_pstate_get()

intel_pstate_get() contains a local variable that's initialized but
never used and it can be written in fewer lines of code, so clean
it up.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>


# e59a8f7f 04-May-2016 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Ignore _PPC processing under HWP

When HWP (hardware P states) feature is active, the ACPI _PSS and _PPC
is not used. So ignore processing for _PPC limits.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 6d45b719 04-May-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

intel_pstate: Fix intel_pstate_get()

After commit 8fa520af5081 "intel_pstate: Remove freq calculation from
intel_pstate_calc_busy()" intel_pstate_get() calls get_avg_frequency()
to compute the average frequency, which is problematic for two reasons.

First, intel_pstate_get() may be invoked before the driver reads the
CPU feedback registers for the first time and if that happens,
get_avg_frequency() will attempt to divide by zero.

Second, the get_avg_frequency() call in intel_pstate_get() is racy
with respect to intel_pstate_sample() and it may end up returning
completely meaningless values for this reason.

Moreover, after commit 7349ec0470b6 "intel_pstate: Move
intel_pstate_calc_busy() into get_target_pstate_use_performance()"
sample.core_pct_busy is never computed on Atom, but it is used in
intel_pstate_adjust_busy_pstate() in that case too.

To address those problems notice that if sample.core_pct_busy
was used in the average frequency computation carried out by
get_avg_frequency(), both the divide by zero problem and the
race with respect to intel_pstate_sample() would be avoided.

Accordingly, move the invocation of intel_pstate_calc_busy() from
get_target_pstate_use_performance() to intel_pstate_update_util(),
which also will take care of the uninitialized sample.core_pct_busy
on Atom, and modify get_avg_frequency() to use sample.core_pct_busy
as per the above.

Reported-by: kernel test robot <ying.huang@linux.intel.com>
Link: http://marc.info/?l=linux-kernel&m=146226437623173&w=4
Fixes: 8fa520af5081 "intel_pstate: Remove freq calculation from intel_pstate_calc_busy()"
Fixes: 7349ec0470b6 "intel_pstate: Move intel_pstate_calc_busy() into get_target_pstate_use_performance()"
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# ba41e1bc 01-May-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Fix HWP on boot CPU after system resume

Commit 41cfd64cf49fc "Update frequencies of policy->cpus only from
->set_policy()" changed the way the intel_pstate driver's ->set_policy
callback updates the HWP (hardware-managed P-states) settings.
A side effect of it is that if those settings are modified on the
boot CPU during system suspend and wakeup, they will never be
restored during subsequent system resume.

To address this problem, allow cpufreq drivers that don't provide
->target or ->target_index callbacks to use ->suspend and ->resume
callbacks and add a ->resume callback to intel_pstate to restore
the HWP settings on the CPUs that belong to the given policy.

Fixes: 41cfd64cf49fc "Update frequencies of policy->cpus only from ->set_policy()"
Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>


# 2b3ec765 27-Apr-2016 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Enable PPC enforcement for servers

For platforms which are controlled via remove node manager, enable _PPC by
default. These platforms are mostly categorized as enterprise server or
performance servers. These platforms needs to go through some
certifications tests, which tests control via _PPC.
The relative risk of enabling by default is low as this is is less likely
that these systems have broken _PSS table.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 3be9200d 27-Apr-2016 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Adjust policy->max

When policy->max is changed via _PPC or sysfs and is more than the max non
turbo frequency, it does not really change resulting performance in some
processors. When policy->max results in a P-State ratio more than the
turbo activation ratio, then processor can choose any P-State up to max
turbo. So the user or _PPC setting has no value, but this can cause
undesirable side effects like:
- Showing reduced max percentage in Intel P-State sysfs
- It can cause reduced max performance under certain boundary conditions:
The requested max scaling frequency either via _PPC or via cpufreq-sysfs,
will be converted into a fixed floating point max percent scale. In
majority of the cases this will result in correct max. But not 100% of the
time. If the _PPC is requested at a point where the calculation lead to a
lower max, this can result in a lower P-State then expected and it will
impact performance.
Example of this condition using a Broadwell laptop with config TDP.

ACPI _PSS table from a Broadwell laptop
2301000 2300000 2200000 2000000 1900000 1800000 1700000 1500000 1400000
1300000 1100000 1000000 900000 800000 600000 500000

The actual results by disabling config TDP so that we can get what is
requested on or below 2300000Khz.

scaling_max_freq Max Requested P-State Resultant scaling
max
---------------------------------------- ----------------------
2400000 18 2900000 (max
turbo)
2300000 17 2300000 (max
physical non turbo)
2200000 15 2100000
2100000 15 2100000
2000000 13 1900000
1900000 13 1900000
1800000 12 1800000
1700000 11 1700000
1600000 10 1600000
1500000 f 1500000
1400000 e 1400000
1300000 d 1300000
1200000 c 1200000
1100000 a 1000000
1000000 a 1000000
900000 9 900000
800000 8 800000
700000 7 700000
600000 6 600000
500000 5 500000
------------------------------------------------------------------

Now set the config TDP level 1 ratio as 0x0b (equivalent to 1100000KHz)
in BIOS (not every system will let you adjust this).
The turbo activation ratio will be set to one less than that, which will
be 0x0a (So any request above 1000000KHz should result in turbo region
assuming no thermal limits).
Here _PPC will request max to 1100000KHz (which basically should still
result in turbo as this is more than the turbo activation ratio up to
max allowable turbo frequency), but actual calculation resulted in a max
ceiling P-State which is 0x0a. So under any load condition, this driver
will not request turbo P-States. This will be a huge performance hit.

When config TDP feature is ON, if the _PPC points to a frequency above
turbo activation ratio, the performance can still reach max turbo. In this
case we don't need to treat this as the reduced frequency in set_policy
callback.

In this change when config TDP is active (by checking if the physical max
non turbo ratio is more than the current max non turbo ratio), any request
above current max non turbo is treated as full performance.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw : Minor cleanups ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 9522a2ff 27-Apr-2016 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Enforce _PPC limits

Use ACPI _PPC notification to limit max P state driver will request.
ACPI _PPC change notification is sent by BIOS to limit max P state
in several cases:
- Reduce impact of platform thermal condition
- When Config TDP feature is used, a changed _PPC is sent to
follow TDP change
- Remote node managers in server want to control platform power
via baseboard management controller (BMC)

This change registers with ACPI processor performance lib so that
_PPC changes are notified to cpufreq core, which in turns will
result in call to .setpolicy() callback. Also the way _PSS
table identifies a turbo frequency is not compatible to max turbo
frequency in intel_pstate, so the very first entry in _PSS needs
to be adjusted.

This feature can be turned on by using kernel parameters:
intel_pstate=support_acpi_ppc

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Minor cleanups ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 1becf035 22-Apr-2016 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Fix processing for turbo activation ratio

When the config TDP level is not nominal (level = 0), the MSR values for
reading level 1 and level 2 ratios contain power in low 14 bits and actual
ratio bits are at bits [23:16]. The current processing for level 1 and
level 2 is wrong as there is no shift done to get actual ratio.

Fixes: 6a35fc2d6c22 (cpufreq: intel_pstate: get P1 from TAR when available)
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: 4.4+ <stable@vger.kernel.org> # 4.4+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# bdcaa23f 22-Apr-2016 Philippe Longepe <philippe.longepe@linux.intel.com>

cpufreq: intel_pstate: Use average P-State instead of current P-State

The result returned by pid_calc() is subtracted from current_pstate
(which is the P-State requested during the last period) in order to
obtain the target P-State for the current iteration.

However, current_pstate may not reflect the real current P-State of
the CPU. In particular, that P-State may be higher because of the
frequency sharing per module.

The theory is:
- The load is the percentage of time spent in C0 and is related to
the average P-State during the same period.
- The last requested P-State can be completely different than the
average P-State (because of frequency sharing or throttling).
- The P-State shift computed by the pid_calc is based on the load
computed at average P-State, so the shift must be relative to
this average P-State.

Using the average P-State instead of current P-State improves power
without significant performance penalty in cases when a task migrates
from one core to other core sharing frequency and voltage.

Performance and power comparison with this patch on Cherry Trail
platform using Android:

Benchmark ?Perf ?Power
FishTank 10.45% 3.1%
SmartBench-Gaming -0.1% -10.4%
SmartBench-Productivity -0.8% -10.4%
CandyCrush n/a -17.4%
AngryBirds n/a -5.9%
videoPlayback n/a -13.9%
audioPlayback n/a -4.9%
IcyRocks-20-50 0.0% -38.4%
iozone RR -0.16% -1.3%
iozone RW 0.74% -1.3%

Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# ffb81056 09-Apr-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

intel_pstate: Avoid getting stuck in high P-states when idle

Jörg Otte reports that commit a4675fbc4a7a (cpufreq: intel_pstate:
Replace timers with utilization update callbacks) caused the CPUs in
his Haswell-based system to stay in the very high frequency region
even if the system is completely idle.

That turns out to be an existing problem in the intel_pstate driver's
P-state selection algorithm for Core processors. Namely, all
decisions made by that algorithm are based on the average frequency
of the CPU between sampling events and on the P-state requested on
the last invocation, so it may get stuck at a very hight frequency
even if the utilization of the CPU is very low (in fact, it may get
stuck in a inadequate P-state regardless of the CPU utilization).
The only way to kick it out of that limbo is a sufficiently long idle
period (3 times longer than the prescribed sampling interval), but if
that doesn't happen often enough (eg. due to a timing change like
after the above commit), the P-state of the CPU may be inadequate
pretty much all the time.

To address the most egregious manifestations of that issue, reset the
core_busy value used to determine the next P-state to request if the
utilization of the CPU, determined with the help of the MPERF
feedback register and the TSC, is below 1%.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=115771
Reported-and-tested-by: Jörg Otte <jrg.otte@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 4836df17 05-Apr-2016 Joe Perches <joe@perches.com>

intel_pstate: Use pr_fmt

Prefix the output using the more common kernel style.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Rebase ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 22590efb 08-Apr-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

intel_pstate: Avoid pointless FRAC_BITS shifts under div_fp()

There are multiple places in intel_pstate where int_tofp() is applied
to both arguments of div_fp(), but this is pointless, because int_tofp()
simply shifts its argument to the left by FRAC_BITS which mathematically
is equivalent to multuplication by 2^FRAC_BITS, so if this is done
to both arguments of a division, the extra factors will cancel each
other during that operation anyway.

Drop the pointless int_tofp() applied to div_fp() arguments throughout
the driver.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 13ad7701 03-Apr-2016 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Documenation for structures

No code change. Only added kernel doc style comments for structures.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 30a39153 03-Apr-2016 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: fix inconsistency in setting policy limits

When user sets performance policy using cpufreq interface, it is possible
that because of policy->max limits, the actual performance is still
limited. But the current implementation will silently switch the
policy to powersave and start using powersave limits. If user modifies
any limits using intel_pstate sysfs, this is actually changing powersave
limits.

The current implementation tracks limits under powersave and performance
policy using two different variables. When policy->max is less than
policy->cpuinfo.max_freq, only powersave limit variable is used.

This fix causes the performance limits variable to be used always when
the policy is performance.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 0bed612b 01-Apr-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: sched: Helpers to add and remove update_util hooks

Replace the single helper for adding and removing cpufreq utilization
update hooks, cpufreq_set_update_util_data(), with a pair of helpers,
cpufreq_add_update_util_hook() and cpufreq_remove_update_util_hook(),
and modify the users of cpufreq_set_update_util_data() accordingly.

With the new helpers, the code using them doesn't need to worry
about the internals of struct update_util_data and in particular
it doesn't need to worry about populating the func field in it
properly upfront.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>


# febce40f 01-Apr-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

intel_pstate: Avoid extra invocation of intel_pstate_sample()

The initialization of intel_pstate for a given CPU involves populating
the fields of its struct cpudata that represent the previous sample,
but currently that is done in a problematic way.

Namely, intel_pstate_init_cpu() makes an extra call to
intel_pstate_sample() so it reads the current register values that
will be used to populate the "previous sample" record during the
next invocation of intel_pstate_sample(). However, after commit
a4675fbc4a7a (cpufreq: intel_pstate: Replace timers with utilization
update callbacks) that doesn't work for last_sample_time, because
the time value is passed to intel_pstate_sample() as an argument now.
Passing 0 to it from intel_pstate_init_cpu() is problematic, because
that causes cpu->last_sample_time == 0 to be visible in
get_target_pstate_use_performance() (and hence the extra
cpu->last_sample_time > 0 check in there) and effectively allows
the first invocation of intel_pstate_sample() from
intel_pstate_update_util() to happen immediately after the
initialization which may lead to a significant "turn on"
effect in the governor algorithm.

To mitigate that issue, rework the initialization to avoid the
extra intel_pstate_sample() call from intel_pstate_init_cpu().
Instead, make intel_pstate_sample() return false if it has been
called with cpu->sample.time equal to zero, which will make
intel_pstate_update_util() skip the sample in that case, and
reset cpu->sample.time from intel_pstate_set_update_util_hook()
to make the algorithm start properly every time the hook is set.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# bb6ab52f 31-Mar-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

intel_pstate: Do not set utilization update hook too early

The utilization update hook in the intel_pstate driver is set too
early, as it only should be set after the policy has been fully
initialized by the core. That may cause intel_pstate_update_util()
to use incorrect data and put the CPUs into incorrect P-states as
a result.

To prevent that from happening, make intel_pstate_set_policy() set
the utilization update hook instead of intel_pstate_init_cpu() so
intel_pstate_update_util() only runs when all things have been
initialized as appropriate.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# fdfdb2b1 18-Mar-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

intel_pstate: Do not call wrmsrl_on_cpu() with disabled interrupts

After commit a4675fbc4a7a (cpufreq: intel_pstate: Replace timers with
utilization update callbacks) wrmsrl_on_cpu() cannot be called in the
intel_pstate_adjust_busy_pstate() path as that is executed with
disabled interrupts. However, atom_set_pstate() called from there
via intel_pstate_set_pstate() uses wrmsrl_on_cpu() to update the
IA32_PERF_CTL MSR which triggers the WARN_ON_ONCE() in
smp_call_function_single().

The reason why wrmsrl_on_cpu() is used by atom_set_pstate() is
because intel_pstate_set_pstate() calling it is also invoked during
the initialization and cleanup of the driver and in those cases it is
not guaranteed to be run on the CPU that is being updated. However,
in the case when intel_pstate_set_pstate() is called by
intel_pstate_adjust_busy_pstate(), wrmsrl() can be used to update
the register safely. Moreover, intel_pstate_set_pstate() already
contains code that only is executed if the function is called by
intel_pstate_adjust_busy_pstate() and there is a special argument
passed to it because of that.

To fix the problem at hand, rearrange the code taking the above
observations into account.

First, replace the ->set() callback in struct pstate_funcs with a
->get_val() one that will return the value to be written to the
IA32_PERF_CTL MSR without updating the register.

Second, split intel_pstate_set_pstate() into two functions,
intel_pstate_update_pstate() to be called by
intel_pstate_adjust_busy_pstate() that will contain all of the
intel_pstate_set_pstate() code which only needs to be executed in
that case and will use wrmsrl() to update the MSR (after obtaining
the value to write to it from the ->get_val() callback), and
intel_pstate_set_min_pstate() to be invoked during the
initialization and cleanup that will set the P-state to the
minimum one and will update the MSR using wrmsrl_on_cpu().

Finally, move the code shared between intel_pstate_update_pstate()
and intel_pstate_set_min_pstate() to a new static inline function
intel_pstate_record_pstate() and make them both call it.

Of course, that unifies the handling of the IA32_PERF_CTL MSR writes
between Atom and Core.

Fixes: a4675fbc4a7a (cpufreq: intel_pstate: Replace timers with utilization update callbacks)
Reported-and-tested-by: Josh Boyer <jwboyer@fedoraproject.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 4fec7ad5 10-Mar-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

intel_pstate: Do not skip samples partially

If the current value of MPERF or the current value of TSC is the
same as the previous one, respectively, intel_pstate_sample() bails
out early and skips the sample.

However, intel_pstate_adjust_busy_pstate() is still called in that
case which is not correct, so modify intel_pstate_sample() to
return a bool value indicating whether or not the sample has been
taken and use it to decide whether or not to call
intel_pstate_adjust_busy_pstate().

While at it, remove redundant parentheses from the MPERF/TSC
check in intel_pstate_sample().

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>


# 8fa520af 06-Mar-2016 Philippe Longepe <philippe.longepe@linux.intel.com>

intel_pstate: Remove freq calculation from intel_pstate_calc_busy()

Use a helper function to compute the average pstate and call it only
where it is needed (only when tracing or in intel_pstate_get).

Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 7349ec04 06-Mar-2016 Philippe Longepe <philippe.longepe@linux.intel.com>

intel_pstate: Move intel_pstate_calc_busy() into get_target_pstate_use_performance()

The cpu_load algorithm doesn't need to invoke intel_pstate_calc_busy(),
so move that call from intel_pstate_sample() to
get_target_pstate_use_performance().

Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# a158bed5 06-Mar-2016 Philippe Longepe <philippe.longepe@linux.intel.com>

intel_pstate: Optimize calculation for max/min_perf_adj

mul_fp(int_tofp(A), B) expands to:
((A << FRAC_BITS) * B) >> FRAC_BITS, so the same result can be obtained
via simple multiplication A * B. Apply this observation to
max_perf * limits->max_perf and max_perf * limits->min_perf in
intel_pstate_get_min_max()."

Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# b54a0dfd 08-Mar-2016 Philippe Longepe <philippe.longepe@linux.intel.com>

intel_pstate: Remove extra conversions in pid calculation

pid->setpoint and pid->deadband can be initialized in fixed point, so we
can avoid the int_tofp in pid_calc.

Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 08f511fd 03-Mar-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: Reduce cpufreq_update_util() overhead a bit

Use the observation that cpufreq_update_util() is only called
by the scheduler with rq->lock held, so the callers of
cpufreq_set_update_util_data() can use synchronize_sched()
instead of synchronize_rcu() to wait for cpufreq_update_util()
to complete. Moreover, if they are updated to do that,
rcu_read_(un)lock() calls in cpufreq_update_util() might be
replaced with rcu_read_(un)lock_sched(), respectively, but
those aren't really necessary, because the scheduler calls
that function from RCU-sched read-side critical sections
already.

In addition to that, if cpufreq_set_update_util_data() checks
the func field in the struct update_util_data before setting
the per-CPU pointer to it, the data->func check may be dropped
from cpufreq_update_util() as well.

Make the above changes to reduce the overhead from
cpufreq_update_util() in the scheduler paths invoking it
and to make the cleanup after removing its callbacks less
heavy-weight somewhat.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>


# a4675fbc 04-Feb-2016 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Replace timers with utilization update callbacks

Instead of using a per-CPU deferrable timer for utilization sampling
and P-states adjustments, register a utilization update callback that
will be invoked from the scheduler on utilization changes.

The sampling rate is still the same as what was used for the deferrable
timers, so the functional impact of this patch should not be significant.

Based on an earlier patch from Srinivas Pandruvada.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>


# f05c9665 25-Feb-2016 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: disable HWP notifications

Disable HWP Interrupt notification before enabling HWP. Since we don't
have HWP interrupt handling for possible performance interrupts, there
is not much use of enabling HWP interrupts.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 7791e4aa 25-Feb-2016 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Enable HWP by default

If the processor supports HWP, enable it by default without checking
for the cpu model. This will allow to enable HWP in all supported
processors without driver change.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 41cfd64c 21-Feb-2016 Viresh Kumar <viresh.kumar@linaro.org>

intel_pstate: Update frequencies of policy->cpus only from ->set_policy()

The intel-pstate driver is using intel_pstate_hwp_set() from two
separate paths, i.e. ->set_policy() callback and sysfs update path for
the files present in /sys/devices/system/cpu/intel_pstate/ directory.

While an update to the sysfs path applies to all the CPUs being managed
by the driver (which essentially means all the online CPUs), the update
via the ->set_policy() callback applies to a smaller group of CPUs
managed by the policy for which ->set_policy() is called.

And so, intel_pstate_hwp_set() should update frequencies of only the
CPUs that are part of policy->cpus mask, while it is called from
->set_policy() callback.

In order to do that, add a parameter (cpumask) to intel_pstate_hwp_set()
and apply the frequency changes only to the concerned CPUs.

For ->set_policy() path, we are only concerned about policy->cpus, and
so policy->rwsem lock taken by the core prior to calling ->set_policy()
is enough to take care of any races. The larger lock acquired by
get_online_cpus() is required only for the updates to sysfs files.

Add another routine, intel_pstate_hwp_set_online_cpus(), and call it
from the sysfs update paths.

This also fixes a lockdep reported recently, where policy->rwsem and
get_online_cpus() could have been acquired in any order causing an ABBA
deadlock. The sequence of events leading to that was:

intel_pstate_init(...)
...cpufreq_online(...)
down_write(&policy->rwsem); // Locks policy->rwsem
...
cpufreq_init_policy(policy);
...intel_pstate_hwp_set();
get_online_cpus(); // Temporarily locks cpu_hotplug.lock
...
up_write(&policy->rwsem);

pm_suspend(...)
...disable_nonboot_cpus()
_cpu_down()
cpu_hotplug_begin(); // Locks cpu_hotplug.lock
__cpu_notify(CPU_DOWN_PREPARE, ...);
...cpufreq_offline_prepare();
down_write(&policy->rwsem); // Locks policy->rwsem

Reported-and-tested-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# bc696ca0 26-Jan-2016 Borislav Petkov <bp@suse.de>

x86/cpufeature: Replace the old static_cpu_has() with safe variant

So the old one didn't work properly before alternatives had run.
And it was supposed to provide an optimized JMP because the
assumption was that the offset it is jumping to is within a
signed byte and thus a two-byte JMP.

So I did an x86_64 allyesconfig build and dumped all possible
sites where static_cpu_has() was used. The optimization amounted
to all in all 12(!) places where static_cpu_has() had generated
a 2-byte JMP. Which has saved us a whopping 36 bytes!

This clearly is not worth the trouble so we can remove it. The
only place where the optimization might count - in __switch_to()
- we will handle differently. But that's not subject of this
patch.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1453842730-28463-6-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 88b7b7c0 08-Dec-2015 Prarit Bhargava <prarit@redhat.com>

cpufreq: intel_pstate: Minor cleanup for FRAC_BITS

785ee27 ("cpufreq: intel_pstate: Fix limits->max_perf rounding error")
hardcodes the value of FRAC_BITS. This patch fixes that minor issue.

Fixes: 785ee2788141 (cpufreq: intel_pstate: Fix limits->max_perf rounding error)
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 63d1d656 04-Dec-2015 Philippe Longepe <philippe.longepe@intel.com>

cpufreq: intel_pstate: Account for IO wait time

In cases where we have many IOs, the global load becomes low and the
load algorithm will decrease the requested P-State. Because of that,
the IOs overheads will increase and impact the IO performances.

To improve IO bound work, we can count the io-wait time as busy time
in calculating CPU busy.

This change uses get_cpu_iowait_time_us() to obtain the IO wait time value
and converts time into number of cycles spent waiting on IO at the TSC
rate. At the moment, this trick is only used for Atom.

Signed-off-by: Philippe Longepe <philippe.longepe@intel.com>
Signed-off-by: Stephane Gasparini <stephane.gasparini@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# e70eed2b 04-Dec-2015 Philippe Longepe <philippe.longepe@intel.com>

cpufreq: intel_pstate: Account for non C0 time

The current function to calculate cpu utilization uses the average P-state
ratio (APerf/Mperf) scaled by the ratio of the current P-state to the
max available non-turbo one. This leads to an overestimation of
utilization which causes higher-performance P-states to be selected more
often and that leads to increased energy consumption.

This is a problem for low-power systems, so it is better to use a
different utilization calculation algorithm for them.

Namely, the Percent Busy value (or load) can be estimated as the ratio of the
MPERF counter that runs at a constant rate only during active periods (C0) to
the time stamp counter (TSC) that also runs (at the same rate) during idle.
That is:

Percent Busy = 100 * (delta_mperf / delta_tsc)

Use this algorithm for platforms with SoCs based on the Airmont and Silvermont
Atom cores.

Signed-off-by: Philippe Longepe <philippe.longepe@intel.com>
Signed-off-by: Stephane Gasparini <stephane.gasparini@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 157386b6 04-Dec-2015 Philippe Longepe <philippe.longepe@intel.com>

cpufreq: intel_pstate: Configurable algorithm to get target pstate

Target systems using different cpus have different power and performance
requirements. They may use different algorithms to get the next P-state
based on their power or performance preference.

For example, power-constrained systems may not want to use
high-performance P-states as aggressively as a full-size desktop or a
server platform. A server platform may want to run close to the max to
achieve better performance, while laptop-like systems may prefer
sacrificing performance for longer battery lifes.

For the above reasons, modify intel_pstate to allow the target P-state
selection algorithm to be depend on the CPU ID.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Philippe Longepe <philippe.longepe@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 584ee3dc 18-Nov-2015 Alexandra Yates <alexandra.yates@linux.intel.com>

intel_pstate: Fix "performance" mode behavior with HWP enabled

If hardware-driven P-state selection (HWP) is enabled, the
"performance" mode of intel_pstate should only allow the processor
to use the highest-performance P-state available. That is not
the case currently, so make it actually happen.

Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Alexandra Yates <alexandra.yates@linux.intel.com>
[ rjw: Subject and changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 785ee278 20-Nov-2015 Prarit Bhargava <prarit@redhat.com>

cpufreq: intel_pstate: Fix limits->max_perf rounding error

A rounding error was found in the calculation of limits->max_perf
in intel_pstate_set_policy(), which is used to calculate the max and min
pstate values in intel_pstate_get_min_max(). In that code,
limits->max_perf is truncated to 2 hex digits such that, for example,
0x169 was incorrectly calculated to 0x16 instead of 0x17. This resulted in
the pstate being set one level too low. This patch rounds the value of
limits->max_perf up instead of down so that the correct max pstate can
be reached.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 8478f539 20-Nov-2015 Prarit Bhargava <prarit@redhat.com>

cpufreq: intel_pstate: Fix limits->max_policy_pct rounding error

I have a Intel (6,63) processor with a "marketing" frequency (from
/proc/cpuinfo) of 2100MHz, and a max turbo frequency of 2600MHz. I
can execute

cpupower frequency-set -g powersave --min 1200MHz --max 2100MHz

and the max_freq_pct is set to 80. When adding load to the system I noticed
that the cpu frequency only reached 2000MHZ and not 2100MHz as expected.

This is because limits->max_policy_pct is calculated as 2100 * 100 /2600 = 80.7
and is rounded down to 80 when it should be rounded up to 81. This patch
adds a DIV_ROUND_UP() which will return the correct value.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 1421df63 09-Nov-2015 Philippe Longepe <philippe.longepe@linux.intel.com>

cpufreq: intel_pstate: Add separate support for Airmont cores

There are two flavors of Atom cores to be supported by intel_pstate,
Silvermont and Airmont, so make the driver distinguish between them by
adding separate frequency tables.

Separate the CPU defaults params for each of them and match the CPU IDs
against them as appropriate.

Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
Signed-off-by: Stephane Gasparini <stephane.gasparini@linux.intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Subject and changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 938d21a2 09-Nov-2015 Philippe Longepe <philippe.longepe@linux.intel.com>

cpufreq: intel_pstate: Replace BYT with ATOM

Rename symbol and function names starting with "BYT" or "byt" to
start with "ATOM" or "atom", respectively, so as to make it clear
that they may apply to Atom in general and not just to Baytrail
(the goal is to support several Atoms architectures eventually).

This should not lead to any functional changes.

Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
Signed-off-by: Stephane Gasparini <stephane.gasparini@linux.intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw : Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 6ee11e41 18-Nov-2015 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Revert "cpufreq: intel_pstate: Use ACPI perf configuration"

Revert commit 37afb0003242 (cpufreq: intel_pstate: Use ACPI perf
configuration) that is reported to cause a regression to happen
on a system where invalid data are returned by the ACPI _PSS object.

Since that commit makes assumptions regarding the _PSS output
correctness that may turn out to be overly optimistic in general,
there is a concern that it may introduce regression on more
systems, so it's better to revert it now and we'll revisit the
underlying issue in the next cycle with a more robust solution.

Conflicts:
drivers/cpufreq/intel_pstate.c

Fixes: 37afb0003242 (cpufreq: intel_pstate: Use ACPI perf configuration)
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 799281a3 18-Nov-2015 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Revert "cpufreq: intel_pstate: Avoid calculation for max/min"

Revert commit 4ef451487019 (cpufreq: intel_pstate: Avoid calculation for
max/min) as it depends on commit 37afb0003242 (cpufreq: intel_pstate: Use
ACPI perf configuration) that causes problems to happen and needs to be
reverted.

Conflicts:
drivers/cpufreq/intel_pstate.c

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 539342f6 22-Oct-2015 Prarit Bhargava <prarit@redhat.com>

intel_pstate: decrease number of "HWP enabled" messages

When booting an HWP enabled system the kernel displays one "HWP enabled"
message for each cpu. The messages are superfluous since HWP is globally
enabled across all CPUs. This patch also adds an informational message
when HWP is disabled via intel_pstate=no_hwp.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 51443fbf 15-Oct-2015 Prarit Bhargava <prarit@redhat.com>

cpufreq: intel_pstate: Fix intel_pstate powersave min_perf_pct value

On systems that initialize the intel_pstate driver with the performance
governor, and then switch to the powersave governor will not transition to
lower cpu frequencies until /sys/devices/system/cpu/intel_pstate/min_perf_pct
is set to a low value.

The behavior of governor switching changed after commit a04759924e25
("[cpufreq] intel_pstate: honor user space min_perf_pct override on
resume"). The commit introduced tracking of performance percentage
changes via sysfs in order to restore userspace changes during
suspend/resume. The problem occurs because the global values of the newly
introduced max_sysfs_pct and min_sysfs_pct are not lowered on the governor
change and this causes the powersave governor to inherit the performance
governor's settings.

A simple change would have been to reset max_sysfs_pct to 100 and
min_sysfs_pct to 0 on a governor change, which fixes the problem with
governor switching. However, since we cannot break userspace[1] the fix
is now to give each governor its own limits storage area so that governor
specific changes are tracked.

I successfully tested this by booting with both the performance governor
and the powersave governor by default, and switching between the two
governors (while monitoring /sys/devices/system/cpu/intel_pstate/ values,
and looking at the output of cpupower frequency-info). Suspend/Resume
testing was performed by Doug Smythies.

[1] Systems which suspend/resume using the unmaintained pm-utils package
will always transition to the performance governor before the suspend and
after the resume. This means a system using the powersave governor will
go from powersave to performance, then suspend/resume, performance to
powersave. The simple change during governor changes would have been
overwritten when the governor changed before and after the suspend/resume.
I have submitted https://bugzilla.redhat.com/show_bug.cgi?id=1271225
against Fedora to remove the 94cpufreq file that causes the problem. It
should be noted that pm-utils is obsoleted with newer versions of systemd.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 8e601a9f 15-Oct-2015 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Fix divide by zero on Knights Landing (KNL)

This is a workaround for KNL platform, where in some cases MPERF counter
will not have updated value before next read of MSR_IA32_MPERF. In this
case divide by zero will occur. This change ignores current sample for
busy calculation in this case.

Fixes: b34ef932d79a (intel_pstate: Knights Landing support)
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Cc: 4.1+ <stable@vger.kernel.org> # 4.1+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 4ef45148 14-Oct-2015 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Avoid calculation for max/min

When requested from cpufreq to set policy, look into _pss and get
control values, instead of using max/min perf calculations. These
calculation misses next control state in boundary conditions.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 37afb000 14-Oct-2015 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Use ACPI perf configuration

Use ACPI _PSS to limit the Intel P State turbo, max and min ratios.
This driver uses acpi processor perf lib calls to register performance.
The following logic is used to adjust Intel P state driver limits:
- If there is no turbo entry in _PSS, then disable Intel P state turbo
and limit to non turbo max
- If the non turbo max ratio is more than _PSS max non turbo value, then
set the max non turbo ratio to _PSS non turbo max
- If the min ratio is less than _PSS min then change the min ratio
matching _PSS min
- Scale the _PSS turbo frequency to max turbo frequency based on control
value.
This feature can be disabled by using kernel parameters:
intel_pstate=no_acpi

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 3bcc6fa9 14-Oct-2015 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel-pstate: Use separate max pstate for scaling

Systems with configurable TDP have multiple max non turbo p state. Intel
P state uses max non turbo P state for scaling. But using the real max
non turbo p state causes underestimation of next P state. So using
the physical max non turbo P state as before for scaling.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 6a35fc2d 14-Oct-2015 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: get P1 from TAR when available

After Ivybridge, the max non turbo ratio obtained from platform info msr
is not always guaranteed P1 on client platforms. The max non turbo
activation ratio (TAR), determines the max for the current level of TDP.
The ratio in platform info is physical max. The TAR MSR can be locked,
so updating this value is not possible on all platforms.
This change gets this ratio from MSR TURBO_ACTIVATION_RATIO if
available,
but also do some sanity checking to make sure that this value is
correct.
The sanity check involves reading the TDP ratio for the current tdp
control value when platform has configurable TDP present and matching
TAC
with this.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 74da56ce 09-Sep-2015 Kristen Carlson Accardi <kristen@linux.intel.com>

intel_pstate: fix PCT_TO_HWP macro

PCT_TO_HWP does not take the actual range of pstates exported
by HWP_CAPABILITIES in account, and is broken on most platforms.
Remove the macro and set the min and max pstate for hwp by
determining the range and adjusting by the min and max percent
limits values.

Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 43717aad 09-Sep-2015 Chen Yu <yu.c.chen@intel.com>

intel_pstate: Fix user input of min/max to legal policy region

In current code, max_perf_pct might be smaller than min_perf_pct
by improper user input:

$ grep . /sys/devices/system/cpu/intel_pstate/m*_perf_pct
/sys/devices/system/cpu/intel_pstate/max_perf_pct:100
/sys/devices/system/cpu/intel_pstate/min_perf_pct:100

$ echo 80 > /sys/devices/system/cpu/intel_pstate/max_perf_pct

$ grep . /sys/devices/system/cpu/intel_pstate/m*_perf_pct
/sys/devices/system/cpu/intel_pstate/max_perf_pct:80
/sys/devices/system/cpu/intel_pstate/min_perf_pct:100

Fix this problem by 2 steps:
1. Normalize the user input to [min_policy, max_policy].
2. Make sure max_perf_pct>=min_perf_pct, suggested by Seiichi Ikarashi.

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 5aecc3c8 04-Aug-2015 Ethan Zhao <ethan.zhao@oracle.com>

intel_pstate: append more Oracle OEM table id to vendor bypass list

Append more Oracle X86 servers that have their own power management,

SUN FIRE X4275 M3
SUN FIRE X4170 M3
and
SUN FIRE X6-2

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 1c939123 05-Aug-2015 Kristen Carlson Accardi <kristen@linux.intel.com>

intel_pstate: Add SKY-S support

Whitelist the SKL-S processor

Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 144c8e17 29-Jul-2015 Chen Yu <yu.c.chen@intel.com>

intel_pstate: Fix possible overflow complained by Coverity

Coverity scanning performed on intel_pstate.c shows possible
overflow when doing left shifting:
val = pstate << 8;
since pstate is of type integer, while val is of u64, left shifting
pstate might lead to potential loss of upper bits. Say, if pstate equals
0x4000 0000, after pstate << 8 we will get zero assigned to val.
Although pstate will not likely be that big, this patch cast the left
operand to u64 before performing the left shift, to avoid complaining
from Coverity.

Reported-by: Coquard, Christophe <christophe.coquard@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 69cefc27 21-Jul-2015 Lukasz Anaczkowski <lukasz.anaczkowski@intel.com>

intel_pstate: Add get_scaling cpu_defaults param to Knights Landing

Scaling for Knights Landing is same as the default scaling (100000).
When Knigts Landing support was added to the pstate driver, this
parameter was omitted resulting in a kernel panic during boot.

Fixes: b34ef932d79a (intel_pstate: Knights Landing support)
Reported-by: Yasuaki Ishimatsu <yishimat@redhat.com>
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Lukasz Anaczkowski <lukasz.anaczkowski@intel.com>
Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Cc: 4.1+ <stable@vger.kernel.org> # 4.1+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# ba88d433 14-Jul-2015 Kristen Carlson Accardi <kristen@linux.intel.com>

intel_pstate: enable HWP per CPU

HWP previously was only enabled at driver load time, on the boot
CPU, however, HWP must be enabled per package. Move the code to
enable HWP to the cpufreq driver init path so that it will be
called per CPU.

Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Tested-by: David Zhuang <david.zhuang@oracle.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 4ea1636b 25-Jun-2015 Andy Lutomirski <luto@kernel.org>

x86/asm/tsc: Rename native_read_tsc() to rdtsc()

Now that there is no paravirt TSC, the "native" is
inappropriate. The function does RDTSC, so give it the obvious
name: rdtsc().

Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm ML <kvm@vger.kernel.org>
Link: http://lkml.kernel.org/r/fd43e16281991f096c1e4d21574d9e1402c62d39.1434501121.git.luto@kernel.org
[ Ported it to v4.2-rc1. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 7180dddf 15-Jun-2015 Prarit Bhargava <prarit@redhat.com>

intel_pstate: Fix overflow in busy_scaled due to long delay

The kernel may delay interrupts for a long time which can result in timers
being delayed. If this occurs the intel_pstate driver will crash with a
divide by zero error:

divide error: 0000 [#1] SMP
Modules linked in: btrfs zlib_deflate raid6_pq xor msdos ext4 mbcache jbd2 binfmt_misc arc4 md4 nls_utf8 cifs dns_resolver tcp_lp bnep bluetooth rfkill fuse dm_service_time iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ftp ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables intel_powerclamp coretemp vfat fat kvm_intel iTCO_wdt iTCO_vendor_support ipmi_devintf sr_mod kvm crct10dif_pclmul
crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel cdc_ether lrw usbnet cdrom mii gf128mul glue_helper ablk_helper cryptd lpc_ich mfd_core pcspkr sb_edac edac_core ipmi_si ipmi_msghandler ioatdma wmi shpchp acpi_pad nfsd auth_rpcgss nfs_acl lockd uinput dm_multipath sunrpc xfs libcrc32c usb_storage sd_mod crc_t10dif crct10dif_common ixgbe mgag200 syscopyarea sysfillrect sysimgblt mdio drm_kms_helper ttm igb drm ptp pps_core dca i2c_algo_bit megaraid_sas i2c_core dm_mirror dm_region_hash dm_log dm_mod
CPU: 113 PID: 0 Comm: swapper/113 Tainted: G W -------------- 3.10.0-229.1.2.el7.x86_64 #1
Hardware name: IBM x3950 X6 -[3837AC2]-/00FN827, BIOS -[A8E112BUS-1.00]- 08/27/2014
task: ffff880fe8abe660 ti: ffff880fe8ae4000 task.ti: ffff880fe8ae4000
RIP: 0010:[<ffffffff814a9279>] [<ffffffff814a9279>] intel_pstate_timer_func+0x179/0x3d0
RSP: 0018:ffff883fff4e3db8 EFLAGS: 00010206
RAX: 0000000027100000 RBX: ffff883fe6965100 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000010 RDI: 000000002e53632d
RBP: ffff883fff4e3e20 R08: 000e6f69a5a125c0 R09: ffff883fe84ec001
R10: 0000000000000002 R11: 0000000000000005 R12: 00000000000049f5
R13: 0000000000271000 R14: 00000000000049f5 R15: 0000000000000246
FS: 0000000000000000(0000) GS:ffff883fff4e0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7668601000 CR3: 000000000190a000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
ffff883fff4e3e58 ffffffff81099dc1 0000000000000086 0000000000000071
ffff883fff4f3680 0000000000000071 fbdc8a965e33afee ffffffff810b69dd
ffff883fe84ec000 ffff883fe6965108 0000000000000100 ffffffff814a9100
Call Trace:
<IRQ>

[<ffffffff81099dc1>] ? run_posix_cpu_timers+0x51/0x840
[<ffffffff810b69dd>] ? trigger_load_balance+0x5d/0x200
[<ffffffff814a9100>] ? pid_param_set+0x130/0x130
[<ffffffff8107df56>] call_timer_fn+0x36/0x110
[<ffffffff814a9100>] ? pid_param_set+0x130/0x130
[<ffffffff8107fdcf>] run_timer_softirq+0x21f/0x320
[<ffffffff81077b2f>] __do_softirq+0xef/0x280
[<ffffffff816156dc>] call_softirq+0x1c/0x30
[<ffffffff81015d95>] do_softirq+0x65/0xa0
[<ffffffff81077ec5>] irq_exit+0x115/0x120
[<ffffffff81616355>] smp_apic_timer_interrupt+0x45/0x60
[<ffffffff81614a1d>] apic_timer_interrupt+0x6d/0x80
<EOI>

[<ffffffff814a9c32>] ? cpuidle_enter_state+0x52/0xc0
[<ffffffff814a9c28>] ? cpuidle_enter_state+0x48/0xc0
[<ffffffff814a9d65>] cpuidle_idle_call+0xc5/0x200
[<ffffffff8101d14e>] arch_cpu_idle+0xe/0x30
[<ffffffff810c67c1>] cpu_startup_entry+0xf1/0x290
[<ffffffff8104228a>] start_secondary+0x1ba/0x230
Code: 42 0f 00 45 89 e6 48 01 c2 43 8d 44 6d 00 39 d0 73 26 49 c1 e5 08 89 d2 4d 63 f4 49 63 c5 48 c1 e2 08 48 c1 e0 08 48 63 ca 48 99 <48> f7 f9 48 98 4c 0f af f0 49 c1 ee 08 8b 43 78 c1 e0 08 44 29
RIP [<ffffffff814a9279>] intel_pstate_timer_func+0x179/0x3d0
RSP <ffff883fff4e3db8>

The kernel values for cpudata for CPU 113 were:

struct cpudata {
cpu = 113,
timer = {
entry = {
next = 0x0,
prev = 0xdead000000200200
},
expires = 8357799745,
base = 0xffff883fe84ec001,
function = 0xffffffff814a9100 <intel_pstate_timer_func>,
data = 18446612406765768960,
<snip>
i_gain = 0,
d_gain = 0,
deadband = 0,
last_err = 22489
},
last_sample_time = {
tv64 = 4063132438017305
},
prev_aperf = 287326796397463,
prev_mperf = 251427432090198,
sample = {
core_pct_busy = 23081,
aperf = 2937407,
mperf = 3257884,
freq = 2524484,
time = {
tv64 = 4063149215234118
}
}
}

which results in the time between samples = last_sample_time - sample.time
= 4063149215234118 - 4063132438017305 = 16777216813 which is 16.777 seconds.

The duration between reads of the APERF and MPERF registers overflowed a s32
sized integer in intel_pstate_get_scaled_busy()'s call to div_fp(). The result
is that int_tofp(duration_us) == 0, and the kernel attempts to divide by 0.

While the kernel shouldn't be delaying for a long time, it can and does
happen and the intel_pstate driver should not panic in this situation. This
patch changes the div_fp() function to use div64_s64() to allow for "long"
division. This will avoid the overflow condition on long delays.

[v2]: use div64_s64() in div_fp()

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 6c1e4591 01-Jun-2015 Doug Smythies <doug.smythies@gmail.com>

intel_pstate: Force setting target pstate when required

During initialization and exit it is possible that the target pstate
might not actually be set. Furthermore, the result can be that the
driver and the processor are out of synch and, under some conditions,
the driver might never send the processor the proper target pstate.

This patch adds a bypass or do_checks flag to the call to
intel_pstate_set_pstate. If bypass, then specifically bypass clamp
checks and the do not send if it is the same as last time check. If
do_checks, then, and as before, do the current policy clamp checks,
and do not do actual send if the new target is the same as the old.

Signed-off-by: Doug Smythies <dsmythies@telus.net>
Reported-by: Marien Zwart <marien.zwart@gmail.com>
Reported-by: Alex Lochmann <alexander.lochmann@tu-dortmund.de>
Reported-by: Piotr Ko?aczkowski <pkolaczk@gmail.com>
Reported-by: Clemens Eisserer <linuxhippy@gmail.com>
Tested-by: Marien Zwart <marien.zwart@gmail.com>
Tested-by: Doug Smythies <dsmythies@telus.net>
[ rjw: Dropped pointless symbol definitions, rebased ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# f16255eb 31-May-2015 Doug Smythies <doug.smythies@gmail.com>

intel_pstate: change some inconsistent debug information

Commit ce717613f3fb (intel_pstate: Turn per cpu printk into pr_debug)
turned per cpu printk into pr_debug. However, only half of the change
was done, introducing an inconsistency between entry and exit from
driver pstate control. This patch changes the exit message to pr_debug
also.

The various messages are inconsistent with respect to any identifier
text that can be used to help isolate the desired information from a
huge log. This patch makes a consistent identifier portion of the
string.

Amends: ce717613f3fb (intel_pstate: Turn per cpu printk into pr_debug)
Signed-off-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# d6472302 02-Jun-2015 Stephen Rothwell <sfr@canb.auug.org.au>

x86/mm: Decouple <linux/vmalloc.h> from <asm/io.h>

Nothing in <asm/io.h> uses anything from <linux/vmalloc.h>, so
remove it from there and fix up the resulting build problems
triggered on x86 {64|32}-bit {def|allmod|allno}configs.

The breakages were triggering in places where x86 builds relied
on vmalloc() facilities but did not include <linux/vmalloc.h>
explicitly and relied on the implicit inclusion via <asm/io.h>.

Also add:

- <linux/init.h> to <linux/io.h>
- <asm/pgtable_types> to <asm/io.h>

... which were two other implicit header file dependencies.

Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
[ Tidied up the changelog. ]
Acked-by: David Miller <davem@davemloft.net>
Acked-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Vinod Koul <vinod.koul@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Colin Cross <ccross@android.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: James E.J. Bottomley <JBottomley@odin.com>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Kristen Carlson Accardi <kristen@linux.intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Suma Ramars <sramars@cisco.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 0dd23f94 12-May-2015 Joe Konno <joe.konno@intel.com>

intel_pstate: set BYT MSR with wrmsrl_on_cpu()

Commit 007bea098b86 (intel_pstate: Add setting voltage value for
baytrail P states.) introduced byt_set_pstate() with the assumption that
it would always be run by the CPU whose MSR is to be written by it. It
turns out, however, that is not always the case in practice, so modify
byt_set_pstate() to enforce the MSR write done by it to always happen on
the right CPU.

Fixes: 007bea098b86 (intel_pstate: Add setting voltage value for baytrail P states.)
Signed-off-by: Joe Konno <joe.konno@intel.com>
Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Cc: 3.14+ <stable@vger.kernel.org> # 3.14+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 4055fad3 11-Apr-2015 Doug Smythies <doug.smythies@gmail.com>

intel_pstate: Add tsc collection and keep previous target pstate

The intel_pstate driver is difficult to debug and investigate without tsc.

Also, it is likely use of tsc, and some version of C0 percentage,
will be re-introdcued in futute.

There have also been occasions where it is desirebale to know, and
confirm, the previous target pstate.

This patch brings back tsc, adds previous target pstate,
and adds both to the trace data collection.

Signed-off-by: Doug Smythies <dsmythies@telus.net>
Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 64df1fdf 03-Apr-2015 Borislav Petkov <bp@suse.de>

cpufreq: intel_pstate: Fix an annoying !CONFIG_SMP warning

I keep seeing

drivers/cpufreq/intel_pstate.c: In function ‘intel_pstate_init’:
drivers/cpufreq/intel_pstate.c:1187:26: warning: initialization from incompatible pointer type
struct cpuinfo_x86 *c = &boot_cpu_data;

when doing randconfig builds.

This is caused by the fact that when !CONFIG_SMP, asm/processor.h
defines cpu_info to boot_cpu_data and the local variable

struct cpu_defaults *cpu_info

overshadows it leading to this unfortunate assignment in the
preprocessed source:

struct cpu_defaults *boot_cpu_data;
struct cpuinfo_x86 *c = &boot_cpu_data;

Rename the local variable and use static_cpu_has_safe() which alleviates
the need for defining a local cpuinfo_x86 pointer.

Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 6a82ba6d 10-Apr-2015 Kristen Carlson Accardi <kristen@linux.intel.com>

intel_pstate: Change the setpoint for Atom params

Change the setpoint for the Baytrail and Cherrytrail CPUs. This
will cause more aggressive pstate selection and improves
performance on a variety of workloads with little power penalty.

Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# b34ef932 10-Apr-2015 Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>

intel_pstate: Knights Landing support

1. Add Knights Landing (KNL) CPUID to the list of CPUIDs supported by
the intel_pstate driver.

2. Add a new cpu_default structure for KNL since KNL has a slightly
different mechanism to get turbo pstates from MSRs.

Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
[ rjw: Subject and changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 5f97899d 10-Apr-2015 Kristen Carlson Accardi <kristen@linux.intel.com>

intel_pstate: remove MSR test

x86_match_cpu will not match our cpuid unless APERF/MPERF flag is
set, so there is no need to do the manual check for this MSR.

Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# d64c3b0b 06-Feb-2015 Kristen Carlson Accardi <kristen@linux.intel.com>

intel_pstate: provide option to only use intel_pstate with HWP

Allow users the option to disable the driver for any hardware
which does not support HWP.

Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# a0475992 29-Jan-2015 Kristen Carlson Accardi <kristen@linux.intel.com>

intel_pstate: honor user space min_perf_pct override on resume

If the user has requested an override of the min_perf_pct via
sysfs, then it should be restored whenever policy is updated,
such as on resume. Take the max of whatever the user requested
and whatever the policy is.

Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 630ec286 29-Jan-2015 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

intel_pstate: respect cpufreq policy request

When thermal or other subsystem requests to change the policy,
use that irrepective of whether cpufreq policy is PERFORMANCE or
not. Without this change, when thermal subsystem passive policy wants
to reduce performance, it still runs at 100%.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 0522424e 28-Jan-2015 Kristen Carlson Accardi <kristen@linux.intel.com>

intel_pstate: Add num_pstates to sysfs

Add a sysfs interface to display the total number of supported
pstates. This value is independent of whether turbo has been
enabled or disabled.

Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# d01b1f48 28-Jan-2015 Kristen Carlson Accardi <kristen@linux.intel.com>

intel_pstate: expose turbo range to sysfs

This patch adds "turbo_pct" to the intel_pstate sysfs interface.
turbo_pct will display the percentage of the total supported
pstates that are in the turbo range. This value is independent
of whether turbo has been disabled or not.

Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 7ab0256e 28-Jan-2015 Kristen Carlson Accardi <kristen@linux.intel.com>

intel_pstate: Add support for SkyLake

Add SKL cpuid.

Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# e0d4c8f8 10-Dec-2014 Kristen Carlson Accardi <kristen@linux.intel.com>

intel_pstate: Add a few comments

Add a few comments in the code which calculates busyness to
clarify parts of the algorithm.

Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# aa4ea34d 08-Dec-2014 Ethan Zhao <ethan.zhao@oracle.com>

intel_pstate: add kernel parameter to force loading

To force loading on Oracle Sun X86 servers, provide one kernel command line
parameter

intel_pstate = force

For those who are aware of the risk of no power capping capabily working
and try to get better performance with this driver.

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Tested-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Reviewed-by: Linda Knippers <linda.knippers@hp.com>
Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 966916ea 30-Nov-2014 ethan zhao <ethan.zhao@oracle.com>

intel_pstate: skip this driver if Sun server has _PPC method

Oracle Sun X86 servers have dynamic power capping capability that works via
ACPI _PPC method etc, so skip loading this driver if Sun server has ACPI _PPC
enabled.

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Tested-by: Linda Knippers <linda.knippers@hp.com>
Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 43f8a966 06-Nov-2014 Dirk Brandewie <dirk.j.brandewie@intel.com>

intel_pstate: Add CPUID for BDW-H CPU

Add BDW-H to the list of supported processors.

Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 2f86dc4c 06-Nov-2014 Dirk Brandewie <dirk.j.brandewie@intel.com>

intel_pstate: Add support for HWP

Add support of Hardware Managed Performance States (HWP) described in Volume 3
section 14.4 of the SDM.

With HWP enbaled intel_pstate will no longer be responsible for selecting P
states for the processor. intel_pstate will continue to register to
the cpufreq core as the scaling driver for CPUs implementing
HWP. In HWP mode intel_pstate provides three functions reporting
frequency to the cpufreq core, support for the set_policy() interface
from the core and maintaining the intel_pstate sysfs interface in
/sys/devices/system/cpu/intel_pstate. User preferences expressed via
the set_policy() interface or the sysfs interface are forwared to the
CPU via the HWP MSR interface.

Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# d022a65e 13-Oct-2014 Dirk Brandewie <dirk.j.brandewie@intel.com>

intel_pstate: Correct BYT VID values.

Using a VID value that is not high enough for the requested P state can
cause machine checks. Add a ceiling function to ensure calulated VIDs
with fractional values are set to the next highest integer VID value.

The algorythm for calculating the non-trubo VID from the BIOS writers
guide is:
vid_ratio = (vid_max - vid_min) / (max_pstate - min_pstate)
vid = ceiling(vid_min + (req_pstate - min_pstate) * vid_ratio)

Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# b27580b0 13-Oct-2014 Dirk Brandewie <dirk.j.brandewie@intel.com>

intel_pstate: Fix BYT frequency reporting

BYT has a different conversion from P state to frequency than the core
processors. This causes the min/max and current frequency to be
misreported on some BYT SKUs. Tested on BYT N2820, Ivybridge and
Haswell processors.

Link: https://bugzilla.yoctoproject.org/show_bug.cgi?id=6663
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# c0348717 13-Oct-2014 Dirk Brandewie <dirk.j.brandewie@intel.com>

intel_pstate: Don't lose sysfs settings during cpu offline

The user may have custom settings don't destroy them during suspend.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=80651
Reported-by: Tobias Jakobi <liquid.acid@gmx.net>
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 4521e1a0 13-Oct-2014 Gabriele Mazzotta <gabriele.mzt@gmail.com>

cpufreq: intel_pstate: Reflect current no_turbo state correctly

Some BIOSes modify the state of MSR_IA32_MISC_ENABLE_TURBO_DISABLE
based on the current power source for the system battery AC vs
battery. Reflect the correct current state and ability to modify the
no_turbo sysfs file based on current state of
MSR_IA32_MISC_ENABLE_TURBO_DISABLE.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=83151
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Gabriele Mazzotta <gabriele.mzt@gmail.com>
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 36b4bed5 15-Oct-2014 Pali Rohár <pali@kernel.org>

cpufreq: intel_pstate: Fix setting max_perf_pct in performance policy

Code which changes policy to powersave changes also max_policy_pct based on
max_freq. Code which change max_perf_pct has upper limit base on value
max_policy_pct. When policy is changing from powersave back to performance
then max_policy_pct is not changed. Which means that changing max_perf_pct is
not possible to high values if max_freq was too low in powersave policy.

Test case:

$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
800000
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
3300000
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
performance
$ cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
100

$ echo powersave > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
$ echo 800000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
$ echo 20 > /sys/devices/system/cpu/intel_pstate/max_perf_pct

$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
powersave
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
800000
$ cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
20

$ echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
$ echo 3300000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
$ echo 100 > /sys/devices/system/cpu/intel_pstate/max_perf_pct

$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
performance
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
3300000
$ cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
24

And now intel_pstate driver allows to set maximal value for max_perf_pct based
on max_policy_pct which is 24 for previous powersave max_freq 800000.

This patch will set default value for max_policy_pct when setting policy to
performance so it will allow to set also max value for max_perf_pct.

Signed-off-by: Pali Rohár <pali.rohar@gmail.com>
Cc: All applicable <stable@vger.kernel.org>
Acked-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 73f1ae8a 01-Sep-2014 Gabriele Mazzotta <gabriele.mzt@gmail.com>

cpufreq: intel_pstate: Remove unneeded variable

It should have been removed with commit d1b6848590af
("cpufreq / intel_pstate: Optimize intel_pstate_set_policy")

Signed-off-by: Gabriele Mazzotta <gabriele.mzt@gmail.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 16405f98 22-Aug-2014 Mika Westerberg <mika.westerberg@linux.intel.com>

cpufreq: intel_pstate: Add CPU ID for Braswell processor

This is pretty much the same as Intel Baytrail, only the CPU ID is
different. Add the new ID to the supported CPU list.

Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# ce717613 27-Aug-2014 Andi Kleen <ak@linux.intel.com>

intel_pstate: Turn per cpu printk into pr_debug

On larger systems intel_pstate currently spams the boot up
log with its "Intel pstate controlling ..." message for each CPU.
It's the only subsystem that prints a message for each
CPU.

Turn the message into a pr_debug.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 78e27086 18-Jul-2014 Stratos Karafotis <stratosk@semaphore.gr>

cpufreq: intel_pstate: Remove core_pct rounding

The specific rounding adds conditionally only 1/256 to fractional
part of core_pct.

We can safely remove it without any noticeable impact in
calculations.

Use div64_u64 instead of div_u64 to avoid possible overflow of
sample->mperf as divisor

Signed-off-by: Stratos Karafotis <stratosk@semaphore.gr>
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 4b707c89 18-Jul-2014 Stratos Karafotis <stratosk@semaphore.gr>

cpufreq: intel_pstate: Simplify P state adjustment logic.

Simplify the code by removing the inline functions pstate_increase and
pstate_decrease and use directly the intel_pstate_set_pstate.

Signed-off-by: Stratos Karafotis <stratosk@semaphore.gr>
Acked-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# ac658131 18-Jul-2014 Stratos Karafotis <stratosk@semaphore.gr>

cpufreq: intel_pstate: Keep values in aperf/mperf in full precision

Currently we shift right aperf and mperf variables by FRAC_BITS
to prevent overflow when we convert them to fix point numbers
(shift left by FRAC_BITS).

But this is not necessary, because we actually use delta aperf and mperf
which are much less than APERF and MPERF values.

So, use the unmodified APERF and MPERF values in calculation.
This also adds 8 bits in precision, although the gain is insignificant.

Signed-off-by: Stratos Karafotis <stratosk@semaphore.gr>
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 4ab60c3f 18-Jul-2014 Stratos Karafotis <stratosk@semaphore.gr>

cpufreq: intel_pstate: Disable interrupts during MSRs reading

According to Intel 64 and IA-32 Architectures SDM, Volume 3,
Chapter 14.2, "Software needs to exercise care to avoid delays
between the two RDMSRs (for example interrupts)".

So, disable interrupts during reading MSRs IA32_APERF and IA32_MPERF.
This should increase the accuracy of the calculations.

Signed-off-by: Stratos Karafotis <stratosk@semaphore.gr>
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# c410833a 18-Jul-2014 Stratos Karafotis <stratosk@semaphore.gr>

cpufreq: intel_pstate: Align multiple lines to open parenthesis

Suppress checkpatch.pl --strict warnings:
CHECK: Alignment should match open parenthesis

Signed-off-by: Stratos Karafotis <stratosk@semaphore.gr>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# abf013bf 18-Jul-2014 Stratos Karafotis <stratosk@semaphore.gr>

cpufreq: intel_pstate: Remove unnecessary intermediate variable sample_time

Remove the unnecessary intermediate assignment and use directly the
pid_params.sample_rate_ms variable.

Signed-off-by: Stratos Karafotis <stratosk@semaphore.gr>
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 285cb990 18-Jul-2014 Stratos Karafotis <stratosk@semaphore.gr>

cpufreq: intel_pstate: Cleanup parentheses

Remove unnecessary parentheses.
Also, add parentheses in one case for better readability.

Signed-off-by: Stratos Karafotis <stratosk@semaphore.gr>
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 2d8d1f18 18-Jul-2014 Stratos Karafotis <stratosk@semaphore.gr>

cpufreq: intel_pstate: Fit code in a single line where possible

We can fit these lines in a single one, under the 80 characters
limit.

Signed-off-by: Stratos Karafotis <stratosk@semaphore.gr>
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 845c1cbe 18-Jul-2014 Stratos Karafotis <stratosk@semaphore.gr>

cpufreq: intel_pstate: Add missing blank lines after declarations

Also, remove unnecessary blank lines.

Signed-off-by: Stratos Karafotis <stratosk@semaphore.gr>
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# fa30dff9 18-Jul-2014 Stratos Karafotis <stratosk@semaphore.gr>

cpufreq: intel_pstate: Remove unnecessary type casting in div_s64() call

div_s64() accepts the divisor parameter as s32. Helper div_fp()
also accepts divisor as int32_t.

So, remove the unnecessary int64_t type casting.

Signed-off-by: Stratos Karafotis <stratosk@semaphore.gr>
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 317dd50e 18-Jul-2014 Stratos Karafotis <stratosk@semaphore.gr>

cpufreq: intel_pstate: Make intel_pstate_kobject and debugfs_parent locals

Since we never remove sysfs entry and debugfs files, we can make
the intel_pstate_kobject and debugfs_parent locals.

Also, annotate with __init intel_pstate_sysfs_expose_params()
and intel_pstate_debug_expose_params() in order to be freed
after bootstrap.

Signed-off-by: Stratos Karafotis <stratosk@semaphore.gr>
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 179e8471 04-Jul-2014 Vincent Minet <vincent@vincent-minet.net>

intel_pstate: Set CPU number before accessing MSRs

Ensure that cpu->cpu is set before writing MSR_IA32_PERF_CTL during CPU
initialization. Otherwise only cpu0 has its P-state set and all other
cores are left with their values unchanged.

In most cases, this is not too serious because the P-states will be set
correctly when the timer function is run. But when the default governor
is set to performance, the per-CPU current_pstate stays the same forever
and no attempts are made to write the MSRs again.

Signed-off-by: Vincent Minet <vincent@vincent-minet.net>
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# dd5fbf70 20-Jun-2014 Dirk Brandewie <dirk.j.brandewie@intel.com>

intel_pstate: don't touch turbo bit if turbo disabled or unavailable.

If turbo is disabled in the BIOS bit 38 should be set in
MSR_IA32_MISC_ENABLE register per section 14.3.2.1 of the SDM Vol 3
document 325384-050US Feb 2014. If this bit is set do *not* attempt
to disable trubo via the MSR_IA32_PERF_CTL register. On some systems
trying to disable turbo via MSR_IA32_PERF_CTL will cause subsequent
writes to MSR_IA32_PERF_CTL not take affect, in fact reading
MSR_IA32_PERF_CTL will not show the IDA/Turbo DISENGAGE bit(32) as
set. A write of bit 32 to zero returns to normal operation.

Also deal with the case where the processor does not support
turbo and the BIOS does not report the fact in MSR_IA32_MISC_ENABLE
but does report the max and turbo P states as the same value.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=64251
Cc: 3.13+ <stable@vger.kernel.org> # 3.13+
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# c16ed060 20-Jun-2014 Dirk Brandewie <dirk.j.brandewie@intel.com>

intel_pstate: Fix setting VID

Commit 21855ff5 (intel_pstate: Set turbo VID for BayTrail) introduced
setting the turbo VID which is required to prevent a machine check on
some Baytrail SKUs under heavy graphics based workloads. The
docmumentation update that brought the requirement to light also
changed the bit mask used for enumerating P state and VID values from
0x7f to 0x3f.

This change returns the mask value to 0x7f.

Tested with the Intel NUC DN2820FYK,
BIOS version FYBYT10H.86A.0034.2014.0513.1413 with v3.16-rc1 and
v3.14.8 kernel versions.

Fixes: 21855ff5 (intel_pstate: Set turbo VID for BayTrail)
Link: https://bugzilla.kernel.org/show_bug.cgi?id=77951
Reported-and-tested-by: Rune Reterson <rune@megahurts.dk>
Reported-and-tested-by: Eric Eickmeyer <erich@ericheickmeyer.com>
Cc: 3.13+ <stable@vger.kernel.org> # 3.13+
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 51d211e9 17-Jun-2014 Doug Smythies <doug.smythies@gmail.com>

intel_pstate: Correct rounding in busy calculation

There was a mistake in the actual rounding portion this previous patch:
f0fe3cd7e12d (intel_pstate: Correct rounding in busy calculation) such that
the rounding was asymetric and incorrect.

Severity: Not very serious, but can increase target pstate by one extra value.
For real world work flows the issue should self correct (but I have no proof).
It is the equivalent of different PID gains for positive and negative numbers.

Examples:
-3.000000 used to round to -4, rounds to -3 with this patch.
-3.503906 used to round to -5, rounds to -4 with this patch.

Fixes: f0fe3cd7e12d (intel_pstate: Correct rounding in busy calculation)
Signed-off-by: Doug Smythies <dsmythies@telus.net>
Cc: 3.14+ <stable@vger.kernel.org> # 3.14+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 830bcac4 09-Jun-2014 Stratos Karafotis <stratosk@semaphore.gr>

cpufreq: intel_pstate: Remove duplicate CPU ID check

We check the CPU ID during driver init. There is no need
to do it again per logical CPU initialization.

So, remove the duplicate check.

Signed-off-by: Stratos Karafotis <stratosk@semaphore.gr>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# bf810222 30-May-2014 Doug Smythies <dsmythies@telus.net>

intel_pstate: Improve initial busy calculation

This change makes the busy calculation using 64 bit math which prevents
overflow for large values of aperf/mperf.

Cc: 3.14+ <stable@vger.kernel.org> # 3.14+
Signed-off-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# c4ee841f 29-May-2014 Dirk Brandewie <dirk.j.brandewie@intel.com>

intel_pstate: add sample time scaling

The PID assumes that samples are of equal time, which for a deferable
timers this is not true when the system goes idle. This causes the
PID to take a long time to converge to the min P state and depending
on the pattern of the idle load can make the P state appear stuck.

The hold-off value of three sample times before using the scaling is
to give a grace period for applications that have high performance
requirements and spend a lot of time idle, The poster child for this
behavior is the ffmpeg benchmark in the Phoronix test suite.

Cc: 3.14+ <stable@vger.kernel.org> # 3.14+
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# f0fe3cd7 29-May-2014 Dirk Brandewie <dirk.j.brandewie@intel.com>

intel_pstate: Correct rounding in busy calculation

Changing to fixed point math throughout the busy calculation in
commit e66c1768 (Change busy calculation to use fixed point
math.) Introduced some inaccuracies by rounding the busy value at two
points in the calculation. This change removes roundings and moves
the rounding to the output of the PID where the calculations are
complete and the value returned as an integer.

Fixes: e66c17683746 (intel_pstate: Change busy calculation to use fixed point math.)
Reported-by: Doug Smythies <dsmythies@telus.net>
Cc: 3.14+ <stable@vger.kernel.org> # 3.14+
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# adacdf3f 29-May-2014 Dirk Brandewie <dirk.j.brandewie@intel.com>

intel_pstate: Remove C0 tracking

Commit fcb6a15c (intel_pstate: Take core C0 time into account for core
busy calculation) introduced a regression referenced below. The issue
with "lockup" after suspend that this commit was addressing is now dealt
with in the suspend path.

Fixes: fcb6a15c2e7e (intel_pstate: Take core C0 time into account for core busy calculation)
Link: https://bugzilla.kernel.org/show_bug.cgi?id=66581
Link: https://bugzilla.kernel.org/show_bug.cgi?id=75121
Reported-by: Doug Smythies <dsmythies@telus.net>
Cc: 3.14+ <stable@vger.kernel.org> # 3.14+
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 94f89e07 20-May-2014 Stratos Karafotis <stratosk@semaphore.gr>

cpufreq: intel_pstate: Remove unused member name of cpudata

Although, a value is assigned to member name of struct cpudata,
it is never used.

We can safely remove it.

Signed-off-by: Stratos Karafotis <stratosk@semaphore.gr>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# d40a63c4 08-May-2014 Dirk Brandewie <dirk.j.brandewie@intel.com>

intel_pstate: remove setting P state to MAX on init

Setting the P state of the core to max at init time is a hold over
from early implementation of intel_pstate where intel_pstate disabled
cpufreq and loaded VERY early in the boot sequence. This was to
ensure that intel_pstate did not affect boot time. This in not needed
now that intel_pstate is a cpufreq driver.

Removing this covers the case where a CPU has gone through a manual
CPU offline/online cycle and the P state is set to MAX on init and the
CPU immediately goes idle. Due to HW coordination the P state request
on the idle CPU will drag all cores to MAX P state until the load is
reevaluated when to core goes non-idle.

Reported-by: Patrick Marlier <patrick.marlier@gmail.com>
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Cc: 3.14+ <stable@vger.kernel.org> # 3.14+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# c7e241df 08-May-2014 Dirk Brandewie <dirk.j.brandewie@intel.com>

intel_pstate: Add CPU IDs for Broadwell processors

Add support for Broadwell processors.

Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 21855ff5 08-May-2014 Dirk Brandewie <dirk.j.brandewie@intel.com>

intel_pstate: Set turbo VID for BayTrail

A documentation update exposed that there is a separate set of VID
values that must be used in the turbo/boost P state range. Add
enumerating and setting the correct VID for P states in the turbo
range.

Cc: v3.13+ <stable@vger.kernel.org> # v3.13+
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 6b17ddb2 29-Apr-2014 Stratos Karafotis <stratosk@semaphore.gr>

intel_pstate: Remove sample parameter in intel_pstate_calc_busy

Since commit d37e2b7644 ("intel_pstate: remove unneeded sample buffers")
we use only one sample. So, there is no need to pass the sample
pointer to intel_pstate_calc_busy. Instead, get the pointer from
cpudata. Also, remove the unused SAMPLE_COUNT macro.

While at it, reformat the first line in this function.

Signed-off-by: Stratos Karafotis <stratosk@semaphore.gr>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# c2294a2f 24-Mar-2014 Dirk Brandewie <dirk.j.brandewie@intel.com>

intel_pstate: Use del_timer_sync in intel_pstate_cpu_stop

Ensure that no timer callback is running since we are about to free
the timer structure. We cannot guarantee that the call back is called
on the CPU where the timer is running.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# bb18008f 19-Mar-2014 Dirk Brandewie <dirk.j.brandewie@intel.com>

intel_pstate: Set core to min P state during core offline

Change to use the new ->stop_cpu() callback to do clean up during CPU
hotplug. The requested P state for an offline core will be used by the
hardware coordination function to select the package P state. If the
core is under load when it is offlined it will fix the package P state
floor to the requested P state of offline core.

Reported-by: Patrick Marlier <patrick.marlier@gmail.com>
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# d98d099b 12-Feb-2014 Dirk Brandewie <dirk.j.brandewie@intel.com>

intel_pstate: fix pid_reset to use fixed point values

commit d253d2a526 (Improve accuracy by not truncating until final
result), changed internal variables of the PID to be fixed point
numbers. Update the pid_reset() to reflect this change.

Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# d37e2b76 12-Feb-2014 Dirk Brandewie <dirk.j.brandewie@intel.com>

intel_pstate: remove unneeded sample buffers

Remove unneeded sample buffers, intel_pstate operates on the most
recent sample only. This save some memory and make the code more
readable.

Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# e66c1768 25-Feb-2014 Dirk Brandewie <dirk.j.brandewie@intel.com>

intel_pstate: Change busy calculation to use fixed point math.

Commit fcb6a15c2e (intel_pstate: Take core C0 time into account for
core busy calculation) introduced a regression on some processor SKUs
supported by intel_pstate. This was due to the truncation caused by
using integer math to calculate core busy and C0 percentages.

On a i7-4770K processor operating at 800Mhz going to 100% utilization
the percent busy of the CPU using integer math is 22%, but it actually
is 22.85%. This value scaled to the current frequency returned 97
which the PID interpreted as no error and did not adjust the P state.

Tested on i7-4770K, i7-2600, i5-3230M.

Fixes: fcb6a15c2e7e (intel_pstate: Take core C0 time into account for core busy calculation)
References: https://lkml.org/lkml/2014/2/19/626
References: https://bugzilla.kernel.org/show_bug.cgi?id=70941
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 61d8d2ab 12-Feb-2014 Dirk Brandewie <dirk.j.brandewie@intel.com>

intel_pstate: Add support for Baytrail turbo P states

A documentation update exposed the existance of the turbo ratio
register. Update baytrail support to use the turbo range.

Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Cc: 3.13+ <stable@vger.kernel.org> # 3.13+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 4042e757 12-Feb-2014 Dirk Brandewie <dirk.j.brandewie@intel.com>

intel_pstate: Use LFM bus ratio as min ratio/P state

LFM (max efficiency ratio) is the max frequency at minimum voltage
supported by the processor. Using LFM as the minimum P state
increases performmance without affecting power. By not using P states
below LFM we avoid using P states that are less power efficient.

Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Cc: 3.13+ <stable@vger.kernel.org> # 3.13+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 709c078e 12-Feb-2014 Dirk Brandewie <dirk.j.brandewie@intel.com>

intel_pstate: Remove energy reporting from pstate_sample tracepoint

Remove the reporting of energy since it does not provide any useful
information about the state of the driver and will be a maintainance
headache going forward since the RAPL energy units register is not
architectural and subject to change between micro-architectures

References: https://bugzilla.kernel.org/show_bug.cgi?id=69831
Fixes: b69880f9ccf7 (intel_pstate: Add trace point to report internal state.)
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# fcb6a15c 03-Feb-2014 Dirk Brandewie <dirk.j.brandewie@intel.com>

intel_pstate: Take core C0 time into account for core busy calculation

Take non-idle time into account when calculating core busy time.
This ensures that intel_pstate will notice a decrease in load.

References: https://bugzilla.kernel.org/show_bug.cgi?id=66581
Cc: 3.10+ <stable@vger.kernel.org> # 3.10+
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# b69880f9 16-Jan-2014 Dirk Brandewie <dirk.j.brandewie@intel.com>

intel_pstate: Add trace point to report internal state.

Add perf trace event "power:pstate_sample" to report driver state to
aid in diagnosing issues reported against intel_pstate.

Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 6cbd7ee1 06-Jan-2014 Dirk Brandewie <dirk.j.brandewie@intel.com>

intel_pstate: Add X86_FEATURE_APERFMPERF to cpu match parameters.

KVM environments do not support APERF/MPERF MSRs. intel_pstate cannot
operate without these registers.

The previous validity checks in intel_pstate_msrs_not_valid() are
insufficent in nested KVMs.

References: https://bugzilla.redhat.com/show_bug.cgi?id=1046317
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 98a947ab 31-Dec-2013 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

intel_pstate: Fail initialization if P-state information is missing

If pstate.current_pstate is 0 after the initial
intel_pstate_get_cpu_pstates(), this means that we were unable to
obtain any useful P-state information and there is no reason to
continue, so free memory and return an error in that case.

This fixes the following divide error occuring in a nested KVM
guest:

Intel P-state driver initializing.
Intel pstate controlling: cpu 0
cpufreq: __cpufreq_add_dev: ->get() failed
divide error: 0000 [#1] SMP
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-0.rc4.git5.1.fc21.x86_64 #1
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff88001ea20000 ti: ffff88001e9bc000 task.ti: ffff88001e9bc000
RIP: 0010:[<ffffffff815c551d>] [<ffffffff815c551d>] intel_pstate_timer_func+0x11d/0x2b0
RSP: 0000:ffff88001ee03e18 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88001a454348 RCX: 0000000000006100
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88001ee03e38 R08: 0000000000000000 R09: 0000000000000000
R10: ffff88001ea20000 R11: 0000000000000000 R12: 00000c0a1ea20000
R13: 1ea200001ea20000 R14: ffffffff815c5400 R15: ffff88001a454348
FS: 0000000000000000(0000) GS:ffff88001ee00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000001c0c000 CR4: 00000000000006f0
Stack:
fffffffb1a454390 ffffffff821a4500 ffff88001a454390 0000000000000100
ffff88001ee03ea8 ffffffff81083e9a ffffffff81083e15 ffffffff82d5ed40
ffffffff8258cc60 0000000000000000 ffffffff81ac39de 0000000000000000
Call Trace:
<IRQ>
[<ffffffff81083e9a>] call_timer_fn+0x8a/0x310
[<ffffffff81083e15>] ? call_timer_fn+0x5/0x310
[<ffffffff815c5400>] ? pid_param_set+0x130/0x130
[<ffffffff81084354>] run_timer_softirq+0x234/0x380
[<ffffffff8107aee4>] __do_softirq+0x104/0x430
[<ffffffff8107b5fd>] irq_exit+0xcd/0xe0
[<ffffffff81770645>] smp_apic_timer_interrupt+0x45/0x60
[<ffffffff8176efb2>] apic_timer_interrupt+0x72/0x80
<EOI>
[<ffffffff810e15cd>] ? vprintk_emit+0x1dd/0x5e0
[<ffffffff81757719>] printk+0x67/0x69
[<ffffffff815c1493>] __cpufreq_add_dev.isra.13+0x883/0x8d0
[<ffffffff815c14f0>] cpufreq_add_dev+0x10/0x20
[<ffffffff814a14d1>] subsys_interface_register+0xb1/0xf0
[<ffffffff815bf5cf>] cpufreq_register_driver+0x9f/0x210
[<ffffffff81fb19af>] intel_pstate_init+0x27d/0x3be
[<ffffffff81761e3e>] ? mutex_unlock+0xe/0x10
[<ffffffff81fb1732>] ? cpufreq_gov_dbs_init+0x12/0x12
[<ffffffff8100214a>] do_one_initcall+0xfa/0x1b0
[<ffffffff8109dbf5>] ? parse_args+0x225/0x3f0
[<ffffffff81f64193>] kernel_init_freeable+0x1fc/0x287
[<ffffffff81f638d0>] ? do_early_param+0x88/0x88
[<ffffffff8174b530>] ? rest_init+0x150/0x150
[<ffffffff8174b53e>] kernel_init+0xe/0x130
[<ffffffff8176e27c>] ret_from_fork+0x7c/0xb0
[<ffffffff8174b530>] ? rest_init+0x150/0x150
Code: c1 e0 05 48 63 bc 03 10 01 00 00 48 63 83 d0 00 00 00 48 63 d6 48 c1 e2 08 c1 e1 08 4c 63 c2 48 c1 e0 08 48 98 48 c1 e0 08 48 99 <49> f7 f8 48 98 48 0f af f8 48 c1 ff 08 29 f9 89 ca c1 fa 1f 89
RIP [<ffffffff815c551d>] intel_pstate_timer_func+0x11d/0x2b0
RSP <ffff88001ee03e18>
---[ end trace f166110ed22cc37a ]---
Kernel panic - not syncing: Fatal exception in interrupt

Reported-and-tested-by: Kashyap Chamarthy <kchamart@redhat.com>
Cc: Josh Boyer <jwboyer@fedoraproject.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: All applicable <stable@vger.kernel.org>


# 91a4cd4f 17-Dec-2013 Dirk Brandewie <dirk.j.brandewie@intel.com>

intel_pstate: Remove periodic P state boost

Remove the periodic P state boost. This code required for some corner
case benchmark tests. The calculation of the required P state was
incorrect/inaccurate and would not allow P state increase.

This was fixed by a combination of commits:
2134ed4 cpufreq / intel_pstate: Change to scale off of max P-state
d253d2a intel_pstate: Improve accuracy by not truncating until final result

References: https://bugzilla.kernel.org/show_bug.cgi?id=64271
Reported-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 007bea09 18-Dec-2013 Dirk Brandewie <dirk.j.brandewie@intel.com>

intel_pstate: Add setting voltage value for baytrail P states.

Baytrail requires setting P state and voltage pairs when adjusting the
requested P state. Add function for retrieving the valid voltage
values and modify *_set_pstate() functions to caluclate the
appropriate voltage for the requested P state.

Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# fbbcdc07 31-Oct-2013 Adrian Huang <adrianhuang0701@gmail.com>

intel_pstate: skip the driver if ACPI has power mgmt option

Do not load the Intel pstate driver if the platform firmware
(ACPI BIOS) supports the power management alternatives.
The ACPI BIOS indicates that the OS control mode can be used
if the _PSS (Performance Supported States) is defined in ACPI
table. For the OS control mode, the Intel pstate driver will be
loaded.

HP BIOS has several power management modes (firmware, OS-control and
so on). For the OS control mode in HP BIOS, the Intel p-state driver
will be loaded. When the customer chooses the firmware power
management in HP BIOS, the Intel p-state driver will be ignored.

I put hw_vendor_info vendor_info in case other vendors (Dell, Lenovo...)
have their firmware power management. Vendors should make sure their
firmware power management works properly, and they can go for adding
their vendor info to the variable.

I have verified the patch on HP ProLiant servers. The patch worked
correctly.

Signed-off-by: Adrian Huang <adrianhuang0701@gmail.com>
[rjw: Fixed up !CONFIG_ACPI build]
[Linda Knippers: As Adrian has recently left HP, I retested the
updated patch on an HP ProLiant server and verified that it is
behaving correctly. When the BIOS is configured for OS control for
power management, the intel_pstate driver loads as expected. When
the BIOS is configured to provide the power management, the
intel_pstate driver does not load and we get the pcc_cpufreq driver
instead.]
Signed-off-by: Linda Knippers <linda.knippers@hp.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# e0a261a2 30-Oct-2013 Dirk Brandewie <dirk.j.brandewie@intel.com>

cpufreq/intel_pstate: Add static declarations to internal functions

Fixes warnings reported by kbuild test robot

sparse warnings: (new ones prefixed by >>)

drivers/cpufreq/intel_pstate.c:729:6: sparse: symbol 'copy_pid_params' was not declared. Should it be static?
drivers/cpufreq/intel_pstate.c:739:6: sparse: symbol 'copy_cpu_funcs' was not declared. Should it be static?

Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 19e77c28 21-Oct-2013 Dirk Brandewie <dirk.j.brandewie@intel.com>

intel_pstate: Add Baytrail support

Add support for the Baytrail processor.

Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 016c8150 21-Oct-2013 Dirk Brandewie <dirk.j.brandewie@intel.com>

intel_pstate: Refactor driver to support CPUs with different MSR layouts

Non-core processors have a different MSR layout to commumicate P state
information. Refactor the driver to use CPU dependent accessors to
P state information.

Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 7244cb62 21-Oct-2013 Dirk Brandewie <dirk.j.brandewie@intel.com>

intel_pstate: Correct calculation of min pstate value

The minimum pstate is supposed to be a percentage of the maximum P
state available. Calculate min using max pstate and not the
current max which may have been limited by the user

Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# d253d2a5 21-Oct-2013 Brennan Shacklett <brennan@genyes.org>

intel_pstate: Improve accuracy by not truncating until final result

This patch addresses Bug 60727
(https://bugzilla.kernel.org/show_bug.cgi?id=60727)
which was due to the truncation of intermediate values in the
calculations, which causes the code to consistently underestimate the
current cpu frequency, specifically 100% cpu utilization was truncated
down to the setpoint of 97%. This patch fixes the problem by keeping
the results of all intermediate calculations as fixed point numbers
rather scaling them back and forth between integers and fixed point.

References: https://bugzilla.kernel.org/show_bug.cgi?id=60727
Signed-off-by: Brennan Shacklett <bpshacklett@gmail.com>
Acked-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 09c87e2f 16-Oct-2013 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

intel_pstate: Fix type mismatch warning

The expression in line 398 of intel_pstate.c causes the following
warning to be emitted:

drivers/cpufreq/intel_pstate.c:398:3: warning: left shift count >= width of type

which happens because unsigned long is 32-bit on some architectures.

Fix that by using a helper u64 variable and simplify the code
slightly.

Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 52e0a509 15-Oct-2013 Dirk Brandewie <dirk.j.brandewie@intel.com>

cpufreq / intel_pstate: Fix max_perf_pct on resume

If the system is suspended while max_perf_pct is less than 100 percent
or no_turbo set policy->{min,max} will be set incorrectly with scaled
values which turn the scaled values into hard limits.

References: https://bugzilla.kernel.org/show_bug.cgi?id=61241
Reported-by: Patrick Bartels <petzicus@googlemail.com>
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Cc: 3.9+ <stable@vger.kernel.org> # 3.9+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# be49e346 02-Oct-2013 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: add new routine cpufreq_verify_within_cpu_limits()

Most of the users of cpufreq_verify_within_limits() calls it for
limiting with min/max from policy->cpuinfo. We can make that code
simple by introducing another routine which will do this for them
automatically.

This patch adds another routine cpufreq_verify_within_cpu_limits()
and updates others to use it.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 1ccf7a1c 01-Oct-2013 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

intel_pstate: fix no_turbo

When sysfs for no_turbo is set, then also some p states in turbo regions
are observed. This patch will set IDA Engage bit when no_turbo is set to
explicitly disengage turbo.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 6cdcdb79 30-Jun-2013 Nell Hardcastle <nell@spicious.com>

intel_pstate: Add Haswell CPU models

Enable the intel_pstate driver for Haswell CPUs. One missing Ivy Bridge
model (0x3E) is also included. Models referenced from
tools/power/x86/turbostat/turbostat.c:has_nehalem_turbo_ratio_limit

Signed-off-by: Nell Hardcastle <nell@spicious.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# adc97d6a 06-Aug-2013 Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: Drop the owner field from struct cpufreq_driver

We don't need to set .owner = THIS_MODULE any more in cpufreq drivers
as this field isn't used any more by the cpufreq core.

This patch removes it and updates all dependent drivers accordingly.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 2134ed4d 18-Jul-2013 Dirk Brandewie <dirk.j.brandewie@intel.com>

cpufreq / intel_pstate: Change to scale off of max P-state

Change to using max P-state instead of max turbo P-state. This
change resolves two issues.

On a quiet system intel_pstate can fail to respond to a load change.

On CPU SKUs that have a limited number of P-states and no turbo range
intel_pstate fails to select the highest available P-state.

This change is suitable for stable v3.9+

References: https://bugzilla.kernel.org/show_bug.cgi?id=59481
Reported-and-tested-by: Arjan van de Ven <arjan@linux.intel.com>
Reported-and-tested-by: dsmythies@telus.net
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 3.9+ <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 2760984f 19-Jun-2013 Paul Gortmaker <paul.gortmaker@windriver.com>

cpufreq: delete __cpuinit usage from all cpufreq files

The __cpuinit type of throwaway sections might have made sense
some time ago when RAM was more constrained, but now the savings
do not offset the cost and complications. For example, the fix in
commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
is a good example of the nasty type of bugs that can be created
with improper use of the various __init prefixes.

After a discussion on LKML[1] it was decided that cpuinit should go
the way of devinit and be phased out. Once all the users are gone,
we can then finally remove the macros themselves from linux/init.h.

This removes all the drivers/cpufreq uses of the __cpuinit macros
from all C files.

[1] https://lkml.org/lkml/2013/5/20/589

[v2: leave 2nd lines of args misaligned as requested by Viresh]
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: cpufreq@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Acked-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>


# c96d53d6 17-May-2013 Dirk Brandewie <dirk.j.brandewie@intel.com>

cpufreq / intel_pstate: Add additional supported CPU ID

Add CPU ID for Ivybrigde processor.

Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# b57ffac5 13-May-2013 Wei Yongjun <yongjun_wei@trendmicro.com.cn>

cpufreq / intel_pstate: use vzalloc() instead of vmalloc()/memset(0)

Use vzalloc() instead of vmalloc() and memset(0).

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 35363e94 07-May-2013 Dirk Brandewie <dirk.j.brandewie@intel.com>

cpufreq / intel_pstate: remove #ifdef MODULE compile fence

The driver can no longer be built as a module remove the compile fence
around cpufreq tracing call.

Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# a73108d5 07-May-2013 Dirk Brandewie <dirk.j.brandewie@intel.com>

cpufreq / intel_pstate: Remove idle mode PID

Remove dead code from the driver.

Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# ca182aee 07-May-2013 Dirk Brandewie <dirk.j.brandewie@intel.com>

cpufreq / intel_pstate: fix ffmpeg regression

The ffmpeg benchmark in the phoronix test suite has threads on
multiple cores that rely on the progress on of threads on other cores
and ping pong back and forth fast enough to make the core appear less
busy than it "should" be. If the core has been at minimum p-state for
a while bump the pstate up to kick the core to see if it is in this
ping pong state. If the core is truly idle the p-state will be
reduced at the next sample time. If the core makes more progress it
will send more work to the thread bringing both threads out of the
ping pong scenario and the p-state will be selected normally.

This fixes a performance regression of approximately 30%

Cc: 3.9+ <stable@vger.kernel.org>
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# d8f469e9 07-May-2013 Dirk Brandewie <dirk.j.brandewie@intel.com>

cpufreq / intel_pstate: use lowest requested max performance

There are two ways that the maximum p-state can be clamped, via a
policy change and via the sysfs file.

The acpi-thermal driver adjusts the p-state policy in response to
thermal events. These changes override the users settings at the
moment.

Use the lowest of the two requested values this ensures that we will
not exceed the requested pstate from either mechanism.

Reported-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: 3.9+ <stable@vger.kernel.org>
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 1abc4b20 07-May-2013 Dirk Brandewie <dirk.j.brandewie@intel.com>

cpufreq / intel_pstate: remove idle time and duration from sample and calculations

Idle time is taken into account in the APERF/MPERF ratio calculation
there is no reason for the driver to track it seperately. This
reduces the work in the driver and makes the code more readable.

Removal of the tracking of sample duration removes the possibility of
the divide by zero exception when the duration is sub 1us

References: https://bugzilla.kernel.org/show_bug.cgi?id=56691
Reported-by: Mike Lothian <mike@fireburn.co.uk>
Cc: 3.9+ <stable@vger.kernel.org>
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# d1b68485 09-Apr-2013 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq / intel_pstate: Optimize intel_pstate_set_policy

This function is called quite often from other subsystems.
Removed unused call to intel_pstate_get_min_max().
Also when "policy->policy == CPUFREQ_POLICY_PERFORMANCE", then
no need to do calculations as the limits will be forced anyway.
Also corrected filename in the header.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# ec376a2a 04-Apr-2013 Dirk Brandewie <dirk.brandewie@gmail.com>

cpufreq / intel_pstate: Set timer timeout correctly

The current calculation of the delay time is wrong and a cut and
paste error from a previous experimental driver. This can result in
the timeout being set to jiffies + 1 which setup the driver to race
with itself if the APIC timer interrupt happens at just the right
time.

References: https://bugzilla.redhat.com/show_bug.cgi?id=920289
Reported-by: Adam Williamson <awilliam@redhat.com>
Reported-and-tested-by: Parag Warudkar <parag.lkml@gmail.com>
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 05e99c8cf 20-Mar-2013 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

intel-pstate: Use #defines instead of hard-coded values.

They are defined in coreboot (MSR_PLATFORM) and the other
one is already defined in msr-index.h.

Let's use those.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# e6f3eb29 23-Mar-2013 Dirk Brandewie <dirk.brandewie@gmail.com>

cpufreq / intel_pstate: Fix calculation of current frequency

Use the correct pstate value to calculate the effective frequency.

References: https://bugzilla.redhat.com/show_bug.cgi?id=923942
Reported-by: Satish Balay <balay@fastmail.fm>
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# b563b4e3 21-Mar-2013 Dirk Brandewie <dirk.brandewie@gmail.com>

cpufreq / intel_pstate: Add function to check that all MSRs are valid

Some VMs seem to try to implement some MSRs but not all the registers
the driver needs. Check to make sure all the MSR that we need are
available. If any of the required MSRs are not available refuse to
load.

References: https://bugzilla.redhat.com/show_bug.cgi?id=922923
Reported-by: Josh Stone <jistone@redhat.com>
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# d3929b83 05-Mar-2013 Dirk Brandewie <dirk.brandewie@gmail.com>

cpufreq / intel_pstate: Do not load on VM that does not report max P state.

It seems some VMs support the P state MSRs but return zeros. Fail
gracefully if we are running in this environment.

References: https://bugzilla.redhat.com/show_bug.cgi?id=916833
Reported-by: Josh Boyer <jwboyer@redhat.com>
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 907cc908 05-Mar-2013 Dirk Brandewie <dirk.brandewie@gmail.com>

cpufreq / intel_pstate: Fix intel_pstate_init() error path

If cpufreq_register_driver() fails just free memory that has been
allocated and return. intel_pstate_exit() function is removed since we
are built-in only now there is no reason for a module exit procedure.

Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 6be26498 15-Feb-2013 Dirk Brandewie <dirk.brandewie@gmail.com>

cpufreq / intel_pstate: Add kernel command line option disable intel_pstate.

When intel_pstate is configured into the kernel it will become the
preferred scaling driver for processors that it supports. Allow the
user to override this by adding:
intel_pstate=disable
on the kernel command line.

Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 191e5edf 11-Feb-2013 Dirk Brandewie <dirk.brandewie@gmail.com>

cpufreq / intel_pstate: Fix 32 bit build

Fixes 32 bit build.

on i386:
drivers/built-in.o: In function `intel_pstate_timer_func':
intel_pstate.c:(.text+0x4ce97e): undefined reference to `__udivdi3'
drivers/built-in.o: In function `intel_pstate_cpu_init':
intel_pstate.c:(.cpuinit.text+0x974): undefined reference to `__udivdi3'

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


# 93f0822d 06-Feb-2013 Dirk Brandewie <dirk.brandewie@gmail.com>

cpufreq/x86: Add P-state driver for sandy bridge.

Add a P-state driver for the Intel Sandy bridge processor. In cpufreq
terminology this driver implements a scaling driver with an internal
governor.

When built into the the kernel this driver will be the preferred
scaling driver for Sandy bridge processors.

In addition to the interfaces provided by the cpufreq subsystem for
controlling scaling drivers. The user may control the behavior of the
driver via three sysfs files located in
"/sys/devices/system/cpu/intel_pstate".

max_perf_pct: limits the maximum P state that will be requested by
the driver stated as a percentage of the avail performance.

min_perf_pct: limits the minimum P state that will be requested by
the driver stated as a percentage of the avail performance.

no_turbo: limits the driver to selecting P states below the turbo
frequency range.

Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>