History log of /linux-master/arch/powerpc/kernel/time.c
Revision Date Author Comments
# 0aafbdf3 18-Feb-2023 Christophe Leroy <christophe.leroy@csgroup.eu>

powerpc: Make generic_calibrate_decr() the default

ppc_md.calibrate_decr() is a mandatory item. Its nullity is never
checked so it must be non null on all platforms.

Most platforms define generic_calibrate_decr() as their
ppc_md.calibrate_decr(). Have time_init() call
generic_calibrate_decr() when ppc_md.calibrate_decr() is NULL,
and remove default assignment from all machines.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/6cb9865d916231c38401ba34ad1a98c249fae135.1676711562.git.christophe.leroy@csgroup.eu


# 2a7ce82d 05-Feb-2023 Rohan McLure <rmclure@linux.ibm.com>

powerpc/kcsan: Exclude udelay to prevent recursive instrumentation

In order for KCSAN to increase its likelihood of observing a data race,
it sets a watchpoint on memory accesses and stalls, allowing for
detection of conflicting accesses by other kernel threads or interrupts.

Stalls are implemented by injecting a call to udelay in instrumented code.
To prevent recursive instrumentation, exclude udelay from being instrumented.

Signed-off-by: Rohan McLure <rmclure@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20230206021801.105268-3-rmclure@linux.ibm.com


# c2854801 21-Jan-2023 Nicholas Piggin <npiggin@gmail.com>

powerpc/64: Fix perf profiling asynchronous interrupt handlers

Interrupt entry sets the soft mask to IRQS_ALL_DISABLED to match the
hard irq disabled state. So when should_hard_irq_enable() returns true
because we want PMI interrupts in irq handlers, MSR[EE] is enabled but
PMIs just get soft-masked. Fix this by clearing IRQS_PMI_DISABLED before
enabling MSR[EE].

This also tidies some of the warnings, no need to duplicate them in
both should_hard_irq_enable() and do_hard_irq_enable().

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20230121100156.2824054-1-npiggin@gmail.com


# f985adaf 06-Oct-2022 Nicholas Piggin <npiggin@gmail.com>

powerpc: remove the last remnants of cputime_t

cputime_t was a core kernel type, removed by commits
ed5c8c854f2b..b672592f0221. As explained in commit b672592f0221
("sched/cputime: Remove generic asm headers"), the final cleanup
is for the arch to provide cputime_to_nsec[s](). Commit ade7667a981b
("powerpc: Add cputime_to_nsecs()") did that, but justdidn't remove
the then-unused cputime_to_usecs(), cputime_t type, and associated
remnants.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20221006105653.115829-1-npiggin@gmail.com


# c8455020 09-Sep-2022 Nicholas Piggin <npiggin@gmail.com>

powerpc/time: avoid programming DEC at the start of the timer interrupt

Setting DEC to maximum at the start of the timer interrupt is not
necessary and can be avoided for performance when MSR[EE] is not
enabled during the handler as explained in commit 0faf20a1ad16
("powerpc/64s/interrupt: Don't enable MSR[EE] in irq handlers unless
perf is in use"), where this change was first attempted.

The idea is that the timer interrupt runs with MSR[EE]=0, and at the end
of the interrupt DEC is programmed to the next timer interval, so there
is no need to clear the decrementer exception before then.

When the above commit was merged, that was not quite true. The low res
timer subsystem had some cases in the oneshot timer code where if the
tick was to be stopped and no timers active, the clock device would not
get the ->set_state_oneshot_stopped() call, so DEC would not get
reprogrammed, and this would hang taking continual timer interrupts.

So this was reverted in commit d2b9be1f4af5 ("powerpc/time: Always set
decrementer in timer_interrupt()"), which was a partial revert of the
above commit.

Commit 62c1256d5447 ("timers/nohz: Switch to ONESHOT_STOPPED in the
low-res handler when the tick is stopped") was later merged to fix this
missing case in the timer subsystem, so now the behaviour can be
restored.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220909142457.278032-1-npiggin@gmail.com


# 6ba5aa54 02-Sep-2022 Nicholas Piggin <npiggin@gmail.com>

powerpc/pseries: Move dtl scanning and steal time accounting to pseries platform

dtl is the PAPR Dispatch Trace Log, which is entirely a pseries feature.
The pseries platform alrady has a file dealing with the dtl, so move
scanning for stolen time accounting there from kernel/time.c.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220902085316.2071519-5-npiggin@gmail.com


# e6f6390a 08-Mar-2022 Christophe Leroy <christophe.leroy@csgroup.eu>

powerpc: Add missing headers

Don't inherit headers "by chances" from asm/prom.h, asm/mpc52xx.h,
asm/pci.h etc...

Include the needed headers, and remove asm/prom.h when it was
needed exclusively for pulling necessary headers.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/be8bdc934d152a7d8ee8d1a840d5596e2f7d85e0.1646767214.git.christophe.leroy@csgroup.eu


# 1fd02f66 30-Apr-2022 Julia Lawall <Julia.Lawall@inria.fr>

powerpc: fix typos in comments

Various spelling mistakes in comments.
Detected with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220430185654.5855-1-Julia.Lawall@inria.fr


# ce0091a0 24-Mar-2021 He Ying <heying24@huawei.com>

powerpc/time: Fix sparse warnings

We found these warnings in arch/powerpc/kernel/time.c as follows:
warning: symbol 'decrementer_max' was not declared. Should it be static?
warning: symbol 'rtc_lock' was not declared. Should it be static?
warning: symbol 'dtl_consumer' was not declared. Should it be static?

Declare 'decrementer_max' in powerpc asm/time.h.

Include linux/mc146818rtc.h in powerpc kernel/time.c where 'rtc_lock' is
declared. And remove duplicated declaration of 'rtc_lock' in powerpc
platforms/chrp/time.c because it has included linux/mc146818rtc.h.

Move 'dtl_consumer' definition after "include <asm/dtl.h>" because it is
declared there.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: He Ying <heying24@huawei.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210324090939.143477-1-heying24@huawei.com


# d2b9be1f 20-Apr-2022 Michael Ellerman <mpe@ellerman.id.au>

powerpc/time: Always set decrementer in timer_interrupt()

This is a partial revert of commit 0faf20a1ad16 ("powerpc/64s/interrupt:
Don't enable MSR[EE] in irq handlers unless perf is in use").

Prior to that commit, we always set the decrementer in
timer_interrupt(), to clear the timer interrupt. Otherwise we could end
up continuously taking timer interrupts.

When high res timers are enabled there is no problem seen with leaving
the decrementer untouched in timer_interrupt(), because it will be
programmed via hrtimer_interrupt() -> tick_program_event() ->
clockevents_program_event() -> decrementer_set_next_event().

However with CONFIG_HIGH_RES_TIMERS=n or booting with highres=off, we
see a stall/lockup, because tick_nohz_handler() does not cause a
reprogram of the decrementer, leading to endless timer interrupts.
Example trace:

[ 1.898617][ T7] Freeing initrd memory: 2624K^M
[ 22.680919][ C1] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:^M
[ 22.682281][ C1] rcu: 0-....: (25 ticks this GP) idle=073/0/0x1 softirq=10/16 fqs=1050 ^M
[ 22.682851][ C1] (detected by 1, t=2102 jiffies, g=-1179, q=476)^M
[ 22.683649][ C1] Sending NMI from CPU 1 to CPUs 0:^M
[ 22.685252][ C0] NMI backtrace for cpu 0^M
[ 22.685649][ C0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.16.0-rc2-00185-g0faf20a1ad16 #145^M
[ 22.686393][ C0] NIP: c000000000016d64 LR: c000000000f6cca4 CTR: c00000000019c6e0^M
[ 22.686774][ C0] REGS: c000000002833590 TRAP: 0500 Not tainted (5.16.0-rc2-00185-g0faf20a1ad16)^M
[ 22.687222][ C0] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24000222 XER: 00000000^M
[ 22.688297][ C0] CFAR: c00000000000c854 IRQMASK: 0 ^M
...
[ 22.692637][ C0] NIP [c000000000016d64] arch_local_irq_restore+0x174/0x250^M
[ 22.694443][ C0] LR [c000000000f6cca4] __do_softirq+0xe4/0x3dc^M
[ 22.695762][ C0] Call Trace:^M
[ 22.696050][ C0] [c000000002833830] [c000000000f6cc80] __do_softirq+0xc0/0x3dc (unreliable)^M
[ 22.697377][ C0] [c000000002833920] [c000000000151508] __irq_exit_rcu+0xd8/0x130^M
[ 22.698739][ C0] [c000000002833950] [c000000000151730] irq_exit+0x20/0x40^M
[ 22.699938][ C0] [c000000002833970] [c000000000027f40] timer_interrupt+0x270/0x460^M
[ 22.701119][ C0] [c0000000028339d0] [c0000000000099a8] decrementer_common_virt+0x208/0x210^M

Possibly this should be fixed in the lowres timing code, but that would
be a generic change and could take some time and may not backport
easily, so for now make the programming of the decrementer unconditional
again in timer_interrupt() to avoid the stall/lockup.

Fixes: 0faf20a1ad16 ("powerpc/64s/interrupt: Don't enable MSR[EE] in irq handlers unless perf is in use")
Reported-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Link: https://lore.kernel.org/r/20220420141657.771442-1-mpe@ellerman.id.au


# 35de589c 24-Jan-2022 Nicholas Piggin <npiggin@gmail.com>

powerpc/time: improve decrementer clockevent processing

The stop/shutdown op should not use decrementer_set_next_event because
that sets decrementers_next_tb to now + decrementer_max, which means a
decrementer interrupt that occurs after that time will call the
clockevent event handler unexpectedly. Set next_tb to ~0 here to prevent
any clock event call. Init all clockevents to stopped.

Then the decrementer clockevent device always has event_handler set and
applicable because we know the clock event device was not stopped. So
make this call unconditional to show that it is always called. next_tb
need not be set to ~0 before the event handler is called because it will
stop the clockevent device if there is no other timer.

Finally, the timer broadcast interrupt should not modify next_tb because
it is not involved with the local decrementer clockevent on this CPU.

This doesn't fix a known bug, just tidies the code.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220124143930.3923442-3-npiggin@gmail.com


# cf74ff52 24-Jan-2022 Nicholas Piggin <npiggin@gmail.com>

powerpc/time: Fix KVM host re-arming a timer beyond decrementer range

If the next host timer is beyond decrementer range, timer_rearm_host_dec
will leave decrementer not programmed. This will not cause a problem for
the host it will just set the decrementer correctly when the decrementer
interrupt hits, it seems safer not to leave the next host decrementer
interrupt timing able to be influenced by a guest.

This code is only used in the P9 KVM paths so it's unlikely to be hit
practically unless large decrementer is force disabled in the host.

Fixes: 25aa145856cd ("powerpc/time: add API for KVM to re-arm the host timer/decrementer")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220124143930.3923442-2-npiggin@gmail.com


# 76222808 04-Mar-2022 Christophe Leroy <christophe.leroy@csgroup.eu>

powerpc: Move C prototypes out of asm-prototypes.h

We originally added asm-prototypes.h in commit 42f5b4cacd78 ("powerpc:
Introduce asm-prototypes.h"). It's purpose was for prototypes of C
functions that are only called from asm, in order to fix sparse
warnings about missing prototypes.

A few months later Nick added a different use case in
commit 4efca4ed05cb ("kbuild: modversions for EXPORT_SYMBOL() for asm")
for C prototypes for exported asm functions. This is basically the
inverse of our original usage.

Since then we've added various prototypes to asm-prototypes.h for both
reasons, meaning we now need to unstitch it all.

Dispatch prototypes of C functions into relevant headers and keep
only the prototypes for functions defined in assembly.

For the time being, leave prom_init() there because moving it
into asm/prom.h or asm/setup.h conflicts with
drivers/gpu/drm/nouveau/nvkm/subdev/bios/shadowrom.o
This will be fixed later by untaggling asm/pci.h and asm/prom.h
or by renaming the function in shadowrom.c

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/62d46904eca74042097acf4cb12c175e3067f3d1.1646413435.git.christophe.leroy@csgroup.eu


# cc15ff32 20-Jan-2022 Ganesh Goudar <ganeshgr@linux.ibm.com>

powerpc/mce: Avoid using irq_work_queue() in realmode

In realmode mce handler we use irq_work_queue() to defer
the processing of mce events, irq_work_queue() can only
be called when translation is enabled because it touches
memory outside RMA, hence we enable translation before
calling irq_work_queue and disable on return, though it
is not safe to do in realmode.

To avoid this, program the decrementer and call the event
processing functions from timer handler.

Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220120121931.517974-1-ganeshgr@linux.ibm.com


# 8defc2a5 24-Jan-2022 Nicholas Piggin <npiggin@gmail.com>

powerpc/64s/interrupt: Fix decrementer storm

The decrementer exception can fail to be cleared when the interrupt
returns in the case where the decrementer wraps with the next timer
still beyond decrementer_max. This results in a decrementer interrupt
storm. This is triggerable with small decrementer system with hard
and soft watchdogs disabled.

Fix this by always programming the decrementer if there was no timer.

Fixes: 0faf20a1ad16 ("powerpc/64s/interrupt: Don't enable MSR[EE] in irq handlers unless perf is in use")
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220124143930.3923442-1-npiggin@gmail.com


# 0faf20a1 22-Sep-2021 Nicholas Piggin <npiggin@gmail.com>

powerpc/64s/interrupt: Don't enable MSR[EE] in irq handlers unless perf is in use

Enabling MSR[EE] in interrupt handlers while interrupts are still soft
masked allows PMIs to profile interrupt handlers to some degree, beyond
what SIAR latching allows.

When perf is not being used, this is almost useless work. It requires an
extra mtmsrd in the irq handler, and it also opens the door to masked
interrupts hitting and requiring replay, which is more expensive than
just taking them directly. This effect can be noticable in high IRQ
workloads.

Avoid enabling MSR[EE] unless perf is currently in use. This saves about
60 cycles (or 8%) on a simple decrementer interrupt microbenchmark.
Replayed interrupts drop from 1.4% of all interrupts taken, to 0.003%.

This does prevent the soft-nmi interrupt being taken in these handlers,
but that's not too reliable anyway. The SMP watchdog will continue to be
the reliable way to catch lockups.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210922145452.352571-5-npiggin@gmail.com


# 047a6fd4 19-Oct-2021 Christophe Leroy <christophe.leroy@csgroup.eu>

powerpc/config: Add CONFIG_BOOKE_OR_40x

We have many functionnalities common to 40x and BOOKE, it leads to
many places with #if defined(CONFIG_BOOKE) || defined(CONFIG_40x).

We are going to add a few more with KUAP for booke/40x, so create
a new symbol which is defined when either BOOKE or 40x is defined.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/9a3dbd60924cb25c9f944d3d8205ac5a0d15e229.1634627931.git.christophe.leroy@csgroup.eu


# 25aa1458 23-Nov-2021 Nicholas Piggin <npiggin@gmail.com>

powerpc/time: add API for KVM to re-arm the host timer/decrementer

Rather than have KVM look up the host timer and fiddle with the
irq-work internal details, have the powerpc/time.c code provide a
function for KVM to re-arm the Linux timer code when exiting a
guest.

This is implementation has an improvement over existing code of
marking a decrementer interrupt as soft-pending if a timer has
expired, rather than setting DEC to a -ve value, which tended to
cause host timers to take two interrupts (first hdec to exit the
guest, then the immediate dec).

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211123095231.1036501-8-npiggin@gmail.com


# 9581991a 23-Nov-2021 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Book3S HV P9: Use large decrementer for HDEC

On processors that don't suppress the HDEC exceptions when LPCR[HDICE]=0,
this could help reduce needless guest exits due to leftover exceptions on
entering the guest.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211123095231.1036501-6-npiggin@gmail.com


# 4ebbd075 23-Nov-2021 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Book3S HV P9: Use host timer accounting to avoid decrementer read

There is no need to save away the host DEC value, as it is derived
from the host timer subsystem which maintains the next timer time,
so it can be restored from there.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211123095231.1036501-5-npiggin@gmail.com


# e606a2f4 31-Aug-2021 Christophe Leroy <christophe.leroy@csgroup.eu>

powerpc/time: Remove generic_suspend_{dis/en}able_irqs()

Commit d75d68cfef49 ("powerpc: Clean up obsolete code relating to
decrementer and timebase") made generic_suspend_enable_irqs() and
generic_suspend_disable_irqs() static.

Fold them into their only caller.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c3f9ec9950394ef939014f7934268e6ee30ca04f.1630398566.git.christophe.leroy@csgroup.eu


# e225c4d6 23-Mar-2021 Wan Jiabing <wanjiabing@vivo.com>

powerpc: Remove duplicate includes

interrupt.c: asm/interrupt.h has been included at line 12, so remove the
duplicate one at line 10.

time.c: linux/sched/clock.h has been included at line 33,so remove the
duplicate one at line 56 and move sched/cputime.h under sched including
segament.

Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Reviewed-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210323062916.295346-1-wanjiabing@vivo.com


# 98694166 10-Aug-2021 Christophe Leroy <christophe.leroy@csgroup.eu>

powerpc/interrupt: Fix OOPS by not calling do_IRQ() from timer_interrupt()

An interrupt handler shall not be called from another interrupt
handler otherwise this leads to problems like the following:

Kernel attempted to write user page (afd4fa84) - exploit attempt? (uid: 1000)
------------[ cut here ]------------
Bug: Write fault blocked by KUAP!
WARNING: CPU: 0 PID: 1617 at arch/powerpc/mm/fault.c:230 do_page_fault+0x484/0x720
Modules linked in:
CPU: 0 PID: 1617 Comm: sshd Tainted: G W 5.13.0-pmac-00010-g8393422eb77 #7
NIP: c001b77c LR: c001b77c CTR: 00000000
REGS: cb9e5bc0 TRAP: 0700 Tainted: G W (5.13.0-pmac-00010-g8393422eb77)
MSR: 00021032 <ME,IR,DR,RI> CR: 24942424 XER: 00000000

GPR00: c001b77c cb9e5c80 c1582c00 00000021 3ffffbff 085b0000 00000027 c8eb644c
GPR08: 00000023 00000000 00000000 00000000 24942424 0063f8c8 00000000 000186a0
GPR16: afd52dd4 afd52dd0 afd52dcc afd52dc8 0065a990 c07640c4 cb9e5e98 cb9e5e90
GPR24: 00000040 afd4fa96 00000040 02000000 c1fda6c0 afd4fa84 00000300 cb9e5cc0
NIP [c001b77c] do_page_fault+0x484/0x720
LR [c001b77c] do_page_fault+0x484/0x720
Call Trace:
[cb9e5c80] [c001b77c] do_page_fault+0x484/0x720 (unreliable)
[cb9e5cb0] [c000424c] DataAccess_virt+0xd4/0xe4
--- interrupt: 300 at __copy_tofrom_user+0x110/0x20c
NIP: c001f9b4 LR: c03250a0 CTR: 00000004
REGS: cb9e5cc0 TRAP: 0300 Tainted: G W (5.13.0-pmac-00010-g8393422eb77)
MSR: 00009032 <EE,ME,IR,DR,RI> CR: 48028468 XER: 20000000
DAR: afd4fa84 DSISR: 0a000000
GPR00: 20726f6f cb9e5d80 c1582c00 00000004 cb9e5e3a 00000016 afd4fa80 00000000
GPR08: 3835202d 72777872 2d78722d 00000004 28028464 0063f8c8 00000000 000186a0
GPR16: afd52dd4 afd52dd0 afd52dcc afd52dc8 0065a990 c07640c4 cb9e5e98 cb9e5e90
GPR24: 00000040 afd4fa96 00000040 cb9e5e0c 00000daa a0000000 cb9e5e98 afd4fa56
NIP [c001f9b4] __copy_tofrom_user+0x110/0x20c
LR [c03250a0] _copy_to_iter+0x144/0x990
--- interrupt: 300
[cb9e5d80] [c03e89c0] n_tty_read+0xa4/0x598 (unreliable)
[cb9e5df0] [c03e2a0c] tty_read+0xdc/0x2b4
[cb9e5e80] [c0156bf8] vfs_read+0x274/0x340
[cb9e5f00] [c01571ac] ksys_read+0x70/0x118
[cb9e5f30] [c0016048] ret_from_syscall+0x0/0x28
--- interrupt: c00 at 0xa7855c88
NIP: a7855c88 LR: a7855c5c CTR: 00000000
REGS: cb9e5f40 TRAP: 0c00 Tainted: G W (5.13.0-pmac-00010-g8393422eb77)
MSR: 0000d032 <EE,PR,ME,IR,DR,RI> CR: 2402446c XER: 00000000

GPR00: 00000003 afd4ec70 a72137d0 0000000b afd4ecac 00004000 0065a990 00000800
GPR08: 00000000 a7947930 00000000 00000004 c15831b0 0063f8c8 00000000 000186a0
GPR16: afd52dd4 afd52dd0 afd52dcc afd52dc8 0065a990 0065a9e0 00000001 0065fac0
GPR24: 00000000 00000089 00664050 00000000 00668e30 a720c8dc a7943ff4 0065f9b0
NIP [a7855c88] 0xa7855c88
LR [a7855c5c] 0xa7855c5c
--- interrupt: c00
Instruction dump:
3884aa88 38630178 48076861 807f0080 48042e45 2f830000 419e0148 3c80c079
3c60c076 38841be4 386301c0 4801f705 <0fe00000> 3860000b 4bfffe30 3c80c06b
---[ end trace fd69b91a8046c2e5 ]---

Here the problem is that by re-enterring an exception handler,
kuap_save_and_lock() is called a second time with this time KUAP
access locked, leading to regs->kuap being overwritten hence
KUAP not being unlocked at exception exit as expected.

Do not call do_IRQ() from timer_interrupt() directly. Instead,
redefine do_IRQ() as a standard function named __do_IRQ(), and
call it from both do_IRQ() and time_interrupt() handlers.

Fixes: 3a96570ffceb ("powerpc: convert interrupt handlers to use wrappers")
Cc: stable@vger.kernel.org # v5.12+
Reported-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c17d234f4927d39a1d7100864a8e1145323d33a0.1628611927.git.christophe.leroy@csgroup.eu


# 0cdff98b 22-Jun-2021 Nicholas Piggin <npiggin@gmail.com>

powerpc/64s: Remove irq mask workaround in accumulate_stolen_time()

The caller has been moved to C after irq soft-mask state has been
reconciled, and Linux IRQs have been marked as disabled, so this no
longer needs to play games with IRQ internals.

Fixes: 68b34588e202 ("powerpc/64/sycall: Implement syscall entry/exit logic in C")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210623022924.704645-1-npiggin@gmail.com


# 6ffe2c6e 28-May-2021 Nicholas Piggin <npiggin@gmail.com>

KVM: PPC: Book3S HV P9: Reduce irq_work vs guest decrementer races

irq_work's use of the DEC SPR is racy with guest<->host switch and guest
entry which flips the DEC interrupt to guest, which could lose a host
work interrupt.

This patch closes one race, and attempts to comment another class of
races.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210528090752.3542186-11-npiggin@gmail.com


# 1b1b6a6f 30-Jan-2021 Nicholas Piggin <npiggin@gmail.com>

powerpc: handle irq_enter/irq_exit in interrupt handler wrappers

Move irq_enter/irq_exit into asynchronous interrupt handler wrappers.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-35-npiggin@gmail.com


# 3a96570f 30-Jan-2021 Nicholas Piggin <npiggin@gmail.com>

powerpc: convert interrupt handlers to use wrappers

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-29-npiggin@gmail.com


# b709e32e 22-Oct-2020 Pingfan Liu <kernelfans@gmail.com>

powerpc/time: Enable sched clock for irqtime

When CONFIG_IRQ_TIME_ACCOUNTING and CONFIG_VIRT_CPU_ACCOUNTING_GEN, powerpc
does not enable "sched_clock_irqtime" and can not utilize irq time
accounting.

Like x86, powerpc does not use the sched_clock_register() interface. So it
needs an dedicated call to enable_sched_clock_irqtime() to enable irq time
accounting.

Fixes: 518470fe962e ("powerpc: Add HAVE_IRQ_TIME_ACCOUNTING")
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
[mpe: Add fixes tag]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1603349479-26185-1-git-send-email-kernelfans@gmail.com


# 59d512e4 06-Nov-2020 Nicholas Piggin <npiggin@gmail.com>

powerpc/64: irq replay remove decrementer overflow check

This is way to catch some cases of decrementer overflow, when the
decrementer has underflowed an odd number of times, while MSR[EE] was
disabled.

With a typical small decrementer, a timer that fires when MSR[EE] is
disabled will be "lost" if MSR[EE] remains disabled for between 4.3 and
8.6 seconds after the timer expires. In any case, the decrementer
interrupt would be taken at 8.6 seconds and the timer would be found at
that point.

So this check is for catching extreme latency events, and it prevents
those latencies from being a further few seconds long. It's not obvious
this is a good tradeoff. This is already a watchdog magnitude event and
that situation is not improved a significantly with this check. For
large decrementers, it's useless.

Therefore remove this check, which avoids a mftb when enabling hard
disabled interrupts (e.g., when enabling after coming from hardware
interrupt handlers). Perhaps more importantly, it also removes the
clunky MSR[EE] vs PACA_IRQ_HARD_DIS incoherency in soft-interrupt replay
which simplifies the code.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201107014336.2337337-1-npiggin@gmail.com


# ab037dd8 26-Nov-2020 Christophe Leroy <christophe.leroy@csgroup.eu>

powerpc/vdso: Switch VDSO to generic C implementation.

With the C VDSO, the performance is slightly lower, but it is worth
it as it will ease maintenance and evolution, and also brings clocks
that are not supported with the ASM VDSO.

On an 8xx at 132 MHz, vdsotest with the ASM VDSO:
gettimeofday: vdso: 828 nsec/call
clock-getres-realtime-coarse: vdso: 391 nsec/call
clock-gettime-realtime-coarse: vdso: 614 nsec/call
clock-getres-realtime: vdso: 460 nsec/call
clock-gettime-realtime: vdso: 876 nsec/call
clock-getres-monotonic-coarse: vdso: 399 nsec/call
clock-gettime-monotonic-coarse: vdso: 691 nsec/call
clock-getres-monotonic: vdso: 460 nsec/call
clock-gettime-monotonic: vdso: 1026 nsec/call

On an 8xx at 132 MHz, vdsotest with the C VDSO:
gettimeofday: vdso: 955 nsec/call
clock-getres-realtime-coarse: vdso: 545 nsec/call
clock-gettime-realtime-coarse: vdso: 592 nsec/call
clock-getres-realtime: vdso: 545 nsec/call
clock-gettime-realtime: vdso: 941 nsec/call
clock-getres-monotonic-coarse: vdso: 545 nsec/call
clock-gettime-monotonic-coarse: vdso: 591 nsec/call
clock-getres-monotonic: vdso: 545 nsec/call
clock-gettime-monotonic: vdso: 940 nsec/call

It is even better for gettime with monotonic clocks.

Unsupported clocks with ASM VDSO:
clock-gettime-boottime: vdso: 3851 nsec/call
clock-gettime-tai: vdso: 3852 nsec/call
clock-gettime-monotonic-raw: vdso: 3396 nsec/call

Same clocks with C VDSO:
clock-gettime-tai: vdso: 941 nsec/call
clock-gettime-monotonic-raw: vdso: 1001 nsec/call
clock-gettime-monotonic-coarse: vdso: 591 nsec/call

On an 8321E at 333 MHz, vdsotest with the ASM VDSO:
gettimeofday: vdso: 220 nsec/call
clock-getres-realtime-coarse: vdso: 102 nsec/call
clock-gettime-realtime-coarse: vdso: 178 nsec/call
clock-getres-realtime: vdso: 129 nsec/call
clock-gettime-realtime: vdso: 235 nsec/call
clock-getres-monotonic-coarse: vdso: 105 nsec/call
clock-gettime-monotonic-coarse: vdso: 208 nsec/call
clock-getres-monotonic: vdso: 129 nsec/call
clock-gettime-monotonic: vdso: 274 nsec/call

On an 8321E at 333 MHz, vdsotest with the C VDSO:
gettimeofday: vdso: 272 nsec/call
clock-getres-realtime-coarse: vdso: 160 nsec/call
clock-gettime-realtime-coarse: vdso: 184 nsec/call
clock-getres-realtime: vdso: 166 nsec/call
clock-gettime-realtime: vdso: 281 nsec/call
clock-getres-monotonic-coarse: vdso: 160 nsec/call
clock-gettime-monotonic-coarse: vdso: 184 nsec/call
clock-getres-monotonic: vdso: 169 nsec/call
clock-gettime-monotonic: vdso: 275 nsec/call

On a Power9 Nimbus DD2.2 at 3.8GHz, with the ASM VDSO:
clock-gettime-monotonic: vdso: 35 nsec/call
clock-getres-monotonic: vdso: 16 nsec/call
clock-gettime-monotonic-coarse: vdso: 18 nsec/call
clock-getres-monotonic-coarse: vdso: 522 nsec/call
clock-gettime-monotonic-raw: vdso: 598 nsec/call
clock-getres-monotonic-raw: vdso: 520 nsec/call
clock-gettime-realtime: vdso: 34 nsec/call
clock-getres-realtime: vdso: 16 nsec/call
clock-gettime-realtime-coarse: vdso: 18 nsec/call
clock-getres-realtime-coarse: vdso: 517 nsec/call
getcpu: vdso: 8 nsec/call
gettimeofday: vdso: 25 nsec/call

And with the C VDSO:
clock-gettime-monotonic: vdso: 37 nsec/call
clock-getres-monotonic: vdso: 20 nsec/call
clock-gettime-monotonic-coarse: vdso: 21 nsec/call
clock-getres-monotonic-coarse: vdso: 19 nsec/call
clock-gettime-monotonic-raw: vdso: 38 nsec/call
clock-getres-monotonic-raw: vdso: 20 nsec/call
clock-gettime-realtime: vdso: 37 nsec/call
clock-getres-realtime: vdso: 20 nsec/call
clock-gettime-realtime-coarse: vdso: 20 nsec/call
clock-getres-realtime-coarse: vdso: 19 nsec/call
getcpu: vdso: 8 nsec/call
gettimeofday: vdso: 28 nsec/call

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201126131006.2431205-8-mpe@ellerman.id.au


# 8a6a5920 01-Dec-2020 Frederic Weisbecker <frederic@kernel.org>

sched/vtime: Consolidate IRQ time accounting

The 3 architectures implementing CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
all have their own version of irq time accounting that dispatch the
cputime to the appropriate index: hardirq, softirq, system, idle,
guest... from an all-in-one function.

Instead of having these ad-hoc versions, move the cputime destination
dispatch decision to the core code and leave only the actual per-index
cputime accounting to the architecture.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201202115732.27827-4-frederic@kernel.org


# 942e8911 30-Sep-2020 Christophe Leroy <christophe.leroy@csgroup.eu>

powerpc/time: Avoid using get_tbl() and get_tbu() internally

get_tbl() is confusing as it returns the content of TBL register
on PPC32 but the concatenation of TBL and TBU on PPC64.

Use mftb() instead.

Do the same with get_tbu() for consistency allthough it's name
is less confusing.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/41573406a4eab98838decaa91649086fef1e6119.1601556145.git.christophe.leroy@csgroup.eu


# 6601ec1c 29-Sep-2020 Christophe Leroy <christophe.leroy@csgroup.eu>

powerpc: Remove get_tb_or_rtc()

601 is gone, get_tb_or_rtc() is equivalent to get_tb().

Replace the former by the later.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/3e8a13ee83418630c753c30cb722ae682d5b2d39.1601362098.git.christophe.leroy@csgroup.eu


# a4c5a355 29-Sep-2020 Christophe Leroy <christophe.leroy@csgroup.eu>

powerpc: Remove __USE_RTC()

Now that PowerPC 601 is gone, __USE_RTC() is never true.

Remove it.

That also leads to removing get_rtc() and get_rtcl()

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/4757e1ed21fe1968c761ae081d1f3d790a9673f8.1601362098.git.christophe.leroy@csgroup.eu


# d6bdceb6 29-May-2020 Peter Zijlstra <peterz@infradead.org>

powerpc64: Break asm/percpu.h vs spinlock_types.h dependency

In order to use <asm/percpu.h> in lockdep.h, we need to make sure
asm/percpu.h does not itself depend on lockdep.

The below seems to make that so and builds powerpc64-defconfig +
PROVE_LOCKING.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
https://lkml.kernel.org/r/20200623083721.336906073@infradead.org


# abc3fce7 02-Apr-2020 Nicholas Piggin <npiggin@gmail.com>

Revert "powerpc/64: irq_work avoid interrupt when called with hardware irqs enabled"

This reverts commit ebb37cf3ffd39fdb6ec5b07111f8bb2f11d92c5f.

That commit does not play well with soft-masked irq state
manipulations in idle, interrupt replay, and possibly others due to
tracing code sometimes using irq_work_queue (e.g., in
trace_hardirqs_on()). That can cause PACA_IRQ_DEC to become set when
it is not expected, and be ignored or cleared or cause warnings.

The net result seems to be missing an irq_work until the next timer
interrupt in the worst case which is usually not going to be noticed,
however it could be a long time if the tick is disabled, which is
against the spirit of irq_work and might cause real problems.

The idea is still solid, but it would need more work. It's not really
clear if it would be worth added complexity, so revert this for
now (not a straight revert, but replace with a comment explaining why
we might see interrupts happening, and gives git blame something to
find).

Fixes: ebb37cf3ffd3 ("powerpc/64: irq_work avoid interrupt when called with hardware irqs enabled")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200402120401.1115883-1-npiggin@gmail.com


# 60083063 13-Feb-2020 Geert Uytterhoeven <geert+renesas@glider.be>

powerpc/time: Replace <linux/clk-provider.h> by <linux/of_clk.h>

The PowerPC time code is not a clock provider, and just needs to call
of_clk_init().

Hence it can include <linux/of_clk.h> instead of <linux/clk-provider.h>.

Remove the #ifdef protecting the of_clk_init() call, as a stub is
available for the !CONFIG_COMMON_CLK case.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Stephen Boyd <sboyd@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200213083804.24315-1-geert+renesas@glider.be


# 2babd6ea 25-Feb-2020 Nicholas Piggin <npiggin@gmail.com>

powerpc/64s/exception: Avoid touching the stack in hdecrementer

The hdec interrupt handler is reported to sometimes fire in Linux if
KVM leaves it pending after a guest exists. This is harmless, so there
is a no-op handler for it.

The interrupt handler currently uses the regular kernel stack. Change
this to avoid touching the stack entirely.

This should be the last place where the regular Linux stack can be
accessed with asynchronous interrupts (including PMI) soft-masked.
It might be possible to take advantage of this invariant, e.g., to
context switch the kernel stack SLB entry without clearing MSR[EE].

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200225173541.1549955-17-npiggin@gmail.com


# 55226345 02-Dec-2019 Vincenzo Frascino <vincenzo.frascino@arm.com>

powerpc: Fix vDSO clock_getres()

clock_getres in the vDSO library has to preserve the same behaviour
of posix_get_hrtimer_res().

In particular, posix_get_hrtimer_res() does:
sec = 0;
ns = hrtimer_resolution;
and hrtimer_resolution depends on the enablement of the high
resolution timers that can happen either at compile or at run time.

Fix the powerpc vdso implementation of clock_getres keeping a copy of
hrtimer_resolution in vdso data and using that directly.

Fixes: a7f290dad32e ("[PATCH] powerpc: Merge vdso's and add vdso support to 32 bits kernel")
Cc: stable@vger.kernel.org
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Acked-by: Shuah Khan <skhan@linuxfoundation.org>
[chleroy: changed CLOCK_REALTIME_RES to CLOCK_HRTIMER_RES]
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/a55eca3a5e85233838c2349783bcb5164dae1d09.1575273217.git.christophe.leroy@c-s.fr


# 176ed98c 27-Oct-2019 Arnd Bergmann <arnd@arndb.de>

y2038: vdso: powerpc: avoid timespec references

As a preparation to stop using 'struct timespec' in the kernel,
change the powerpc vdso implementation:

- split up the vdso data definition to have equivalent members
for seconds and nanoseconds instead of an xtime structure

- use timespec64 as an intermediate for the xtime update

- change the asm-offsets definition to be based the appropriate
fixed-length types

This is only a temporary fix for changing the types, in order
to actually support a 64-bit safe vdso32 version of clock_gettime(),
the entire powerpc vdso should be replaced with the generic
lib/vdso/ implementation. If that happens first, this patch
becomes obsolete.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>


# eb8e20f8 13-Oct-2019 Michael Ellerman <mpe@ellerman.id.au>

powerpc/pseries: Mark accumulate_stolen_time() as notrace

accumulate_stolen_time() is called prior to interrupt state being
reconciled, which can trip the warning in arch_local_irq_restore():

WARNING: CPU: 5 PID: 1017 at arch/powerpc/kernel/irq.c:258 .arch_local_irq_restore+0x9c/0x130
...
NIP .arch_local_irq_restore+0x9c/0x130
LR .rb_start_commit+0x38/0x80
Call Trace:
.ring_buffer_lock_reserve+0xe4/0x620
.trace_function+0x44/0x210
.function_trace_call+0x148/0x170
.ftrace_ops_no_ops+0x180/0x1d0
ftrace_call+0x4/0x8
.accumulate_stolen_time+0x1c/0xb0
decrementer_common+0x124/0x160

For now just mark it as notrace. We may change the ordering to call it
after interrupt state has been reconciled, but that is a larger
change.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191024055932.27940-1-mpe@ellerman.id.au


# f83eeb1a 03-Oct-2019 Frederic Weisbecker <frederic@kernel.org>

sched/cputime: Rename vtime_account_system() to vtime_account_kernel()

vtime_account_system() decides if we need to account the time to the
system (__vtime_account_system()) or to the guest (vtime_account_guest()).

So this function is a misnomer as we are on a higher level than
"system". All we know when we call that function is that we are
accounting kernel cputime. Whether it belongs to guest or system time
is a lower level detail.

Rename this function to vtime_account_kernel(). This will clarify things
and avoid too many underscored vtime_account_system() versions.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Link: https://lkml.kernel.org/r/20191003161745.28464-2-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 2874c5fd 27-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3029 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# de269129 04-Mar-2019 Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

powerpc/hmi: Fix kernel hang when TB is in error state.

On TOD/TB errors timebase register stops/freezes until HMI error recovery
gets TOD/TB back into running state. On successful recovery, TB starts
running again and udelay() that relies on TB value continues to function
properly. But in case when HMI fails to recover from TOD/TB errors, the
TB register stay freezed. With TB not running the __delay() function
keeps looping and never return. If __delay() is called while in panic
path then system hangs and never reboots after panic.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 6917735e 23-Mar-2019 Jagadeesh Pagadala <jagdsh.linux@gmail.com>

powerpc: Remove duplicate headers

Remove duplicate headers inclusions.

Signed-off-by: Jagadeesh Pagadala <jagdsh.linux@gmail.com>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 75f8a375 28-Jan-2019 Brajeswar Ghosh <brajeswar.linux@gmail.com>

powerpc/kernel/time: Remove duplicate header

Remove linux/rtc.h which is included more than once

Signed-off-by: Brajeswar Ghosh <brajeswar.linux@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 51eeef9e 02-Aug-2018 Christophe Leroy <christophe.leroy@c-s.fr>

powerpc/time: no steal_time when CONFIG_PPC_SPLPAR is not selected

If CONFIG_PPC_SPLPAR is not selected, steal_time will always
be NUL, so accounting it is pointless

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# abcff86d 02-Aug-2018 Christophe Leroy <christophe.leroy@c-s.fr>

powerpc/time: Only set CONFIG_ARCH_HAS_SCALED_CPUTIME on PPC64

scaled cputime is only meaningfull when the processor has
SPURR and/or PURR, which means only on PPC64.

Removing it on PPC32 significantly reduces the size of
vtime_account_system() and vtime_account_idle() on an 8xx:

Before:
00000000 l F .text 000000a8 vtime_delta
00000280 g F .text 0000010c vtime_account_system
0000038c g F .text 00000048 vtime_account_idle

After:
(vtime_delta gets inlined inside the two functions)
000001d8 g F .text 000000a0 vtime_account_system
00000278 g F .text 00000038 vtime_account_idle

In terms of performance, we also get approximatly 7% improvement on
task switch. The following small benchmark app is run with perf stat:

void *thread(void *arg)
{
int i;

for (i = 0; i < atoi((char*)arg); i++)
pthread_yield();
}

int main(int argc, char **argv)
{
pthread_t th1, th2;

pthread_create(&th1, NULL, thread, argv[1]);
pthread_create(&th2, NULL, thread, argv[1]);
pthread_join(th1, NULL);
pthread_join(th2, NULL);

return 0;
}

Before the patch:

Performance counter stats for 'chrt -f 98 ./sched 100000' (50 runs):

8228.476465 task-clock (msec) # 0.954 CPUs utilized ( +- 0.23% )
200004 context-switches # 0.024 M/sec ( +- 0.00% )

After the patch:

Performance counter stats for 'chrt -f 98 ./sched 100000' (50 runs):

7649.070444 task-clock (msec) # 0.955 CPUs utilized ( +- 0.27% )
200004 context-switches # 0.026 M/sec ( +- 0.00% )

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# b38a181c 02-Aug-2018 Christophe Leroy <christophe.leroy@c-s.fr>

powerpc/time: isolate scaled cputime accounting in dedicated functions.

scaled cputime is only meaningfull when the processor has
SPURR and/or PURR, which means only on PPC64.

In preparation of the following patch that will remove
CONFIG_ARCH_HAS_SCALED_CPUTIME on PPC32, this patch moves
all scaled cputing accounting logic into dedicated functions.

This patch doesn't change any functionality. It's only code
reorganisation.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# b4d16ab5 17-Oct-2018 Michael Ellerman <mpe@ellerman.id.au>

powerpc/time: Fix clockevent_decrementer initalisation for PR KVM

In the recent commit 8b78fdb045de ("powerpc/time: Use
clockevents_register_device(), fixing an issue with large
decrementer") we changed the way we initialise the decrementer
clockevent(s).

We no longer initialise the mult & shift values of
decrementer_clockevent itself.

This has the effect of breaking PR KVM, because it uses those values
in kvmppc_emulate_dec(). The symptom is guest kernels spin forever
mid-way through boot.

For now fix it by assigning back to decrementer_clockevent the mult
and shift values.

Fixes: 8b78fdb045de ("powerpc/time: Use clockevents_register_device(), fixing an issue with large decrementer")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 81759360 01-Oct-2018 Anton Blanchard <anton@ozlabs.org>

powerpc/time: Add set_state_oneshot_stopped decrementer callback

If CONFIG_PPC_WATCHDOG is enabled we always cap the decrementer to
0x7fffffff:

if (IS_ENABLED(CONFIG_PPC_WATCHDOG))
set_dec(0x7fffffff);
else
set_dec(decrementer_max);

If there are no future events, we don't reprogram the decrementer
after this and we end up with 0x7fffffff even on a large decrementer
capable system.

As suggested by Nick, add a set_state_oneshot_stopped callback
so we program the decrementer with decrementer_max if there are
no future events.

Signed-off-by: Anton Blanchard <anton@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 8b78fdb0 01-Oct-2018 Anton Blanchard <anton@ozlabs.org>

powerpc/time: Use clockevents_register_device(), fixing an issue with large decrementer

We currently cap the decrementer clockevent at 4 seconds, even on systems
with large decrementer support. Fix this by converting the code to use
clockevents_register_device() which calculates the upper bound based on
the max_delta passed in.

Signed-off-by: Anton Blanchard <anton@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 60f1d289 29-May-2018 Christophe Leroy <christophe.leroy@c-s.fr>

powerpc/time: inline arch_vtime_task_switch()

arch_vtime_task_switch() is a small function which is called
only from vtime_common_task_switch(), so it is worth inlining

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 34efabe4 23-Apr-2018 Arnd Bergmann <arnd@arndb.de>

powerpc: remove unused to_tm() helper

to_tm() is now completely unused, the only reference being in the
_dump_time() helper that is also unused. This removes both, leaving
the rest of the powerpc RTC code y2038 safe to as far as the hardware
supports.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 5235afa8 23-Apr-2018 Arnd Bergmann <arnd@arndb.de>

powerpc: use time64_t in update_persistent_clock

update_persistent_clock() is deprecated because it suffers from overflow
in 2038 on 32-bit architectures. This changes powerpc to use the
update_persistent_clock64() replacement, and to pass down 64-bit
timestamps consistently.

This is now simpler, as we no longer have to worry about the offset
numbers in tm_year and tm_mon that are different between the Linux
conventions and RTAS.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 5bfd6435 23-Apr-2018 Arnd Bergmann <arnd@arndb.de>

powerpc: use time64_t in read_persistent_clock

Looking through the remaining users of the deprecated mktime()
function, I found the powerpc rtc handlers, which use it in
place of rtc_tm_to_time64().

To clean this up, I'm changing over the read_persistent_clock()
function to the read_persistent_clock64() variant, and change
all the platform specific handlers along with it.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# e360cd37 04-May-2018 Nicholas Piggin <npiggin@gmail.com>

powerpc/time: account broadcast timer event interrupts separately

These are not local timer interrupts but IPIs. It's good to be able
to see how timer offloading is behaving, so split these out into
their own category.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# bc907113 04-May-2018 Nicholas Piggin <npiggin@gmail.com>

powerpc: move timer broadcast code under GENERIC_CLOCKEVENTS_BROADCAST ifdef

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# a7cba02d 04-May-2018 Nicholas Piggin <npiggin@gmail.com>

powerpc: allow soft-NMI watchdog to cover timer interrupts with large decrementers

Large decrementers (e.g., POWER9) can take a very long time to wrap,
so when the timer iterrupt handler sets the decrementer to max so as
to avoid taking another decrementer interrupt when hard enabling
interrupts before running timers, it effectively disables the soft
NMI coverage for timer interrupts.

Fix this by using the traditional 31-bit value instead, which wraps
after a few seconds. masked interrupt code does the same thing, and
in normal operation neither of these paths would ever wrap even the
31 bit value.

Note: the SMP watchdog should catch timer interrupt lockups, but it
is preferable for the local soft-NMI to catch them, mainly to avoid
the IPI.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 3f984620 04-May-2018 Nicholas Piggin <npiggin@gmail.com>

powerpc: generic clockevents broadcast receiver call tick_receive_broadcast

The broadcast tick recipient can call tick_receive_broadcast rather
than re-running the full timer interrupt.

It does not have to check for the next event time, because the sender
already determined the timer has expired. It does not have to test
irq_work_pending, because that's a direct decrementer interrupt and
does not go through the clock events subsystem. And it does not have
to read PURR because that was removed with the previous patch.

This results in no code size change, but both the decrementer and
broadcast path lengths are reduced.

Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 3d3a6021 04-May-2018 Nicholas Piggin <npiggin@gmail.com>

powerpc/pseries: lparcfg calculate PURR on demand

For SPLPAR, lparcfg provides a sum of PURR registers for all CPUs.
Currently this is done by reading PURR in context switch and timer
interrupt, and storing that into a per-CPU variable. These are summed
to provide the value.

This does not work with all timer schemes (e.g., NO_HZ_FULL), and it
is sub-optimal for performance because it reads the PURR register on
every context switch, although that's been difficult to distinguish
from noise in the contxt_switch microbenchmark.

This patch implements the sum by calling a function on each CPU, to
read and add PURR values of each CPU.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# ebb37cf3 04-May-2018 Nicholas Piggin <npiggin@gmail.com>

powerpc/64: irq_work avoid interrupt when called with hardware irqs enabled

irq_work_raise should not cause a decrementer exception unless it is
called from NMI context. Doing so often just results in an immediate
masked decrementer interrupt:

<...>-550 90d... 4us : update_curr_rt <-dequeue_task_rt
<...>-550 90d... 5us : dbs_update_util_handler <-update_curr_rt
<...>-550 90d... 6us : arch_irq_work_raise <-irq_work_queue
<...>-550 90d... 7us : soft_nmi_interrupt <-soft_nmi_common
<...>-550 90d... 7us : printk_nmi_enter <-soft_nmi_interrupt
<...>-550 90d.Z. 8us : rcu_nmi_enter <-soft_nmi_interrupt
<...>-550 90d.Z. 9us : rcu_nmi_exit <-soft_nmi_interrupt
<...>-550 90d... 9us : printk_nmi_exit <-soft_nmi_interrupt
<...>-550 90d... 10us : cpuacct_charge <-update_curr_rt

The soft_nmi_interrupt here is the call into the watchdog, due to the
decrementer interrupt firing with irqs soft-disabled. This is
harmless, but sub-optimal.

When it's not called from NMI context or with interrupts enabled, mark
the decrementer pending in the irq_happened mask directly, rather than
having the masked decrementer interupt handler do it. This will be
replayed at the next local_irq_enable. See the comment for details.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# a6201da3 02-Apr-2018 Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

powerpc: Fix oops due to bad access of lppaca on bare metal

Commit 8e0b634b1327 ("powerpc/64s: Do not allocate lppaca if we are
not virtualized") removed allocation of lppaca on bare metal
platforms. But with CONFIG_PPC_SPLPAR enabled, we still access the
lppaca on bare metal in some code paths.

Fix this but adding runtime checks for SPLPAR (shared processor LPAR).

Fixes: 8e0b634b1327 ("powerpc/64s: Do not allocate lppaca if we are not virtualized")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 890ae797 21-Feb-2018 Alexandre Belloni <alexandre.belloni@bootlin.com>

powerpc/time: stop validating rtc_time in .read_time

The RTC core is always calling rtc_valid_tm after the read_time callback.
It is not necessary to call it just before returning from the callback.

Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 4e26bc4a 19-Dec-2017 Madhavan Srinivasan <maddy@linux.vnet.ibm.com>

powerpc/64: Rename soft_enabled to irq_soft_mask

Rename the paca->soft_enabled to paca->irq_soft_mask as it is no
longer used as a flag for interrupt state, but a mask.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# e0b5687b 19-Dec-2017 Madhavan Srinivasan <maddy@linux.vnet.ibm.com>

powerpc/64: Implement and use soft_enabled_return API

Add a new wrapper function, soft_enabled_return(), added to return
paca->soft_enabled value.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 0b63acf4 19-Dec-2017 Madhavan Srinivasan <maddy@linux.vnet.ibm.com>

powerpc/64: Move set_soft_enabled() and rename

Move set_soft_enabled() from powerpc/kernel/irq.c to asm/hw_irq.c, to
encourage updates to paca->soft_enabled done via these access
function. Add "memory" clobber to hint compiler since
paca->soft_enabled memory is the target here.

Renaming it as soft_enabled_set() will make namespaces works better as
prefix than a postfix when new soft_enabled manipulation functions are
introduced.

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# c2e480ba 19-Dec-2017 Madhavan Srinivasan <maddy@linux.vnet.ibm.com>

powerpc/64: Add #defines for paca->soft_enabled flags

Two #defines IRQS_ENABLED and IRQS_DISABLED are added to be used when
updating paca->soft_enabled. Replace the hardcoded values used when
updating paca->soft_enabled with IRQ_(EN|DIS)ABLED #define. No logic
change.

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 4e287e65 06-Jun-2017 Nicholas Piggin <npiggin@gmail.com>

powerpc: use spin loop primitives in some functions

Use the different spin loop primitives in some simple powerpc
spin loops, including those which will spin as a common case.

This will help to test the spin loop primitives before more
conversions are done.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Add some includes of <linux/processor.h>]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# d4cfb113 27-May-2017 Paul Mackerras <paulus@ozlabs.org>

powerpc: Convert VDSO update function to use new update_vsyscall interface

This converts the powerpc VDSO time update function to use the new
interface introduced in commit 576094b7f0aa ("time: Introduce new
GENERIC_TIME_VSYSCALL", 2012-09-11). Where the old interface gave
us the time as of the last update in seconds and whole nanoseconds,
with the new interface we get the nanoseconds part effectively in
a binary fixed-point format with tk->tkr_mono.shift bits to the
right of the binary point.

With the old interface, the fractional nanoseconds got truncated,
meaning that the value returned by the VDSO clock_gettime function
would have about 1ns of jitter in it compared to the value computed
by the generic timekeeping code in the kernel.

The powerpc VDSO time functions (clock_gettime and gettimeofday)
already work in units of 2^-32 seconds, or 0.23283 ns, because that
makes it simple to split the result into seconds and fractional
seconds, and represent the fractional seconds in either microseconds
or nanoseconds. This is good enough accuracy for now, so this patch
avoids changing how the VDSO works or the interface in the VDSO data
page.

This patch converts the powerpc update_vsyscall_old to be called
update_vsyscall and use the new interface. We convert the fractional
second to units of 2^-32 seconds without truncating to whole nanoseconds.
(There is still a conversion to whole nanoseconds for any legacy users
of the vdso_data/systemcfg stamp_xtime field.)

In addition, this improves the accuracy of the computation of tb_to_xs
for those systems with high-frequency timebase clocks (>= 268.5 MHz)
by doing the right shift in two parts, one before the multiplication and
one after, rather than doing the right shift before the multiplication.
(We can't do all of the right shift after the multiplication unless we
use 128-bit arithmetic.)

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 6b847d79 20-Jun-2017 Santosh Sivaraj <santosh@fossix.org>

powerpc/time: Fix tracing in time.c

Since trace_clock is in a different file and already marked with notrace,
enable tracing in time.c by removing it from the disabled list in Makefile.
Also annotate clocksource read functions and sched_clock with notrace.

Testing: Timer and ftrace selftests run with different trace clocks.

Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Santosh Sivaraj <santosh@fossix.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 6e2f03e2 19-May-2017 Ivan Mikhaylov <ivan@de.ibm.com>

powerpc/[booke|4xx]: Don't clobber TCR[WP] when setting TCR[DIE]

Prevent a kernel panic caused by unintentionally clearing TCR watchdog
bits. At this point in the kernel boot, the watchdog may have already
been enabled by u-boot. The original code's attempt to write to the TCR
register results in an inadvertent clearing of the watchdog
configuration bits, causing the 476 to reset.

Signed-off-by: Ivan Mikhaylov <ivan@de.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 115631c3 30-Mar-2017 Nicolai Stange <nicstange@gmail.com>

powerpc/time: Set ->min_delta_ticks and ->max_delta_ticks

In preparation for making the clockevents core NTP correction aware,
all clockevent device drivers must set ->min_delta_ticks and
->max_delta_ticks rather than ->min_delta_ns and ->max_delta_ns: a
clockevent device's rate is going to change dynamically and thus, the
ratio of ns to ticks ceases to stay invariant.

Make the powerpc arch's clockevent driver initialize these fields properly.

This patch alone doesn't introduce any change in functionality as the
clockevents core still looks exclusively at the (untouched) ->min_delta_ns
and ->max_delta_ns. As soon as this has changed, a followup patch will
purge the initialization of ->min_delta_ns and ->max_delta_ns from this
driver.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Oliver O'Halloran <oohall@gmail.com>
Cc: linuxppc-dev@lists.ozlabs.org
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>


# 32ef5517 05-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare to move cputime functionality from <linux/sched.h> into <linux/sched/cputime.h>

Introduce a trivial, mostly empty <linux/sched/cputime.h> header
to prepare for the moving of cputime functionality out of sched.h.

Update all code that relies on these facilities.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# e6017571 01-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/clock.h>

We are going to split <linux/sched/clock.h> out of <linux/sched.h>, which
will have to be picked up from other headers and .c files.

Create a trivial placeholder <linux/sched/clock.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 9f3768e0 21-Feb-2017 Frédéric Weisbecker <fweisbec@gmail.com>

powerpc: Remove leftover cputime_to_nsecs call causing build error

This type conversion is a leftover that got ignored during the kcpustat
conversion to nanosecs, resulting in build breakage with config having
CONFIG_NO_HZ_FULL=y.

arch/powerpc/kernel/time.c: In function 'running_clock':
arch/powerpc/kernel/time.c:712:2: error: implicit declaration of function 'cputime_to_nsecs' [-Werror=implicit-function-declaration]
return local_clock() - cputime_to_nsecs(kcpustat_this_cpu->cpustat[CPUTIME_STEAL]);

All we need is to remove it.

Fixes: e7f340ca9c07 ("powerpc, sched/cputime: Remove unused cputime definitions")
Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# e7f340ca 30-Jan-2017 Frederic Weisbecker <fweisbec@gmail.com>

powerpc, sched/cputime: Remove unused cputime definitions

Since the core doesn't deal with cputime_t anymore, most of these APIs
have been left unused. Lets remove these.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-33-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# fb8b049c 30-Jan-2017 Frederic Weisbecker <fweisbec@gmail.com>

sched/cputime: Push time to account_system_time() in nsecs

This is one more step toward converting cputime accounting to pure nsecs.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-25-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 18b43a9b 30-Jan-2017 Frederic Weisbecker <fweisbec@gmail.com>

sched/cputime: Push time to account_idle_time() in nsecs

This is one more step toward converting cputime accounting to pure nsecs.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-24-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# be9095ed 30-Jan-2017 Frederic Weisbecker <fweisbec@gmail.com>

sched/cputime: Push time to account_steal_time() in nsecs

This is one more step toward converting cputime accounting to pure nsecs.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-23-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 23244a5c 30-Jan-2017 Frederic Weisbecker <fweisbec@gmail.com>

sched/cputime: Push time to account_user_time() in nsecs

This is one more step toward converting cputime accounting to pure nsecs.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-22-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 5613fda9 30-Jan-2017 Frederic Weisbecker <fweisbec@gmail.com>

sched/cputime: Convert task/group cputime to nsecs

Now that most cputime readers use the transition API which return the
task cputime in old style cputime_t, we can safely store the cputime in
nsecs. This will eventually make cputime statistics less opaque and more
granular. Back and forth convertions between cputime_t and nsecs in order
to deal with cputime_t random granularity won't be needed anymore.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-8-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# c8d7dabf 05-Jan-2017 Frederic Weisbecker <fweisbec@gmail.com>

sched/cputime: Rename vtime_account_user() to vtime_flush()

CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y used to accumulate user time and
account it on ticks and context switches only through the
vtime_account_user() function.

Now this model has been generalized on the 3 archs for all kind of
cputime (system, irq, ...) and all the cputime flushing happens under
vtime_account_user().

So let's rename this function to better reflect its new role.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1483636310-6557-11-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# a19ff1a2 05-Jan-2017 Frederic Weisbecker <fweisbec@gmail.com>

sched/cputime, powerpc/vtime: Accumulate cputime and account only on tick/task switch

Currently CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y accounts the cputime on
any context boundary: irq entry/exit, guest entry/exit, context switch,
etc...

Calling functions such as account_system_time(), account_user_time()
and such can be costly, especially if they are called on many fastpath
such as twice per IRQ. Those functions do more than just accounting to
kcpustat and task cputime. Depending on the config, some subsystems can
perform unpleasant multiplications and divisions, among other things.

So lets accumulate the cputime instead and delay the accounting on ticks
and context switches only.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1483636310-6557-8-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# f828c3d0 05-Jan-2017 Frederic Weisbecker <fweisbec@gmail.com>

sched/cputime, powerpc: Migrate stolen_time field to the accounting structure

That in order to gather all cputime accumulation to the same place.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1483636310-6557-7-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 8c8b73c4 05-Jan-2017 Frederic Weisbecker <fweisbec@gmail.com>

sched/cputime, powerpc: Prepare accounting structure for cputime flush on tick

In order to prepare for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y to delay
cputime accounting to the tick, provide finegrained accumulators to
powerpc in order to store the cputime until flushing.

While at it, normalize the name of several fields according to common
cputime naming.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1483636310-6557-6-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 90d08ba2 05-Jan-2017 Frederic Weisbecker <fweisbec@gmail.com>

sched/cputime, powerpc32: Fix stale scaled stime on context switch

On context switch with powerpc32, the cputime is accumulated in the
thread_info struct. So the switching-in task must move forward its
start time snapshot to the current time in order to later compute the
delta spent in system mode.

This is what we do for the normal cputime by initializing the starttime
field to the value of the previous task's starttime which got freshly
updated.

But we are missing the update of the scaled cputime start time. As a
result we may be accounting too much scaled cputime later.

Fix this by initializing the scaled cputime the same way we do for
normal cputime.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1483636310-6557-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# a5a1d1c2 21-Dec-2016 Thomas Gleixner <tglx@linutronix.de>

clocksource: Use a plain u64 instead of cycle_t

There is no point in having an extra type for extra confusion. u64 is
unambiguous.

Conversion was done with the following coccinelle script:

@rem@
@@
-typedef u64 cycle_t;

@fix@
typedef cycle_t;
@@
-cycle_t
+u64

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>


# 7c0f6ba6 24-Dec-2016 Linus Torvalds <torvalds@linux-foundation.org>

Replace <asm/uaccess.h> with <linux/uaccess.h> globally

This was entirely automated, using the script by Al:

PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 40565b5a 14-Nov-2016 Stanislaw Gruszka <sgruszka@redhat.com>

sched/cputime, powerpc, s390: Make scaled cputime arch specific

Only s390 and powerpc have hardware facilities allowing to measure
cputimes scaled by frequency. On all other architectures
utimescaled/stimescaled are equal to utime/stime (however they are
accounted separately).

Remove {u,s}timescaled accounting on all architectures except
powerpc and s390, where those values are explicitly accounted
in the proper places.

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161031162143.GB12646@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 7008eb99 14-Nov-2016 Stanislaw Gruszka <sgruszka@redhat.com>

sched/cputime, powerpc: Remove cputime_last_delta global variable

Since commit:

cf9efce0ce313 ("powerpc: Account time using timebase rather than PURR")

cputime_last_delta is not initialized to other value than 0, hence it's
not used except zero check and cputime_to_scaled() just returns
the argument.

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1479175612-14718-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 0545d543 05-Sep-2016 Daniel Axtens <dja@axtens.net>

powerpc/sparse: Add more assembler prototypes

Another set of things that are only called from assembler and so need
prototypes to keep sparse happy.

Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 9445aa1a 13-Jan-2016 Al Viro <viro@zeniv.linux.org.uk>

ppc: move exports to definitions

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c223c903 17-May-2016 Christophe Leroy <christophe.leroy@c-s.fr>

powerpc32: provide VIRT_CPU_ACCOUNTING

This patch provides VIRT_CPU_ACCOUTING to PPC32 architecture.
PPC32 doesn't have the PACA structure, so we use the task_info
structure to store the accounting data.

In order to reuse on PPC32 the PPC64 functions, all u64 data has
been replaced by 'unsigned long' so that it is u32 on PPC32 and
u64 on PPC64

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>


# 79901024 01-Jul-2016 Oliver O'Halloran <oohall@gmail.com>

powerpc/timer: Large Decrementer support

Power ISAv3 adds a large decrementer (LD) mode which increases the size
of the decrementer register. The size of the enlarged decrementer
register is between 32 and 64 bits with the exact size being dependent
on the implementation. When in LD mode, reads are sign extended to 64
bits and a decrementer exception is raised when the high bit is set (i.e
the value goes below zero). Writes however are truncated to the physical
register width so some care needs to be taken to ensure that the high
bit is not set when reloading the decrementer. This patch adds support
for using the LD inside the host kernel on processors that support it.

When LD mode is supported firmware will supply the ibm,dec-bits property
for CPU nodes to allow the kernel to determine the maximum decrementer
value. Enabling LD mode is a hypervisor privileged operation so the kernel
can only enable it manually when running in hypervisor mode. Guests that
support LD mode can request it using the "ibm,client-architecture-support"
firmware call (not implemented in this patch) or some other platform
specific method. If this property is not supplied then the traditional
decrementer width of 32 bit is assumed and LD mode will not be enabled.

This patch was based on initial work by Jack Miller.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 169047f4 30-May-2016 Arnd Bergmann <arnd@arndb.de>

rtc: powerpc: provide rtc_class_ops directly

The rtc-generic driver provides an architecture specific
wrapper on top of the generic rtc_class_ops abstraction,
and powerpc has another abstraction on top, which is a bit
silly.

This changes the powerpc rtc-generic device to provide its
rtc_class_ops directly, to reduce the number of layers
by one.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>


# 7f92bc56 05-Jan-2016 Daniel Axtens <dja@axtens.net>

powerpc: sparse: Include headers for __weak symbols

Sometimes when sparse warns about undefined symbols, it isn't
because they should have 'static' added, it's because they're
overriding __weak symbols defined elsewhere, and the header has
been missed.

Fix a few of them by adding appropriate headers.

Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 00b912b0 15-Dec-2015 Daniel Axtens <dja@axtens.net>

powerpc: Remove broken GregorianDay()

GregorianDay() is supposed to calculate the day of the week
(tm->tm_wday) for a given day/month/year. In that calcuation it
indexed into an array called MonthOffset using tm->tm_mon-1. However
tm_mon is zero-based, not one-based, so this is off-by-one. It also
means that every January, GregoiranDay() will access element -1 of
the MonthOffset array.

It also doesn't appear to be a correct algorithm either: see in
contrast kernel/time/timeconv.c's time_to_tm function.

It's been broken forever, which suggests no-one in userland uses
this. It looks like no-one in the kernel uses tm->tm_wday either
(see e.g. drivers/rtc/rtc-ds1305.c:319).

tm->tm_wday is conventionally set to -1 when not available in
hardware so we can simply set it to -1 and drop the function.
(There are over a dozen other drivers in drivers/rtc that do
this.)

Found using UBSAN.

Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andrew Morton <akpm@linux-foundation.org> # as an example of what UBSan finds.
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Cc: rtc-linux@googlegroups.com
Signed-off-by: Daniel Axtens <dja@axtens.net>
Acked-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 37a13e78 16-Jul-2015 Viresh Kumar <viresh.kumar@linaro.org>

powerpc/time: Migrate to new 'set-state' interface

Migrate powerpc driver to the new 'set-state' interface provided by
clockevents core, the earlier 'set-mode' interface is marked obsolete
now.

This also enables us to implement callbacks for new states of clockevent
devices, for example: ONESHOT_STOPPED.

We weren't doing anything in ->set_mode(ONSHOT) and so
set_state_oneshot() isn't implemented.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>


# 8f6b9512 01-May-2015 Paul Gortmaker <paul.gortmaker@windriver.com>

powerpc: use device_initcall for registering rtc devices

Currently these two RTC devices are in core platform code
where it is not possible for them to be modular. It will
never be modular, so using module_init as an alias for
__initcall can be somewhat misleading.

Fix this up now, so that we can relocate module_init from
init.h into module.h in the future. If we don't do this, we'd
have to add module.h to obviously non-modular code, and that
would be a worse thing.

Note that direct use of __initcall is discouraged, vs. one
of the priority categorized subgroups. As __initcall gets
mapped onto device_initcall, our use of device_initcall
directly in this change means that the runtime impact is
zero -- they will remain at level 6 in initcall ordering.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Geoff Levand <geoff@infradead.org>
Acked-by: Geoff Levand <geoff@infradead.org>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>


# b6c295df 27-Mar-2015 Paul Mackerras <paulus@samba.org>

KVM: PPC: Book3S HV: Accumulate timing information for real-mode code

This reads the timebase at various points in the real-mode guest
entry/exit code and uses that to accumulate total, minimum and
maximum time spent in those parts of the code. Currently these
times are accumulated per vcpu in 5 parts of the code:

* rm_entry - time taken from the start of kvmppc_hv_entry() until
just before entering the guest.
* rm_intr - time from when we take a hypervisor interrupt in the
guest until we either re-enter the guest or decide to exit to the
host. This includes time spent handling hcalls in real mode.
* rm_exit - time from when we decide to exit the guest until the
return from kvmppc_hv_entry().
* guest - time spend in the guest
* cede - time spent napping in real mode due to an H_CEDE hcall
while other threads in the same vcore are active.

These times are exposed in debugfs in a directory per vcpu that
contains a file called "timings". This file contains one line for
each of the 5 timings above, with the name followed by a colon and
4 numbers, which are the count (number of times the code has been
executed), the total time, the minimum time, and the maximum time,
all in nanoseconds.

The overhead of the extra code amounts to about 30ns for an hcall that
is handled in real mode (e.g. H_SET_DABR), which is about 25%. Since
production environments may not wish to incur this overhead, the new
code is conditional on a new config symbol,
CONFIG_KVM_BOOK3S_HV_EXIT_TIMING.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>


# 4be1b297 12-Feb-2015 Cyril Bur <cyrilbur@gmail.com>

powerpc: add running_clock for powerpc to prevent spurious softlockup warnings

On POWER8 virtualised kernels the VTB register can be read to have a view
of time that only increases while the guest is running. This will prevent
guests from seeing time jump if a guest is paused for significant amounts
of time.

On POWER7 and below virtualised kernels stolen time is subtracted from
local_clock as a best effort approximation. This will not eliminate
spurious warnings in the case of a suspended guest but may reduce the
occurance in the case of softlockups due to host over commit.

Bare metal kernels should avoid reading the VTB as KVM does not restore
sane values when not executing, the approxmation is fine as host kernels
won't observe any stolen time.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Jones <drjones@redhat.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: chai wen <chaiw.fnst@cn.fujitsu.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Ben Zhang <benzh@chromium.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f0d37300 03-Dec-2014 Kevin Hao <haokexin@gmail.com>

powerpc: call of_clk_init() from time_init()

So the boards which has COMMON_CLK enabled don't have to
invoke this in its board specific file.

Signed-off-by: Kevin Hao <haokexin@gmail.com>
Acked-by: Scott Wood <scottwood@freescale.com>
Acked-by: Michael Turquette <mturquette@linaro.org>
Signed-off-by: Michael Turquette <mturquette@linaro.org>


# 16b1d26e 14-Oct-2014 Neelesh Gupta <neelegup@linux.vnet.ibm.com>

rtc/tpo: Driver to support rtc and wakeup on PowerNV platform

The patch implements the OPAL rtc driver that binds with the rtc
driver subsystem. The driver uses the platform device infrastructure
to probe the rtc device and register it to rtc class framework. The
'wakeup' is supported depending upon the property 'has-tpo' present
in the OF node. It provides a way to load the generic rtc driver in
in the absence of an OPAL driver.

The patch also moves the existing OPAL rtc get/set time interfaces to the
new driver and exposes the necessary OPAL calls using EXPORT_SYMBOL_GPL.

Test results:
-------------
Host:
[root@tul169p1 ~]# ls -l /sys/class/rtc/
total 0
lrwxrwxrwx 1 root root 0 Oct 14 03:07 rtc0 -> ../../devices/opal-rtc/rtc/rtc0
[root@tul169p1 ~]# cat /sys/devices/opal-rtc/rtc/rtc0/time
08:10:07
[root@tul169p1 ~]# echo `date '+%s' -d '+ 2 minutes'` > /sys/class/rtc/rtc0/wakealarm
[root@tul169p1 ~]# cat /sys/class/rtc/rtc0/wakealarm
1413274345
[root@tul169p1 ~]#

FSP:
$ smgr mfgState
standby
$ rtim timeofday

System time is valid: 2014/10/14 08:12:04.225115

$ smgr mfgState
ipling
$

CC: devicetree@vger.kernel.org
CC: tglx@linutronix.de
CC: rtc-linux@googlegroups.com
CC: a.zummo@towertech.it
Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 69111bac 21-Oct-2014 Christoph Lameter <cl@linux.com>

powerpc: Replace __get_cpu_var uses

This still has not been merged and now powerpc is the only arch that does
not have this change. Sorry about missing linuxppc-dev before.

V2->V2
- Fix up to work against 3.18-rc1

__get_cpu_var() is used for multiple purposes in the kernel source. One of
them is address calculation via the form &__get_cpu_var(x). This calculates
the address for the instance of the percpu variable of the current processor
based on an offset.

Other use cases are for storing and retrieving data from the current
processors percpu area. __get_cpu_var() can be used as an lvalue when
writing data or on the right side of an assignment.

__get_cpu_var() is defined as :

__get_cpu_var() always only does an address determination. However, store
and retrieve operations could use a segment prefix (or global register on
other platforms) to avoid the address calculation.

this_cpu_write() and this_cpu_read() can directly take an offset into a
percpu area and use optimized assembly code to read and write per cpu
variables.

This patch converts __get_cpu_var into either an explicit address
calculation using this_cpu_ptr() or into a use of this_cpu operations that
use the offset. Thereby address calculations are avoided and less registers
are used when code is generated.

At the end of the patch set all uses of __get_cpu_var have been removed so
the macro is removed too.

The patch set includes passes over all arches as well. Once these operations
are used throughout then specialized macros can be defined in non -x86
arches as well in order to optimize per cpu access by f.e. using a global
register that may be set to the per cpu base.

Transformations done to __get_cpu_var()

1. Determine the address of the percpu instance of the current processor.

DEFINE_PER_CPU(int, y);
int *x = &__get_cpu_var(y);

Converts to

int *x = this_cpu_ptr(&y);

2. Same as #1 but this time an array structure is involved.

DEFINE_PER_CPU(int, y[20]);
int *x = __get_cpu_var(y);

Converts to

int *x = this_cpu_ptr(y);

3. Retrieve the content of the current processors instance of a per cpu
variable.

DEFINE_PER_CPU(int, y);
int x = __get_cpu_var(y)

Converts to

int x = __this_cpu_read(y);

4. Retrieve the content of a percpu struct

DEFINE_PER_CPU(struct mystruct, y);
struct mystruct x = __get_cpu_var(y);

Converts to

memcpy(&x, this_cpu_ptr(&y), sizeof(x));

5. Assignment to a per cpu variable

DEFINE_PER_CPU(int, y)
__get_cpu_var(y) = x;

Converts to

__this_cpu_write(y, x);

6. Increment/Decrement etc of a per cpu variable

DEFINE_PER_CPU(int, y);
__get_cpu_var(y)++

Converts to

__this_cpu_inc(y)

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
Signed-off-by: Christoph Lameter <cl@linux.com>
[mpe: Fix build errors caused by set/or_softirq_pending(), and rework
assignment in __set_breakpoint() to use memcpy().]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# e51df2c1 19-Aug-2014 Anton Blanchard <anton@samba.org>

powerpc: Make a bunch of things static

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# e1802b06 19-Aug-2014 Anton Blanchard <anton@samba.org>

powerpc: Move more symbol exports next to function definitions

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 23f66e2d 27-Aug-2014 Tejun Heo <tj@kernel.org>

Revert "powerpc: Replace __get_cpu_var uses"

This reverts commit 5828f666c069af74e00db21559f1535103c9f79a due to
build failure after merging with pending powerpc changes.

Link: http://lkml.kernel.org/g/20140827142243.6277eaff@canb.auug.org.au

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 5828f666 16-Aug-2014 Christoph Lameter <cl@linux.com>

powerpc: Replace __get_cpu_var uses

__get_cpu_var() is used for multiple purposes in the kernel source. One of
them is address calculation via the form &__get_cpu_var(x). This calculates
the address for the instance of the percpu variable of the current processor
based on an offset.

Other use cases are for storing and retrieving data from the current
processors percpu area. __get_cpu_var() can be used as an lvalue when
writing data or on the right side of an assignment.

__get_cpu_var() is defined as :

#define __get_cpu_var(var) (*this_cpu_ptr(&(var)))

__get_cpu_var() always only does an address determination. However, store
and retrieve operations could use a segment prefix (or global register on
other platforms) to avoid the address calculation.

this_cpu_write() and this_cpu_read() can directly take an offset into a
percpu area and use optimized assembly code to read and write per cpu
variables.

This patch converts __get_cpu_var into either an explicit address
calculation using this_cpu_ptr() or into a use of this_cpu operations that
use the offset. Thereby address calculations are avoided and less registers
are used when code is generated.

At the end of the patch set all uses of __get_cpu_var have been removed so
the macro is removed too.

The patch set includes passes over all arches as well. Once these operations
are used throughout then specialized macros can be defined in non -x86
arches as well in order to optimize per cpu access by f.e. using a global
register that may be set to the per cpu base.

Transformations done to __get_cpu_var()

1. Determine the address of the percpu instance of the current processor.

DEFINE_PER_CPU(int, y);
int *x = &__get_cpu_var(y);

Converts to

int *x = this_cpu_ptr(&y);

2. Same as #1 but this time an array structure is involved.

DEFINE_PER_CPU(int, y[20]);
int *x = __get_cpu_var(y);

Converts to

int *x = this_cpu_ptr(y);

3. Retrieve the content of the current processors instance of a per cpu
variable.

DEFINE_PER_CPU(int, y);
int x = __get_cpu_var(y)

Converts to

int x = __this_cpu_read(y);

4. Retrieve the content of a percpu struct

DEFINE_PER_CPU(struct mystruct, y);
struct mystruct x = __get_cpu_var(y);

Converts to

memcpy(&x, this_cpu_ptr(&y), sizeof(x));

5. Assignment to a per cpu variable

DEFINE_PER_CPU(int, y)
__get_cpu_var(y) = x;

Converts to

__this_cpu_write(y, x);

6. Increment/Decrement etc of a per cpu variable

DEFINE_PER_CPU(int, y);
__get_cpu_var(y)++

Converts to

__this_cpu_inc(y)

tj: Folded a fix patch.
http://lkml.kernel.org/g/alpine.DEB.2.11.1408172143020.9652@gentwo.org

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>


# 4a0e6377 16-Jul-2014 Thomas Gleixner <tglx@linutronix.de>

clocksource: Get rid of cycle_last

cycle_last was added to the clocksource to support the TSC
validation. We moved that to the core code, so we can get rid of the
extra copy.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>


# 6e0fdf9a 20-May-2014 Paul Bolle <pebolle@tiscali.nl>

powerpc: fix typo 'CONFIG_PMAC'

Commit b0d278b7d3ae ("powerpc/perf_event: Reduce latency of calling
perf_event_do_pending") added a check for CONFIG_PMAC were a check for
CONFIG_PPC_PMAC was clearly intended.

Fixes: b0d278b7d3ae ("powerpc/perf_event: Reduce latency of calling perf_event_do_pending")
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 8050936c 09-May-2014 Anton Blanchard <anton@samba.org>

powerpc: irq work racing with timer interrupt can result in timer interrupt hang

I am seeing an issue where a CPU running perf eventually hangs.
Traces show timer interrupts happening every 4 seconds even
when a userspace task is running on the CPU. /proc/timer_list
also shows pending hrtimers have not run in over an hour,
including the scheduler.

Looking closer, decrementers_next_tb is getting set to
0xffffffffffffffff, and at that point we will never take
a timer interrupt again.

In __timer_interrupt() we set decrementers_next_tb to
0xffffffffffffffff and rely on ->event_handler to update it:

*next_tb = ~(u64)0;
if (evt->event_handler)
evt->event_handler(evt);

In this case ->event_handler is hrtimer_interrupt. This will eventually
call back through the clockevents code with the next event to be
programmed:

static int decrementer_set_next_event(unsigned long evt,
struct clock_event_device *dev)
{
/* Don't adjust the decrementer if some irq work is pending */
if (test_irq_work_pending())
return 0;
__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;

If irq work came in between these two points, we will return
before updating decrementers_next_tb and we never process a timer
interrupt again.

This looks to have been introduced by 0215f7d8c53f (powerpc: Fix races
with irq_work). Fix it by removing the early exit and relying on
code later on in the function to force an early decrementer:

/* We may have raced with new irq work */
if (test_irq_work_pending())
set_dec(1);

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@vger.kernel.org # 3.14+
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 0d948730 25-Feb-2014 Preeti U Murthy <preeti@linux.vnet.ibm.com>

cpuidle/powernv: Add "Fast-Sleep" CPU idle state

Fast sleep is one of the deep idle states on Power8 in which local timers of
CPUs stop. On PowerPC we do not have an external clock device which can
handle wakeup of such CPUs. Now that we have the support in the tick broadcast
framework for archs that do not sport such a device and the low level support
for fast sleep, enable it in the cpuidle framework on PowerNV.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 1b783955 25-Feb-2014 Preeti U Murthy <preeti@linux.vnet.ibm.com>

powerpc: Split timer_interrupt() into timer handling and interrupt handling routines

Split timer_interrupt(), which is the local timer interrupt handler on ppc
into routines called during regular interrupt handling and __timer_interrupt(),
which takes care of running local timers and collecting time related stats.

This will enable callers interested only in running expired local timers to
directly call into __timer_interupt(). One of the use cases of this is the
tick broadcast IPI handling in which the sleeping CPUs need to handle the local
timers that have expired.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 1b67bee1 25-Feb-2014 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

powerpc: Implement tick broadcast IPI as a fixed IPI message

For scalability and performance reasons, we want the tick broadcast IPIs
to be handled as efficiently as possible. Fixed IPI messages
are one of the most efficient mechanisms available - they are faster than
the smp_call_function mechanism because the IPI handlers are fixed and hence
they don't involve costly operations such as adding IPI handlers to the target
CPU's function queue, acquiring locks for synchronization etc.

Luckily we have an unused IPI message slot, so use that to implement
tick broadcast IPIs efficiently.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
[Functions renamed to tick_broadcast* and Changelog modified by
Preeti U. Murthy<preeti@linux.vnet.ibm.com>]
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Geoff Levand <geoff@infradead.org> [For the PS3 part]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 0215f7d8 13-Jan-2014 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc: Fix races with irq_work

If we set irq_work on a processor and immediately afterward, before the
irq work has a chance to be processed, we change the decrementer value,
we can seriously delay the handling of that irq_work.

Fix it by checking in a few places for pending irq work, first before
changing the decrementer in decrementer_set_next_event() and after
changing it in the same function and in timer_interrupt().

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# c041cfa2 23-Jan-2013 fan.du <fan.du@windriver.com>

powerpc: Make irq_stat.timers_irqs counting more specific

Current irq_stat.timers_irqs counting doesn't discriminate timer event handler
and other timer interrupt(like arch_irq_work_raise). Sometimes we need to know
exactly how much interrupts timer event handler fired, so let's be more specific
on this.

Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 84b07386 16-Nov-2013 Anton Blanchard <anton@samba.org>

powerpc/pseries: Duplicate dtl entries sometimes sent to userspace

When reading from the dispatch trace log (dtl) userspace interface, I
sometimes see duplicate entries. One example:

# hexdump -C dtl.out

00000000 07 04 00 0c 00 00 48 44 00 00 00 00 00 00 00 00
00000010 00 0c a0 b4 16 83 6d 68 00 00 00 00 00 00 00 00
00000020 00 00 00 00 10 00 13 50 80 00 00 00 00 00 d0 32

00000030 07 04 00 0c 00 00 48 44 00 00 00 00 00 00 00 00
00000040 00 0c a0 b4 16 83 6d 68 00 00 00 00 00 00 00 00
00000050 00 00 00 00 10 00 13 50 80 00 00 00 00 00 d0 32

The problem is in scan_dispatch_log() where we call dtl_consumer()
but bail out before incrementing the index.

To fix this I moved dtl_consumer() after the timebase comparison.

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@vger.kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 7ffcf8ec 06-Aug-2013 Anton Blanchard <anton@samba.org>

powerpc: Fix little endian lppaca, slb_shadow and dtl_entry

The lppaca, slb_shadow and dtl_entry hypervisor structures are
big endian, so we have to byte swap them in little endian builds.

LE KVM hosts will also need to be fixed but for now add an #error
to remind us.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 6f7aba7b 06-Aug-2013 Anton Blanchard <anton@samba.org>

powerpc: Add some endian annotations to time and xics code

Fix a couple of sparse warnings.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 8c6ffba0 14-Jul-2013 Rusty Russell <rusty@rustcorp.com.au>

PTR_RET is now PTR_ERR_OR_ZERO(): Replace most.

Sweep of the simple cases.

Cc: netdev@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 061d19f2 24-Jun-2013 Paul Gortmaker <paul.gortmaker@windriver.com>

powerpc: Delete __cpuinit usage from all users

The __cpuinit type of throwaway sections might have made sense
some time ago when RAM was more constrained, but now the savings
do not offset the cost and complications. For example, the fix in
commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
is a good example of the nasty type of bugs that can be created
with improper use of the various __init prefixes.

After a discussion on LKML[1] it was decided that cpuinit should go
the way of devinit and be phased out. Once all the users are gone,
we can then finally remove the macros themselves from linux/init.h.

This removes all the powerpc uses of the __cpuinit macros. There
are no __CPUINIT users in assembly files in powerpc.

[1] https://lkml.org/lkml/2013/5/20/589

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Josh Boyer <jwboyer@gmail.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 09652b00 09-Mar-2013 Adrian-Leonard Radu <ady8radu@gmail.com>

powerpc: Use PTR_RET instead of IS_ERR/PTR_ERR

Signed-off-by: Adrian-Leonard Radu <ady8radu@gmail.com>
Acked-by: Geoff Levand <geoff@infradead.org>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>


# 689dfa89 15-Jan-2013 Tiejun Chen <tiejun.chen@windriver.com>

powerpc: Max next_tb to prevent from replaying timer interrupt

With lazy interrupt, we always call __check_irq_replaysome with
decrementers_next_tb to check if we need to replay timer interrupt.
So in hotplug case we also need to set decrementers_next_tb as MAX
to make sure __check_irq_replay don't replay timer interrupt
when return as we expect, otherwise we'll trap here infinitely.

Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# c11f11fc 20-Jan-2013 Frederic Weisbecker <fweisbec@gmail.com>

kvm: Prepare to add generic guest entry/exit callbacks

Do some ground preparatory work before adding guest_enter()
and guest_exit() context tracking callbacks. Those will
be later used to read the guest cputime safely when we
run in full dynticks mode.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>


# abf917cd 24-Jul-2012 Frederic Weisbecker <fweisbec@gmail.com>

cputime: Generic on-demand virtual cputime accounting

If we want to stop the tick further idle, we need to be
able to account the cputime without using the tick.

Virtual based cputime accounting solves that problem by
hooking into kernel/user boundaries.

However implementing CONFIG_VIRT_CPU_ACCOUNTING require
low level hooks and involves more overhead. But we already
have a generic context tracking subsystem that is required
for RCU needs by archs which plan to shut down the tick
outside idle.

This patch implements a generic virtual based cputime
accounting that relies on these generic kernel/user hooks.

There are some upsides of doing this:

- This requires no arch code to implement CONFIG_VIRT_CPU_ACCOUNTING
if context tracking is already built (already necessary for RCU in full
tickless mode).

- We can rely on the generic context tracking subsystem to dynamically
(de)activate the hooks, so that we can switch anytime between virtual
and tick based accounting. This way we don't have the overhead
of the virtual accounting when the tick is running periodically.

And one downside:

- There is probably more overhead than a native virtual based cputime
accounting. But this relies on hooks that are already set anyway.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>


# 023f333a 17-Dec-2012 Jason Gunthorpe <jgg@ziepe.ca>

NTP: Add a CONFIG_RTC_SYSTOHC configuration

The purpose of this option is to allow ARM/etc systems that rely on the
class RTC subsystem to have the same kind of automatic NTP based
synchronization that we have on PC platforms. Today ARM does not
implement update_persistent_clock and makes extensive use of the class
RTC system.

When enabled CONFIG_RTC_SYSTOHC will provide a generic
rtc_update_persistent_clock that stores the current time in the RTC and
is intended complement the existing CONFIG_RTC_HCTOSYS option that loads
the RTC at boot.

Like with RTC_HCTOSYS the platform's update_persistent_clock is used
first, if it works. Platforms with mixed class RTC and non-RTC drivers
need to return ENODEV when class RTC should be used. Such an update for
PPC is included in this patch.

Long term, implementations of update_persistent_clock should migrate to
proper class RTC drivers and use CONFIG_RTC_SYSTOHC instead.

Tested on ARM kirkwood and PPC405

Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>


# ce73ec6d 08-Nov-2012 Shan Hai <shan.hai@windriver.com>

powerpc/vdso: Remove redundant locking in update_vsyscall_tz()

The locking in update_vsyscall_tz() is not only unnecessary because the vdso
code copies the data unproteced in __kernel_gettimeofday() but also
introduces a hard to reproduce race condition between update_vsyscall()
and update_vsyscall_tz(), which causes user space process to loop
forever in vdso code.

The following patch removes the locking from update_vsyscall_tz().

Locking is not only unnecessary because the vdso code copies the data
unprotected in __kernel_gettimeofday() but also erroneous because updating
the tb_update_count is not atomic and introduces a hard to reproduce race
condition between update_vsyscall() and update_vsyscall_tz(), which further
causes user space process to loop forever in vdso code.

The below scenario describes the race condition,
x==0 Boot CPU other CPU
proc_P: x==0
timer interrupt
update_vsyscall
x==1 x++;sync settimeofday
update_vsyscall_tz
x==2 x++;sync
x==3 sync;x++
sync;x++
proc_P: x==3 (loops until x becomes even)

Because the ++ operator would be implemented as three instructions and not
atomic on powerpc.

A similar change was made for x86 in commit 6c260d58634
("x86: vdso: Remove bogus locking in update_vsyscall_tz")

Signed-off-by: Shan Hai <shan.hai@windriver.com>
CC: <stable@vger.kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 1b2852b1 19-Nov-2012 Frederic Weisbecker <fweisbec@gmail.com>

vtime: Warn if irqs aren't disabled on system time accounting APIs

System time accounting APIs such as vtime_account_system() and
vtime_account_idle() need to be irqsafe. Current callers include
irq entry, exit and kvm, all of which have been checked against that
requirement. Now it's better to grow that with an automatic check
in case we have further callers or we missed something.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>


# e3942ba0 13-Nov-2012 Frederic Weisbecker <fweisbec@gmail.com>

vtime: Consolidate a bit the ctx switch code

On ia64 and powerpc, vtime context switch only consists
in flushing system and user pending time, plus a few
arch housekeeping.

Consolidate that into a generic implementation. s390 is
a special case because pending user and system time accounting
there is hard to dissociate. So it's keeping its own implementation.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>


# bcebdf84 13-Nov-2012 Frederic Weisbecker <fweisbec@gmail.com>

vtime: Explicitly account pending user time on process tick

All vtime implementations just flush the user time on process
tick. Consolidate that in generic code by calling a user time
accounting helper. This avoids an indirect call in ia64 and
prepare to also consolidate vtime context switch code.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>


# fd25b4c2 13-Nov-2012 Frederic Weisbecker <fweisbec@gmail.com>

vtime: Remove the underscore prefix invasion

Prepending irq-unsafe vtime APIs with underscores was actually
a bad idea as the result is a big mess in the API namespace that
is even waiting to be further extended. Also these helpers
are always called from irq safe callers except kvm. Just
provide a vtime_account_system_irqsafe() for this specific
case so that we can remove the underscore prefix on other
vtime functions.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>


# 11113334 24-Oct-2012 Frederic Weisbecker <fweisbec@gmail.com>

vtime: Make vtime_account_system() irqsafe

vtime_account_system() currently has only one caller with
vtime_account() which is irq safe.

Now we are going to call it from other places like kvm where
irqs are not always disabled by the time we account the cputime.

So let's make it irqsafe. The arch implementation part is now
prefixed with "__".

vtime_account_idle() arch implementation is prefixed accordingly
to stay consistent.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>


# a7e1a9e3 08-Sep-2012 Frederic Weisbecker <fweisbec@gmail.com>

vtime: Consolidate system/idle context detection

Move the code that finds out to which context we account the
cputime into generic layer.

Archs that consider the whole time spent in the idle task as idle
time (ia64, powerpc) can rely on the generic vtime_account()
and implement vtime_account_system() and vtime_account_idle(),
letting the generic code to decide when to call which API.

Archs that have their own meaning of idle time, such as s390
that only considers the time spent in CPU low power mode as idle
time, can just override vtime_account().

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>


# bf9fae9f 08-Sep-2012 Frederic Weisbecker <fweisbec@gmail.com>

cputime: Use a proper subsystem naming for vtime related APIs

Use a naming based on vtime as a prefix for virtual based
cputime accounting APIs:

- account_system_vtime() -> vtime_account()
- account_switch_vtime() -> vtime_task_switch()

It makes it easier to allow for further declension such
as vtime_account_system(), vtime_account_idle(), ... if we
want to find out the context we account to from generic code.

This also make it better to know on which subsystem these APIs
refer to.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>


# 70639421 04-Sep-2012 John Stultz <john.stultz@linaro.org>

time: Convert CONFIG_GENERIC_TIME_VSYSCALL to CONFIG_GENERIC_TIME_VSYSCALL_OLD

To help migrate archtectures over to the new update_vsyscall method,
redfine CONFIG_GENERIC_TIME_VSYSCALL as CONFIG_GENERIC_TIME_VSYSCALL_OLD

Cc: Tony Luck <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>


# 189374ae 04-Sep-2012 John Stultz <john.stultz@linaro.org>

time: Move update_vsyscall definitions to timekeeper_internal.h

Since users will need to include timekeeper_internal.h, move
update_vsyscall definitions to timekeeper_internal.h.

Cc: Tony Luck <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>


# e72bbbab 10-Sep-2012 Li Zhong <zhong@linux.vnet.ibm.com>

powerpc/trace: Fix interrupt tracepoints vs. RCU

There are a few tracepoints in the interrupt code path, which is before
irq_enter(), or after irq_exit(), like
trace_irq_entry()/trace_irq_exit() in do_IRQ(),
trace_timer_interrupt_entry()/trace_timer_interrupt_exit() in
timer_interrupt().

If the interrupt is from idle(), and because tracepoint contains RCU
read-side critical section, we could see following suspicious RCU usage
reported:

[ 145.127743] ===============================
[ 145.127747] [ INFO: suspicious RCU usage. ]
[ 145.127752] 3.6.0-rc3+ #1 Not tainted
[ 145.127755] -------------------------------
[ 145.127759] /root/.workdir/linux/arch/powerpc/include/asm/trace.h:33
suspicious rcu_dereference_check() usage!
[ 145.127765]
[ 145.127765] other info that might help us debug this:
[ 145.127765]
[ 145.127771]
[ 145.127771] RCU used illegally from idle CPU!
[ 145.127771] rcu_scheduler_active = 1, debug_locks = 0
[ 145.127777] RCU used illegally from extended quiescent state!
[ 145.127781] no locks held by swapper/0/0.
[ 145.127785]
[ 145.127785] stack backtrace:
[ 145.127789] Call Trace:
[ 145.127796] [c00000000108b530] [c000000000013c40] .show_stack
+0x70/0x1c0 (unreliable)
[ 145.127806] [c00000000108b5e0]
[c0000000000f59d8] .lockdep_rcu_suspicious+0x118/0x150
[ 145.127813] [c00000000108b680] [c00000000000fc58] .do_IRQ+0x498/0x500
[ 145.127820] [c00000000108b750] [c000000000003950]
hardware_interrupt_common+0x150/0x180
[ 145.127828] --- Exception: 501 at .plpar_hcall_norets+0x84/0xd4
[ 145.127828] LR = .check_and_cede_processor+0x38/0x70
[ 145.127836] [c00000000108bab0] [c0000000000665dc] .shared_cede_loop
+0x5c/0x100
[ 145.127844] [c00000000108bb70] [c000000000588ab0] .cpuidle_enter
+0x30/0x50
[ 145.127850] [c00000000108bbe0]
[c000000000588b0c] .cpuidle_enter_state+0x3c/0xb0
[ 145.127857] [c00000000108bc60] [c000000000589730] .cpuidle_idle_call
+0x150/0x6c0
[ 145.127863] [c00000000108bd30] [c000000000058440] .pSeries_idle
+0x10/0x40
[ 145.127870] [c00000000108bda0] [c00000000001683c] .cpu_idle
+0x18c/0x2d0
[ 145.127876] [c00000000108be60] [c00000000000b434] .rest_init
+0x124/0x1b0
[ 145.127884] [c00000000108bef0] [c0000000009d0d28] .start_kernel
+0x568/0x588
[ 145.127890] [c00000000108bf90] [c000000000009660] .start_here_common
+0x20/0x40

This is because the RCU usage in interrupt context should be used in
area marked by rcu_irq_enter()/rcu_irq_exit(), called in
irq_enter()/irq_exit() respectively.

Move them into the irq_enter()/irq_exit() area to avoid the reporting.

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# dabe859e 26-Jul-2012 Paul Mackerras <paulus@samba.org>

powerpc: Give hypervisor decrementer interrupts their own handler

At the moment the handler for hypervisor decrementer interrupts is
the same as for decrementer interrupts, i.e. timer_interrupt().
This is bogus; if we ever do get a hypervisor decrementer interrupt
it won't have anything to do with the next timer event. In fact
the only time we get hypervisor decrementer interrupts is when one
is left pending on exit from a KVM guest.

When we get a hypervisor decrementer interrupt we don't need to do
anything special to clear it, since they are edge-triggered on the
transition of HDEC from 0 to -1. Thus this adds an empty handler
function for them. We don't need to have them masked when interrupts
are soft-disabled, so we use STD_EXCEPTION_HV instead of
MASKABLE_EXCEPTION_HV.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# baa36046 18-Jun-2012 Frederic Weisbecker <fweisbec@gmail.com>

cputime: Consolidate vtime handling on context switch

The archs that implement virtual cputime accounting all
flush the cputime of a task when it gets descheduled
and sometimes set up some ground initialization for the
next task to account its cputime.

These archs all put their own hooks in their context
switch callbacks and handle the off-case themselves.

Consolidate this by creating a new account_switch_vtime()
callback called in generic code right after a context switch
and that these archs must implement to flush the prev task
cputime and initialize the next task cputime related state.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>


# 860aed25 01-Jun-2012 Paul Mackerras <paulus@samba.org>

powerpc/time: Sanity check of decrementer expiration is necessary

This reverts 68568add2c ("powerpc/time: Remove unnecessary sanity check
of decrementer expiration"). We do need to check whether we have reached
the expiration time of the next event, because we sometimes get an early
decrementer interrupt, most notably when we set the decrementer to 1 in
arch_irq_work_raise(). The effect of not having the sanity check is that
if timer_interrupt() gets called early, we leave the decrementer set to
its maximum value, which means we then don't get any more decrementer
interrupts for about 4 seconds (or longer, depending on timebase
frequency). I saw these pauses as a consequence of getting a stray
hypervisor decrementer interrupt left over from exiting a KVM guest.

This isn't quite a straight revert because of changes to the surrounding
code, but it restores the same algorithm as was previously used.

Cc: stable@vger.kernel.org
Acked-by: Anton Blanchard <anton@samba.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 6e35994d 18-Apr-2012 Bharat Bhushan <r65777@freescale.com>

KVM: PPC: Use clockevent multiplier and shifter for decrementer

Time for which the hrtimer is started for decrementer emulation is calculated
using tb_ticks_per_usec. While hrtimer uses the clockevent for DEC
reprogramming (if needed) and which calculate timebase ticks using the
multiplier and shifter mechanism implemented within clockevent layer.

It was observed that this conversion (timebase->time->timebase) are not
correct because the mechanism are not consistent.
In our setup it adds 2% jitter.

With this patch clockevent multiplier and shifter mechanism are used when
starting hrtimer for decrementer emulation. Now the jitter is < 0.5%.

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>


# f5339277 15-Mar-2012 Stephen Rothwell <sfr@canb.auug.org.au>

powerpc: Remove FW_FEATURE ISERIES from arch code

This is no longer selectable, so just remove all the dependent code.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 7230c564 06-Mar-2012 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc: Rework lazy-interrupt handling

The current implementation of lazy interrupts handling has some
issues that this tries to address.

We don't do the various workarounds we need to do when re-enabling
interrupts in some cases such as when returning from an interrupt
and thus we may still lose or get delayed decrementer or doorbell
interrupts.

The current scheme also makes it much harder to handle the external
"edge" interrupts provided by some BookE processors when using the
EPR facility (External Proxy) and the Freescale Hypervisor.

Additionally, we tend to keep interrupts hard disabled in a number
of cases, such as decrementer interrupts, external interrupts, or
when a masked decrementer interrupt is pending. This is sub-optimal.

This is an attempt at fixing it all in one go by reworking the way
we do the lazy interrupt disabling from the ground up.

The base idea is to replace the "hard_enabled" field with a
"irq_happened" field in which we store a bit mask of what interrupt
occurred while soft-disabled.

When re-enabling, either via arch_local_irq_restore() or when returning
from an interrupt, we can now decide what to do by testing bits in that
field.

We then implement replaying of the missed interrupts either by
re-using the existing exception frame (in exception exit case) or via
the creation of a new one from an assembly trampoline (in the
arch_local_irq_enable case).

This removes the need to play with the decrementer to try to create
fake interrupts, among others.

In addition, this adds a few refinements:

- We no longer hard disable decrementer interrupts that occur
while soft-disabled. We now simply bump the decrementer back to max
(on BookS) or leave it stopped (on BookE) and continue with hard interrupts
enabled, which means that we'll potentially get better sample quality from
performance monitor interrupts.

- Timer, decrementer and doorbell interrupts now hard-enable
shortly after removing the source of the interrupt, which means
they no longer run entirely hard disabled. Again, this will improve
perf sample quality.

- On Book3E 64-bit, we now make the performance monitor interrupt
act as an NMI like Book3S (the necessary C code for that to work
appear to already be present in the FSL perf code, notably calling
nmi_enter instead of irq_enter). (This also fixes a bug where BookE
perfmon interrupts could clobber r14 ... oops)

- We could make "masked" decrementer interrupts act as NMIs when doing
timer-based perf sampling to improve the sample quality.

Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---

v2:

- Add hard-enable to decrementer, timer and doorbells
- Fix CR clobber in masked irq handling on BookE
- Make embedded perf interrupt act as an NMI
- Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want
to retrigger an interrupt without preventing hard-enable

v3:

- Fix or vs. ori bug on Book3E
- Fix enabling of interrupts for some exceptions on Book3E

v4:

- Fix resend of doorbells on return from interrupt on Book3E

v5:

- Rebased on top of my latest series, which involves some significant
rework of some aspects of the patch.

v6:
- 32-bit compile fix
- more compile fixes with various .config combos
- factor out the asm code to soft-disable interrupts
- remove the C wrapper around preempt_schedule_irq

v7:
- Fix a bug with hard irq state tracking on native power7


# 9f5072d4 09-Dec-2011 Andreas Schwab <schwab@linux-m68k.org>

powerpc: Fix wrong divisor in usecs_to_cputime

Commit d57af9b (taskstats: use real microsecond granularity for CPU times)
renamed msecs_to_cputime to usecs_to_cputime, but failed to update all
numbers on the way. This causes nonsensical cpu idle/iowait values to be
displayed in /proc/stat (the only user of usecs_to_cputime so far).

This also renames __cputime_msec_factor to __cputime_usec_factor, adapting
its value and using it directly in cputime_to_usecs instead of doing two
multiplications.

Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
Acked-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 7df10275 23-Nov-2011 Anton Blanchard <anton@samba.org>

powerpc/time: Optimise decrementer_check_overflow

decrementer_check_overflow is called from arch_local_irq_restore so
we want to make it as light weight as possible. As such, turn
decrementer_check_overflow into an inline function.

To avoid a circular mess of includes, separate out the two components
of struct decrementer_clock and keep the struct clock_event_device
part local to time.c.

The fast path improves from:

arch_local_irq_restore
0: mflr r0
4: std r0,16(r1)
8: stdu r1,-112(r1)
c: stb r3,578(r13)
10: cmpdi cr7,r3,0
14: beq- cr7,24 <.arch_local_irq_restore+0x24>
...
24: addi r1,r1,112
28: ld r0,16(r1)
2c: mtlr r0
30: blr

to:

arch_local_irq_restore
0: std r30,-16(r1)
4: ld r30,0(r2)
8: stb r3,578(r13)
c: cmpdi cr7,r3,0
10: beq- cr7,6c <.arch_local_irq_restore+0x6c>
...
6c: ld r30,-16(r1)
70: blr

Unfortunately we still setup a local TOC (due to -mminimal-toc). Yet
another sign we should be moving to -mcmodel=medium.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 621692cb 23-Nov-2011 Anton Blanchard <anton@samba.org>

powerpc/time: Fix some style issues

Fix some formatting issues and use the DECREMENTER_MAX
define instead of 0x7fffffff.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 68568add 23-Nov-2011 Anton Blanchard <anton@samba.org>

powerpc/time: Remove unnecessary sanity check of decrementer expiration

The clockevents code uses max_delta_ns to avoid calling a
clockevent with too large a value.

Remove the redundant version of this in the timer_interrupt
code.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 11b8633a 23-Nov-2011 Anton Blanchard <anton@samba.org>

powerpc/time: Use clocksource_register_hz

Use clocksource_register_hz which calculates the shift/mult
factors for us. Also remove the shift = 22 assumption in
vsyscall_update - thanks to Paul Mackerras and John Stultz for
catching that.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# d8afc6fd 23-Nov-2011 Anton Blanchard <anton@samba.org>

powerpc/time: Use clockevents_calc_mult_shift

We can use clockevents_calc_mult_shift instead of doing all
the work ourselves.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 37fb9a02 23-Nov-2011 Anton Blanchard <anton@samba.org>

powerpc/time: Handle wrapping of decrementer

When re-enabling interrupts we have code to handle edge sensitive
decrementers by resetting the decrementer to 1 whenever it is negative.
If interrupts were disabled long enough that the decrementer wrapped to
positive we do nothing. This means interrupts can be delayed for a long
time until it finally goes negative again.

While we hope interrupts are never be disabled long enough for the
decrementer to go positive, we have a very good test team that can
drive any kernel into the ground. The softlockup data we get back
from these fails could be seconds in the future, completely missing
the cause of the lockup.

We already keep track of the timebase of the next event so use that
to work out if we should trigger a decrementer exception.

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 4b16f8e2 22-Jul-2011 Paul Gortmaker <paul.gortmaker@windriver.com>

powerpc: various straight conversions from module.h --> export.h

All these files were including module.h just for the basic
EXPORT_SYMBOL infrastructure. We can shift them off to the
export.h header which is a way smaller footprint and thus
realize some compile time gains.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>


# 4f8b50bb 27-Jun-2011 Peter Zijlstra <a.p.zijlstra@chello.nl>

irq_work, ppc: Fix up arch hooks

Commit e360adbe29 ("irq_work: Add generic hardirq context
callbacks") fouled up the ppc bit, not properly naming the
arch specific function that raises the 'self-IPI'.

Cc: Huang Ying <ying.huang@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Anton Blanchard <anton@samba.org>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: stable@kernel.org # 37+
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-eg0aqien8p1aqvzu9dft6dtv@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 84ffae55 07-Apr-2011 Anton Blanchard <anton@samba.org>

powerpc: Fix oops if scan_dispatch_log is called too early

We currently enable interrupts before the dispatch log for the boot
cpu is setup. If a timer interrupt comes in early enough we oops in
scan_dispatch_log:

Unable to handle kernel paging request for data at address 0x00000010

...

.scan_dispatch_log+0xb0/0x170
.account_system_vtime+0xa0/0x220
.irq_enter+0x88/0xc0
.do_IRQ+0x48/0x230

The patch below adds a check to scan_dispatch_log to ensure the
dispatch log has been allocated.

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: <stable@kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 963e5d3b 28-Mar-2011 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc: Make decrementer interrupt robust against offlined CPUs

With some implementations, it is possible that a timer interrupt
occurs every few seconds on an offline CPU. In this case, just
re-arm the decrementer and return immediately

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# ad5d1c88 20-Mar-2011 Anton Blanchard <anton@samba.org>

powerpc: Fix accounting of softirq time when idle

commit cf9efce0ce31 (powerpc: Account time using timebase rather
than PURR) used in_irq() to detect if the time was spent in
interrupt processing. This only catches hardirq context so if we
are in softirq context and in the idle loop we end up accounting it
as idle time. If we instead use in_interrupt() we catch both softirq
and hardirq time.

The issue was found when running a network intensive workload. top
showed the following:

0.0%us, 1.1%sy, 0.0%ni, 85.7%id, 0.0%wa, 9.9%hi, 3.3%si, 0.0%st

85.7% idle. But this was wildly different to the perf events data.
To confirm the suspicion I ran something to keep the core busy:

# yes > /dev/null &

8.2%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 10.3%hi, 81.4%si, 0.0%st

We only got 8.2% of the CPU for the userspace task and softirq has
shot up to 81.4%.

With the patch below top shows the correct stats:

0.0%us, 0.0%sy, 0.0%ni, 5.3%id, 0.0%wa, 13.3%hi, 81.3%si, 0.0%st

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# b18ae08d 02-Jan-2011 Tejun Heo <tj@kernel.org>

powerpc/cell: Use system_wq in cpufreq_spudemand

With cmwq, there's no reason to use a separate workqueue in
cpufreq_spudemand. Use system_wq instead. The work items are already
sync canceled on stop, so it's already guaranteed that no work is
running when spu_gov_exit() is entered.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Dave Jones <davej@redhat.com>
Cc: cpufreq@vger.kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 364a1246 22-Nov-2010 Heiko Schocher <hs@denx.de>

powerpc/time: printk time stamp init not correct

problem:

I see sometimes on my mpc5200 based board such printk timing
information:

[ 0.000000] NR_IRQS:512 nr_irqs:512 16
[ 0.000000] MPC52xx PIC is up and running!
[ 0.000000] clocksource: timebase mult[79364d9] shift[22] registered
[ 0.000000] console [ttyPSC0] enabled
[ 130.300633] pid_max: default: 32768 minimum: 301
[ 130.305647] Mount-cache hash table entries: 512
[ 130.315818] NET: Registered protocol family 16

reason:
if the tbu not starts from 0 when linux boots, boot_tb
maybe could not store the real 64 bit tbu value, because
boot_tp is only a 32 bit unsigned long.

solution:
change boot_tb to u64

[BenH: Made it u64 instead of unsigned long long]

Signed-off-by: Heiko Schocher <hs@denx.de>
cc: Wolfgang Denk <wd@denx.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# e360adbe 14-Oct-2010 Peter Zijlstra <a.p.zijlstra@chello.nl>

irq_work: Add generic hardirq context callbacks

Provide a mechanism that allows running code in IRQ context. It is
most useful for NMI code that needs to interact with the rest of the
system -- like wakeup a task to drain buffers.

Perf currently has such a mechanism, so extract that and provide it as
a generic feature, independent of perf so that others may also
benefit.

The IRQ context callback is generated through self-IPIs where
possible, or on architectures like powerpc the decrementer (the
built-in timer facility) is set to generate an interrupt immediately.

Architectures that don't have anything like this get to do with a
callback from the timer tick. These architectures can call
irq_work_run() at the tail of any IRQ handlers that might enqueue such
work (like the perf IRQ handler) to avoid undue latencies in
processing the work.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Kyle McMartin <kyle@mcmartin.ca>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
[ various fixes ]
Signed-off-by: Huang Ying <ying.huang@intel.com>
LKML-Reference: <1287036094.7768.291.camel@yhuang-dev>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 55ec2fca 20-Sep-2010 Timur Tabi <timur@freescale.com>

powerpc: export ppc_proc_freq and ppc_tb_freq as GPL symbols

Export the global variable 'ppc_tb_freq', so that modules (like the Book-E
watchdog driver) can use it. To maintain consistency, ppc_proc_freq is
changed to a GPL-only export. This is okay, because any module that needs
this symbol should be an actual Linux driver, which must be GPL-licensed.

Signed-off-by: Timur Tabi <timur@freescale.com>
Acked-by: Josh Boyer <jwboyer@linux.vnet.ibm.com>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>


# 872e439a 30-Aug-2010 Paul Mackerras <paulus@samba.org>

powerpc/pseries: Re-enable dispatch trace log userspace interface

Since the cpu accounting code uses the hypervisor dispatch trace log
now when CONFIG_VIRT_CPU_ACCOUNTING = y, the previous commit disabled
access to it via files in the /sys/kernel/debug/powerpc/dtl/ directory
in that case. This restores those files.

To do this, we now have a hook that the cpu accounting code will call
as it processes each entry from the hypervisor dispatch trace log.
The code in dtl.c now uses that to fill up its ring buffer, rather
than having the hypervisor fill the ring buffer directly.

This also fixes dtl_file_read() to handle overflow conditions a bit
better and adds a spinlock to ensure that race conditions (multiple
processes opening or reading the file concurrently) are handled
correctly.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# cf9efce0 26-Aug-2010 Paul Mackerras <paulus@samba.org>

powerpc: Account time using timebase rather than PURR

Currently, when CONFIG_VIRT_CPU_ACCOUNTING is enabled, we use the
PURR register for measuring the user and system time used by
processes, as well as other related times such as hardirq and
softirq times. This turns out to be quite confusing for users
because it means that a program will often be measured as taking
less time when run on a multi-threaded processor (SMT2 or SMT4 mode)
than it does when run on a single-threaded processor (ST mode), even
though the program takes longer to finish. The discrepancy is
accounted for as stolen time, which is also confusing, particularly
when there are no other partitions running.

This changes the accounting to use the timebase instead, meaning that
the reported user and system times are the actual number of real-time
seconds that the program was executing on the processor thread,
regardless of which SMT mode the processor is in. Thus a program will
generally show greater user and system times when run on a
multi-threaded processor than on a single-threaded processor.

On pSeries systems on POWER5 or later processors, we measure the
stolen time (time when this partition wasn't running) using the
hypervisor dispatch trace log. We check for new entries in the
log on every entry from user mode and on every transition from
kernel process context to soft or hard IRQ context (i.e. when
account_system_vtime() gets called). So that we can correctly
distinguish time stolen from user time and time stolen from system
time, without having to check the log on every exit to user mode,
we store separate timestamps for exit to user mode and entry from
user mode.

On systems that have a SPURR (POWER6 and POWER7), we read the SPURR
in account_system_vtime() (as before), and then apportion the SPURR
ticks since the last time we read it between scaled user time and
scaled system time according to the relative proportions of user
time and system time over the same interval. This avoids having to
read the SPURR on every kernel entry and exit. On systems that have
PURR but not SPURR (i.e., POWER5), we do the same using the PURR
rather than the SPURR.

This disables the DTL user interface in /sys/debug/kernel/powerpc/dtl
for now since it conflicts with the use of the dispatch trace log
by the time accounting code.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# b0d278b7 10-Aug-2010 Paul Mackerras <paulus@samba.org>

powerpc/perf_event: Reduce latency of calling perf_event_do_pending

Commit 0fe1ac48 ("powerpc/perf_event: Fix oops due to
perf_event_do_pending call") moved the call to perf_event_do_pending
in timer_interrupt() down so that it was after the irq_enter() call.
Unfortunately this moved it after the code that checks whether it
is time for the next decrementer clock event. The result is that
the call to perf_event_do_pending() won't happen until the next
decrementer clock event is due. This was pointed out by Milton
Miller.

This fixes it by moving the check for whether it's time for the
next decrementer clock event down to the point where we're about
to call the event handler, after we've called perf_event_do_pending.

This has the side effect that on old pre-Core99 Powermacs where we
use the ppc_n_lost_interrupts mechanism to replay interrupts, a
replayed interrupt will incur a little more latency since it will
now do the code from the irq_enter down to the irq_exit, that it
used to skip. However, these machines are now old and rare enough
that this doesn't matter. To make it clear that ppc_n_lost_interrupts
is only used on Powermacs, and to speed up the code slightly on
non-Powermac ppc32 machines, the code that tests ppc_n_lost_interrupts
is now conditional on CONFIG_PMAC as well as CONFIG_PPC32.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: stable@kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# d75d68cf 20-Jun-2010 Paul Mackerras <paulus@samba.org>

powerpc: Clean up obsolete code relating to decrementer and timebase

Since the decrementer and timekeeping code was moved over to using
the generic clockevents and timekeeping infrastructure, several
variables and functions have been obsolete and effectively unused.
This deletes them.

In particular, wakeup_decrementer() is no longer needed since the
generic code reprograms the decrementer as part of the process of
resuming the timekeeping code, which happens during sysdev resume.
Thus the wakeup_decrementer calls in the suspend_enter methods for
52xx platforms have been removed. The call in the powermac cpu
frequency change code has been replaced by set_dec(1), which will
cause a timer interrupt as soon as interrupts are enabled, and the
generic code will then reprogram the decrementer with the correct
value.

This also simplifies the generic_suspend_en/disable_irqs functions
and makes them static since they are not referenced outside time.c.
The preempt_enable/disable calls are removed because the generic
code has disabled all but the boot cpu at the point where these
functions are called, so we can't be moved to another cpu.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 0e469db8 20-Jun-2010 Paul Mackerras <paulus@samba.org>

powerpc: Rework VDSO gettimeofday to prevent time going backwards

Currently it is possible for userspace to see the result of
gettimeofday() going backwards by 1 microsecond, assuming that
userspace is using the gettimeofday() in the VDSO. The VDSO
gettimeofday() algorithm computes the time in "xsecs", which are
units of 2^-20 seconds, or approximately 0.954 microseconds,
using the algorithm

now = (timebase - tb_orig_stamp) * tb_to_xs + stamp_xsec

and then converts the time in xsecs to seconds and microseconds.

The kernel updates the tb_orig_stamp and stamp_xsec values every
tick in update_vsyscall(). If the length of the tick is not an
integer number of xsecs, then some precision is lost in converting
the current time to xsecs. For example, with CONFIG_HZ=1000, the
tick is 1ms long, which is 1048.576 xsecs. That means that
stamp_xsec will advance by either 1048 or 1049 on each tick.
With the right conditions, it is possible for userspace to get
(timebase - tb_orig_stamp) * tb_to_xs being 1049 if the kernel is
slightly late in updating the vdso_datapage, and then for stamp_xsec
to advance by 1048 when the kernel does update it, and for userspace
to then see (timebase - tb_orig_stamp) * tb_to_xs being zero due to
integer truncation. The result is that time appears to go backwards
by 1 microsecond.

To fix this we change the VDSO gettimeofday to use a new field in the
VDSO datapage which stores the nanoseconds part of the time as a
fractional number of seconds in a 0.32 binary fraction format.
(Or put another way, as a 32-bit number in units of 0.23283 ns.)
This is convenient because we can use the mulhwu instruction to
convert it to either microseconds or nanoseconds.

Since it turns out that computing the time of day using this new field
is simpler than either using stamp_xsec (as gettimeofday does) or
stamp_xtime.tv_nsec (as clock_gettime does), this converts both
gettimeofday and clock_gettime to use the new field. The existing
__do_get_tspec function is converted to use the new field and take
a parameter in r7 that indicates the desired resolution, 1,000,000
for microseconds or 1,000,000,000 for nanoseconds. The __do_get_xsec
function is then unused and is deleted.

The new algorithm is

now = ((timebase - tb_orig_stamp) << 12) * tb_to_xs
+ (stamp_xtime_seconds << 32) + stamp_sec_fraction

with 'now' in units of 2^-32 seconds. That is then converted to
seconds and either microseconds or nanoseconds with

seconds = now >> 32
partseconds = ((now & 0xffffffff) * resolution) >> 32

The 32-bit VDSO code also makes a further simplification: it ignores
the bottom 32 bits of the tb_to_xs value, which is a 0.64 format binary
fraction. Doing so gets rid of 4 multiply instructions. Assuming
a timebase frequency of 1GHz or less and an update interval of no
more than 10ms, the upper 32 bits of tb_to_xs will be at least
4503599, so the error from ignoring the low 32 bits will be at most
2.2ns, which is more than an order of magnitude less than the time
taken to do gettimeofday or clock_gettime on our fastest processors,
so there is no possibility of seeing inconsistent values due to this.

This also moves update_gtod() down next to its only caller, and makes
update_vsyscall use the time passed in via the wall_time argument rather
than accessing xtime directly. At present, wall_time always points to
xtime, but that could change in future.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 7615856e 13-Jul-2010 John Stultz <johnstul@us.ibm.com>

timkeeping: Fix update_vsyscall to provide wall_to_monotonic offset

update_vsyscall() did not provide the wall_to_monotoinc offset,
so arch specific implementations tend to reference wall_to_monotonic
directly. This limits future cleanups in the timekeeping core, so
this patch fixes the update_vsyscall interface to provide
wall_to_monotonic, allowing wall_to_monotonic to be made static
as planned in Documentation/feature-removal-schedule.txt

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Tony Luck <tony.luck@intel.com>
LKML-Reference: <1279068988-21864-7-git-send-email-johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 06d518e3 13-Jul-2010 John Stultz <johnstul@us.ibm.com>

powerpc: Cleanup xtime usage

This removes powerpc's direct xtime usage, allowing for further
generic timeekeping cleanups

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Anton Blanchard <anton@samba.org>
LKML-Reference: <1279068988-21864-6-git-send-email-johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# b0797b60 13-Jul-2010 John Stultz <johnstul@us.ibm.com>

powerpc: Simplify update_vsyscall

Currently powerpc's update_vsyscall calls an inline update_gtod.
However, both are straightforward, and there are no other users,
so this patch merges update_gtod into update_vsyscall.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <1279068988-21864-5-git-send-email-johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# c1aa687d 20-Jun-2010 Paul Mackerras <paulus@samba.org>

powerpc: Clean up obsolete code relating to decrementer and timebase

Since the decrementer and timekeeping code was moved over to using
the generic clockevents and timekeeping infrastructure, several
variables and functions have been obsolete and effectively unused.
This deletes them.

In particular, wakeup_decrementer() is no longer needed since the
generic code reprograms the decrementer as part of the process of
resuming the timekeeping code, which happens during sysdev resume.
Thus the wakeup_decrementer calls in the suspend_enter methods for
52xx platforms have been removed. The call in the powermac cpu
frequency change code has been replaced by set_dec(1), which will
cause a timer interrupt as soon as interrupts are enabled, and the
generic code will then reprogram the decrementer with the correct
value.

This also simplifies the generic_suspend_en/disable_irqs functions
and makes them static since they are not referenced outside time.c.
The preempt_enable/disable calls are removed because the generic
code has disabled all but the boot cpu at the point where these
functions are called, so we can't be moved to another cpu.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 8fd63a9e 20-Jun-2010 Paul Mackerras <paulus@samba.org>

powerpc: Rework VDSO gettimeofday to prevent time going backwards

Currently it is possible for userspace to see the result of
gettimeofday() going backwards by 1 microsecond, assuming that
userspace is using the gettimeofday() in the VDSO. The VDSO
gettimeofday() algorithm computes the time in "xsecs", which are
units of 2^-20 seconds, or approximately 0.954 microseconds,
using the algorithm

now = (timebase - tb_orig_stamp) * tb_to_xs + stamp_xsec

and then converts the time in xsecs to seconds and microseconds.

The kernel updates the tb_orig_stamp and stamp_xsec values every
tick in update_vsyscall(). If the length of the tick is not an
integer number of xsecs, then some precision is lost in converting
the current time to xsecs. For example, with CONFIG_HZ=1000, the
tick is 1ms long, which is 1048.576 xsecs. That means that
stamp_xsec will advance by either 1048 or 1049 on each tick.
With the right conditions, it is possible for userspace to get
(timebase - tb_orig_stamp) * tb_to_xs being 1049 if the kernel is
slightly late in updating the vdso_datapage, and then for stamp_xsec
to advance by 1048 when the kernel does update it, and for userspace
to then see (timebase - tb_orig_stamp) * tb_to_xs being zero due to
integer truncation. The result is that time appears to go backwards
by 1 microsecond.

To fix this we change the VDSO gettimeofday to use a new field in the
VDSO datapage which stores the nanoseconds part of the time as a
fractional number of seconds in a 0.32 binary fraction format.
(Or put another way, as a 32-bit number in units of 0.23283 ns.)
This is convenient because we can use the mulhwu instruction to
convert it to either microseconds or nanoseconds.

Since it turns out that computing the time of day using this new field
is simpler than either using stamp_xsec (as gettimeofday does) or
stamp_xtime.tv_nsec (as clock_gettime does), this converts both
gettimeofday and clock_gettime to use the new field. The existing
__do_get_tspec function is converted to use the new field and take
a parameter in r7 that indicates the desired resolution, 1,000,000
for microseconds or 1,000,000,000 for nanoseconds. The __do_get_xsec
function is then unused and is deleted.

The new algorithm is

now = ((timebase - tb_orig_stamp) << 12) * tb_to_xs
+ (stamp_xtime_seconds << 32) + stamp_sec_fraction

with 'now' in units of 2^-32 seconds. That is then converted to
seconds and either microseconds or nanoseconds with

seconds = now >> 32
partseconds = ((now & 0xffffffff) * resolution) >> 32

The 32-bit VDSO code also makes a further simplification: it ignores
the bottom 32 bits of the tb_to_xs value, which is a 0.64 format binary
fraction. Doing so gets rid of 4 multiply instructions. Assuming
a timebase frequency of 1GHz or less and an update interval of no
more than 10ms, the upper 32 bits of tb_to_xs will be at least
4503599, so the error from ignoring the low 32 bits will be at most
2.2ns, which is more than an order of magnitude less than the time
taken to do gettimeofday or clock_gettime on our fastest processors,
so there is no possibility of seeing inconsistent values due to this.

This also moves update_gtod() down next to its only caller, and makes
update_vsyscall use the time passed in via the wall_time argument rather
than accessing xtime directly. At present, wall_time always points to
xtime, but that could change in future.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 0fe1ac48 13-Apr-2010 Paul Mackerras <paulus@samba.org>

powerpc/perf_event: Fix oops due to perf_event_do_pending call

Anton Blanchard found that large POWER systems would occasionally
crash in the exception exit path when profiling with perf_events.
The symptom was that an interrupt would occur late in the exit path
when the MSR[RI] (recoverable interrupt) bit was clear. Interrupts
should be hard-disabled at this point but they were enabled. Because
the interrupt was not recoverable the system panicked.

The reason is that the exception exit path was calling
perf_event_do_pending after hard-disabling interrupts, and
perf_event_do_pending will re-enable interrupts.

The simplest and cleanest fix for this is to use the same mechanism
that 32-bit powerpc does, namely to cause a self-IPI by setting the
decrementer to 1. This means we can remove the tests in the exception
exit path and raw_local_irq_restore.

This also makes sure that the call to perf_event_do_pending from
timer_interrupt() happens within irq_enter/irq_exit. (Note that
calling perf_event_do_pending from timer_interrupt does not mean that
there is a possible 1/HZ latency; setting the decrementer to 1 ensures
that the timer interrupt will happen immediately, i.e. within one
timebase tick, which is a few nanoseconds or 10s of nanoseconds.)

Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: stable@kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 89713ed1 31-Jan-2010 Anton Blanchard <anton@samba.org>

powerpc: Add timer, performance monitor and machine check counts to /proc/interrupts

With NO_HZ it is useful to know how often the decrementer is going off. The
patch below adds an entry for it and also adds it into the /proc/stat
summaries.

While here, I added performance monitoring and machine check exceptions.
I found it useful to keep an eye on the PMU exception rate
when using the perf tool. Since it's possible to take a completely
handled machine check on a System p box it also sounds like a good idea to
keep a machine check summary.

The event naming matches x86 to keep gratuitous differences to a minimum.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# b919ee82 07-Feb-2010 Anton Blanchard <anton@samba.org>

powerpc: Only print clockevent settings once

The clockevent multiplier and shift is useful information, but we
only need to print it once.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 61c03ddb 12-Jan-2010 Anton Blanchard <anton@samba.org>

powerpc: Replace per_cpu(, smp_processor_id()) with __get_cpu_var()

The cputime code has a few places that do per_cpu(, smp_processor_id()).
Replace them with __get_cpu_var().

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 3e7b4843 11-Jan-2010 Stefan Roese <sr@denx.de>

powerpc: Fix decrementer setup on 1GHz boards

We noticed that recent kernels didn't boot on our 1GHz Canyonlands 460EX
boards anymore. As it seems, patch 8d165db1 [powerpc: Improve
decrementer accuracy] introduced this problem. The routine div_sc()
overflows with shift = 32 resulting in this incorrect setup:

time_init: decrementer frequency = 1000.000012 MHz
time_init: processor frequency = 1000.000012 MHz
clocksource: timebase mult[400000] shift[22] registered
clockevent: decrementer mult[33] shift[32] cpu[0]

This patch now introduces a local div_dc64() version of this function
so that this overflow doesn't happen anymore.

Signed-off-by: Stefan Roese <sr@denx.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Detlev Zundel <dzu@denx.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 0696b711 16-Nov-2009 Lin Ming <ming.m.lin@intel.com>

timekeeping: Fix clock_gettime vsyscall time warp

Since commit 0a544198 "timekeeping: Move NTP adjusted clock multiplier
to struct timekeeper" the clock multiplier of vsyscall is updated with
the unmodified clock multiplier of the clock source and not with the
NTP adjusted multiplier of the timekeeper.

This causes user space observerable time warps:
new CLOCK-warp maximum: 120 nsecs, 00000025c337c537 -> 00000025c337c4bf

Add a new argument "mult" to update_vsyscall() and hand in the
timekeeping internal NTP adjusted multiplier.

Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Cc: "Zhang Yanmin" <yanmin_zhang@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Tony Luck <tony.luck@intel.com>
LKML-Reference: <1258436990.17765.83.camel@minggr.sh.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# a362c638 13-Nov-2009 Thomas Gleixner <tglx@linutronix.de>

clocksource/events: Fix fallout of generic code changes

powerpc grew a new warning due to the type change of clockevent->mult.

The architectures which use parts of the generic time keeping
infrastructure tripped over my wrong assumption that
clocksource_register is only used when GENERIC_TIME=y.

I should have looked and also I should have known better. These
renitent Gaul villages are racking my nerves. Some serious deprecating
is due.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 978d7eb3 01-Nov-2009 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc: Avoid giving out RTC dates below EPOCH

Doing so causes xtime to be negative which crashes the timekeeping
code in funny ways when doing suspend/resume

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 4ab79aa8 29-Oct-2009 Alexander Graf <agraf@suse.de>

Export symbols for KVM module

We want to be able to build KVM as a module. To enable us doing so, we
need some more exports from core Linux parts.

This patch exports all functions and variables that are required for KVM.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 6795b85c 26-Oct-2009 Anton Blanchard <anton@samba.org>

powerpc: tracing: Add powerpc tracepoints for timer entry and exit

We can monitor the effectiveness of our power management of both the
kernel and hypervisor by probing the timer interrupt. For example, on
this box we see 10.37s timer interrupts on an idle core:

<idle>-0 [010] 3900.671297: timer_interrupt_entry: pt_regs=c0000000ce1e7b10
<idle>-0 [010] 3900.671302: timer_interrupt_exit: pt_regs=c0000000ce1e7b10

<idle>-0 [010] 3911.042963: timer_interrupt_entry: pt_regs=c0000000ce1e7b10
<idle>-0 [010] 3911.042968: timer_interrupt_exit: pt_regs=c0000000ce1e7b10

<idle>-0 [010] 3921.414630: timer_interrupt_entry: pt_regs=c0000000ce1e7b10
<idle>-0 [010] 3921.414635: timer_interrupt_exit: pt_regs=c0000000ce1e7b10

Since we have a 207MHz decrementer it will go negative and fire every 10.37s
even if Linux is completely idle.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# cdd6c482 20-Sep-2009 Ingo Molnar <mingo@elte.hu>

perf: Do the big rename: Performance Counters -> Performance Events

Bye-bye Performance Counters, welcome Performance Events!

In the past few months the perfcounters subsystem has grown out its
initial role of counting hardware events, and has become (and is
becoming) a much broader generic event enumeration, reporting, logging,
monitoring, analysis facility.

Naming its core object 'perf_counter' and naming the subsystem
'perfcounters' has become more and more of a misnomer. With pending
code like hw-breakpoints support the 'counter' name is less and
less appropriate.

All in one, we've decided to rename the subsystem to 'performance
events' and to propagate this rename through all fields, variables
and API names. (in an ABI compatible fashion)

The word 'event' is also a bit shorter than 'counter' - which makes
it slightly more convenient to write/handle as well.

Thanks goes to Stephane Eranian who first observed this misnomer and
suggested a rename.

User-space tooling and ABI compatibility is not affected - this patch
should be function-invariant. (Also, defconfigs were not touched to
keep the size down.)

This patch has been generated via the following script:

FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')

sed -i \
-e 's/PERF_EVENT_/PERF_RECORD_/g' \
-e 's/PERF_COUNTER/PERF_EVENT/g' \
-e 's/perf_counter/perf_event/g' \
-e 's/nb_counters/nb_events/g' \
-e 's/swcounter/swevent/g' \
-e 's/tpcounter_event/tp_event/g' \
$FILES

for N in $(find . -name perf_counter.[ch]); do
M=$(echo $N | sed 's/perf_counter/perf_event/g')
mv $N $M
done

FILES=$(find . -name perf_event.*)

sed -i \
-e 's/COUNTER_MASK/REG_MASK/g' \
-e 's/COUNTER/EVENT/g' \
-e 's/\<event\>/event_id/g' \
-e 's/counter/event/g' \
-e 's/Counter/Event/g' \
$FILES

... to keep it as correct as possible. This script can also be
used by anyone who has pending perfcounters patches - it converts
a Linux kernel tree over to the new naming. We tried to time this
change to the point in time where the amount of pending patches
is the smallest: the end of the merge window.

Namespace clashes were fixed up in a preparatory patch - and some
stylistic fallout will be fixed up in a subsequent patch.

( NOTE: 'counters' are still the proper terminology when we deal
with hardware registers - and these sed scripts are a bit
over-eager in renaming them. I've undone some of that, but
in case there's something left where 'counter' would be
better than 'event' we can undo that on an individual basis
instead of touching an otherwise nicely automated patch. )

Suggested-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 77c0a700 27-Aug-2009 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc: Properly start decrementer on BookE secondary CPUs

This moves the code to start the decrementer on 40x and BookE into
a separate function which is now called from time_init() and
secondary_time_init(), before the respective clock sources are
registered. We also remove the 85xx specific code for doing it
from the platform code.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# d90246cd 22-Aug-2009 Martin Schwidefsky <schwidefsky@de.ibm.com>

timekeeping: Increase granularity of read_persistent_clock(), build fix

Fix the following build problem on powerpc:

arch/powerpc/kernel/time.c: In function 'read_persistent_clock':
arch/powerpc/kernel/time.c:788: error: 'return' with a value, in function returning void
arch/powerpc/kernel/time.c:791: error: 'return' with a value, in function returning void

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: dwalker@fifo99.com
Cc: johnstul@us.ibm.com
LKML-Reference: <20090822222313.74b9619c@skybase>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 14ea58ad 01-Aug-2009 Julia Lawall <julia@diku.dk>

powerpc: Use DIV_ROUND_CLOSEST in time init code

The kernel.h macro DIV_ROUND_CLOSEST performs the computation (x + d/2)/d
but is perhaps more readable.

The semantic patch that makes this change is as follows:
(http://www.emn.fr/x-info/coccinelle/)

// <smpl>
@haskernel@
@@

#include <linux/kernel.h>

@depends on haskernel@
expression x,__divisor;
@@

- (((x) + ((__divisor) / 2)) / (__divisor))
+ DIV_ROUND_CLOSEST(x,__divisor)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# d4f587c6 14-Aug-2009 Martin Schwidefsky <schwidefsky@de.ibm.com>

timekeeping: Increase granularity of read_persistent_clock()

The persistent clock of some architectures (e.g. s390) have a
better granularity than seconds. To reduce the delta between the
host clock and the guest clock in a virtualized system change the
read_persistent_clock function to return a struct timespec.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Daniel Walker <dwalker@fifo99.com>
LKML-Reference: <20090814134811.013873340@de.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# a42548a1 28-Jul-2009 Stanislaw Gruszka <sgruszka@redhat.com>

cputime: Optimize jiffies_to_cputime(1)

For powerpc with CONFIG_VIRT_CPU_ACCOUNTING
jiffies_to_cputime(1) is not compile time constant and run time
calculations are quite expensive. To optimize we use
precomputed value. For all other architectures is is
preprocessor definition.

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
LKML-Reference: <1248862529-6063-5-git-send-email-sgruszka@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 105988c0 17-Jun-2009 Paul Mackerras <paulus@samba.org>

perf_counter: powerpc: Enable use of software counters on 32-bit powerpc

This enables the perf_counter subsystem on 32-bit powerpc. Since we
don't have any support for hardware counters on 32-bit powerpc yet,
only software counters can be used.

Besides selecting HAVE_PERF_COUNTERS for 32-bit powerpc as well as
64-bit, the main thing this does is add an implementation of
set_perf_counter_pending(). This needs to arrange for
perf_counter_do_pending() to be called when interrupts are enabled.
Rather than add code to local_irq_restore as 64-bit does, the 32-bit
set_perf_counter_pending() generates an interrupt by setting the
decrementer to 1 so that a decrementer interrupt will become pending
in 1 or 2 timebase ticks (if a decrementer interrupt isn't already
pending). When interrupts are enabled, timer_interrupt() will be
called, and some new code in there calls perf_counter_do_pending().
We use a per-cpu array of flags to indicate whether we need to call
perf_counter_do_pending() or not.

This introduces a couple of new Kconfig symbols: PPC_HAVE_PMU_SUPPORT,
which is selected by processor families for which we have hardware PMU
support (currently only PPC64), and PPC_PERF_CTRS, which enables the
powerpc-specific perf_counter back-end.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: linuxppc-dev@ozlabs.org
Cc: benh@kernel.crashing.org
LKML-Reference: <19000.55404.103840.393470@cargo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 177996e6 09-Jun-2009 Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc: Don't do generic calibrate_delay()

Currently we are wasting time calling the generic calibrate_delay()
function. We don't need it since our implementation of __delay() is
based on the CPU timebase. So instead, we use our own small
implementation that initializes loops_per_jiffy to something sensible
to make the few users like spinlock debug be happy

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 8d165db1 10-May-2009 Anton Blanchard <anton@samba.org>

powerpc: Improve decrementer accuracy

I have been looking at sources of OS jitter and notice that after a long
NO_HZ idle period we wakeup too early:

relative time (us) event
timer irq exit
999946.405 timer irq entry
4.835 timer irq exit
21.685 timer irq entry
3.540 timer (tick_sched_timer) entry

Here we slept for just under a second then took a timer interrupt that did
nothing. 21.685 us later we wake up again and do the work.

We set a rather low shift value of 16 for the decrementer clockevent, which I
think is causing this issue. On this box we have a 207MHz decrementer and see:

clockevent: decrementer mult[3501] shift[16] cpu[0]

For calculations of large intervals this mult/shift combination could be
off by a significant amount. I notice the sparc code has a loop that iterates
to find a mult/shift combination that maximises the shift value while
keeping mult under 32bit. With the patch below we get:

clockevent: decrementer mult[35015c20] shift[32] cpu[15]

And we no longer see the spurious wakeups.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 8e19608e 21-Apr-2009 Magnus Damm <damm@igel.co.jp>

clocksource: pass clocksource to read() callback

Pass clocksource pointer to the read() callback for clocksources. This
allows us to share the callback between multiple instances.

[hugh@veritas.com: fix powerpc build of clocksource pass clocksource mods]
[akpm@linux-foundation.org: cleanup]
Signed-off-by: Magnus Damm <damm@igel.co.jp>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bcd68a70 19-Feb-2009 Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>

powerpc: Hook up rtc-generic, and kill rtc-ppc

PowerPC has been a long time user of the generic RTC abstraction, so hook up
rtc-generic:
- Create the "rtc-generic" platform device if ppc_md.get_rtc_time is set,
- Kill rtc-ppc, as rtc-generic offers the same functionality in a more
generic way, and supports autoloading through udev.

Signed-off-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Acked-by: David Woodhouse <David.Woodhouse@intel.com>
Acked-by: Alessandro Zummo <a.zummo@towertech.it>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Kyle McMartin <kyle@mcmartin.ca>


# 79741dd3 31-Dec-2008 Martin Schwidefsky <schwidefsky@de.ibm.com>

[PATCH] idle cputime accounting

The cpu time spent by the idle process actually doing something is
currently accounted as idle time. This is plain wrong, the architectures
that support VIRT_CPU_ACCOUNTING=y can do better: distinguish between the
time spent doing nothing and the time spent by idle doing work. The first
is accounted with account_idle_time and the second with account_system_time.
The architectures that use the account_xxx_time interface directly and not
the account_xxx_ticks interface now need to do the check for the idle
process in their arch code. In particular to improve the system vs true
idle time accounting the arch code needs to measure the true idle time
instead of just testing for the idle process.
To improve the tick based accounting as well we would need an architecture
primitive that can tell us if the pt_regs of the interrupted context
points to the magic instruction that halts the cpu.

In addition idle time is no more added to the stime of the idle process.
This field now contains the system time of the idle process as it should
be. On systems without VIRT_CPU_ACCOUNTING this will always be zero as
every tick that occurs while idle is running will be accounted as idle
time.

This patch contains the necessary common code changes to be able to
distinguish idle system time and true idle time. The architectures with
support for VIRT_CPU_ACCOUNTING need some changes to exploit this.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 457533a7 31-Dec-2008 Martin Schwidefsky <schwidefsky@de.ibm.com>

[PATCH] fix scaled & unscaled cputime accounting

The utimescaled / stimescaled fields in the task structure and the
global cpustat should be set on all architectures. On s390 the calls
to account_user_time_scaled and account_system_time_scaled never have
been added. In addition system time that is accounted as guest time
to the user time of a process is accounted to the scaled system time
instead of the scaled user time.
To fix the bugs and to prevent future forgetfulness this patch merges
account_system_time_scaled into account_system_time and
account_user_time_scaled into account_user_time.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Michael Neuling <mikey@neuling.org>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 320ab2b0 13-Dec-2008 Rusty Russell <rusty@rustcorp.com.au>

cpumask: convert struct clock_event_device to cpumask pointers.

Impact: change calling convention of existing clock_event APIs

struct clock_event_timer's cpumask field gets changed to take pointer,
as does the ->broadcast function.

Another single-patch change. For safety, we BUG_ON() in
clockevents_register_device() if it's not set.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ingo Molnar <mingo@elte.hu>


# 3cc69878 27-Oct-2008 Paul Mackerras <paulus@samba.org>

powerpc: Eliminate unused do_gtod variable

Since we started using the generic timekeeping code, we haven't had a
powerpc-specific version of do_gettimeofday, and hence there is now
nothing that reads the do_gtod variable in arch/powerpc/kernel/time.c.
This therefore removes it and the code that sets it.

Signed-off-by: Paul Mackerras <paulus@samba.org>


# 597bc5c0 27-Oct-2008 Paul Mackerras <paulus@samba.org>

powerpc: Improve resolution of VDSO clock_gettime

Currently the clock_gettime implementation in the VDSO produces a
result with microsecond resolution for the cases that are handled
without a system call, i.e. CLOCK_REALTIME and CLOCK_MONOTONIC. The
nanoseconds field of the result is obtained by computing a
microseconds value and multiplying by 1000.

This changes the code in the VDSO to do the computation for
clock_gettime with nanosecond resolution. That means that the
resolution of the result will ultimately depend on the timebase
frequency.

Because the timestamp in the VDSO datapage (stamp_xsec, the real time
corresponding to the timebase count in tb_orig_stamp) is in units of
2^-20 seconds, it doesn't have sufficient resolution for computing a
result with nanosecond resolution. Therefore this adds a copy of
xtime to the VDSO datapage and updates it in update_gtod() along with
the other time-related fields.

Signed-off-by: Paul Mackerras <paulus@samba.org>


# ddb107e9 27-Jun-2008 Kumar Gala <galak@kernel.crashing.org>

powerpc/booke: don't reinitialize time base

For some reason long ago I decided that we should zero out the time base
when we calibrate the decrementer. The problem is that this can be
harmful in SMP systems where the firmware has already synchronized the
time bases on the various cores.

Signed-off-by: Kumar Gala <galak@kernel.crashing.org>


# 15c8b6c1 09-May-2008 Jens Axboe <jens.axboe@oracle.com>

on_each_cpu(): kill unused 'retry' parameter

It's not even passed on to smp_call_function() anymore, since that
was removed. So kill it.

Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 1c21a293 07-May-2008 Michael Ellerman <michael@ellerman.id.au>

[POWERPC] Fix sparse warnings in arch/powerpc/kernel

Make a few things static in lparcfg.c
Make init and exit routines static in rtas_flash.c
Make things static in rtas_pci.c
Make some functions static in rtas.c
Make fops static in rtas-proc.c
Remove unneeded extern for do_gtod in smp.c
Make clocksource_init() static in time.c
Make last_tick_len and ticklen_to_xs static in time.c
Move the declaration of the pvr per-cpu into smp.h
Make kexec_smp_down() and kexec_stack static in machine_kexec_64.c
Don't return void in arch_teardown_msi_irqs() in msi.c
Move declaration of GregorianDay()into asm/time.h

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 7fc5c784 01-May-2008 Roman Zippel <zippel@linux-m68k.org>

ntp: rename TICK_LENGTH_SHIFT to NTP_SCALE_SHIFT

As TICK_LENGTH_SHIFT is used for more than just the tick length, the name
isn't quite approriate anymore, so this renames it to NTP_SCALE_SHIFT.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 074b3b87 01-May-2008 Roman Zippel <zippel@linux-m68k.org>

ntp: increase time_freq resolution

This changes time_freq to a 64bit value and makes it static (the only outside
user had no real need to modify it). Intermediate values were already 64bit,
so the change isn't that big, but it saves a little in shifts by replacing
SHIFT_NSEC with TICK_LENGTH_SHIFT. PPM_SCALE is then used to convert between
user space and kernel space representation.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 06b8e878 06-Feb-2008 Michael Neuling <mikey@neuling.org>

taskstats scaled time cleanup

This moves the ability to scale cputime into generic code. This allows us
to fix the issue in kernel/timer.c (noticed by Balbir) where we could only
add an unscaled value to the scaled utime/stime.

This adds a cputime_to_scaled function. As before, the POWERPC version
does the scaling based on the last SPURR/PURR ratio calculated. The
generic and s390 (only other arch to implement asm/cputime.h) versions are
both NOPs.

Also moves the SPURR and PURR snapshots closer.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: Jay Lan <jlan@engr.sgi.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7ac5dde9 12-Dec-2007 Scott Wood <scottwood@freescale.com>

[POWERPC] Implement arch disable/enable irq hooks.

These hooks ensure that a decrementer interrupt is not pending when
suspending; otherwise, problems may occur on 6xx/7xx/7xxx-based
systems (except for powermacs, which use a separate suspend path).
For example, with deep sleep on the 831x, a pending decrementer will
cause a system freeze because the SoC thinks the decrementer interrupt
would have woken the system, but the core must have interrupts
disabled due to the setup required for deep sleep.

Changed via-pmu.c to use the new ppc_md hooks, and made the arch_*
functions call the generic_* functions unconditionally. -- paulus

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 53024fe2 13-Dec-2007 Milton Miller <miltonm@bga.com>

[POWERPC] Optimize account_system_vtime

We have multiple calls to has_feature being inlined, but gcc can't
be sure that the store via get_paca() doesn't alias the path to
cur_cpu_spec->feature.

Reorder to put the calls to read_purr and read_spurr adjacent to each
other. To add a sense of consistency, reorder the remaining lines to
perform parallel steps on purr and scaled purr of each line instead of
calculating and then using one value before going on to the next.

In addition, we can tell gcc that no SPURR means no PURR. The test is
completely hidden in the PURR case, and in the !PURR case the second test
is eliminated resulting in the simple register copy in the out-of-line
branch.

Further, gcc sees get_paca()->system_time referenced several times and
allocates a register to address it (shadowing r13) instead of caching its
value. Reading into a local varable saves the shadow of r13 and removes
a potentially duplicate load (between the nested if and its parent).

Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# db3801a8 13-Dec-2007 Milton Miller <miltonm@bga.com>

[POWERPC] Depend on ->initialized in calc_steal_time

If CPU_FTR_PURR is not set, we will never set cpu_purr_data->initialized.
Checking via __get_cpu_var on 64 bit avoids one dependent load compared
to cpu_has_feature in the not-present case, and is always required when
it is present. The code is under CONFIG_VIRT_CPU_ACCOUNTING so 32 bit
will not be affected.

Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 6e6b44e8 13-Dec-2007 Milton Miller <miltonm@bga.com>

[POWERPC] Timer interrupt: use a struct for two per_cpu varables

timer_interrupt() was calculating per_cpu_offset several times, having to
start from the toc because of potential aliasing issues.

Placing both decrementer per_cpu varables in a struct and calculating
the address once with __get_cpu_var results in better code on both 32
and 64 bit.

Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 8b5621f1 13-Dec-2007 Milton Miller <miltonm@bga.com>

[POWERPC] Use __get_cpu_var in time.c

Use __get_cpu_var(x) instead of per_cpu(x, smp_processor_id()), as it
is optimized on ppc64 to access the current cpu's per-cpu offset directly;
it's local_paca.offset instead of TOC->paca[local_paca->processor_id].offset.

This is the trivial portion, two functions with one use each.

Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# c481887f 13-Dec-2007 Milton Miller <miltonm@bga.com>

[POWERPC] init_decrementer_clockevent can be static __init

as its only called from time_init, which is __init.

Also remove unneeded forward declaration.

Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 2b46b567 19-Nov-2007 Michael Neuling <mikey@neuling.org>

[POWERPC] Fix possible division by zero in scaled time accounting

If we get no user time and no system time allocated since the last
account_system_vtime, the system to user time ratio estimate can end
up dividing by zero.

This was causing a problem noticed by Balbir Singh.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 0302f12e 11-Nov-2007 Tony Breeds <tony@bakeyournoodle.com>

[POWERPC] Demote clockevent printk to KERN_DEBUG

These don't need to be seen by everyone on every boot.

Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# fa13a5a1 09-Nov-2007 Paul Mackerras <paulus@samba.org>

sched: restore deterministic CPU accounting on powerpc

Since powerpc started using CONFIG_GENERIC_CLOCKEVENTS, the
deterministic CPU accounting (CONFIG_VIRT_CPU_ACCOUNTING) has been
broken on powerpc, because we end up counting user time twice: once in
timer_interrupt() and once in update_process_times().

This fixes the problem by pulling the code in update_process_times
that updates utime and stime into a separate function called
account_process_tick. If CONFIG_VIRT_CPU_ACCOUNTING is not defined,
there is a version of account_process_tick in kernel/timer.c that
simply accounts a whole tick to either utime or stime as before. If
CONFIG_VIRT_CPU_ACCOUNTING is defined, then arch code gets to
implement account_process_tick.

This also lets us simplify the s390 code a bit; it means that the s390
timer interrupt can now call update_process_times even when
CONFIG_VIRT_CPU_ACCOUNTING is turned on, and can just implement a
suitable account_process_tick().

account_process_tick() now takes the task_struct * as an argument.
Tested both with and without CONFIG_VIRT_CPU_ACCOUNTING.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 43875cc0 31-Oct-2007 Paul Mackerras <paulus@samba.org>

[POWERPC] Fix off-by-one error in setting decrementer on Book E/4xx (v2)

The decrementer in Book E and 4xx processors interrupts on the
transition from 1 to 0, rather than on the 0 to -1 transition as on
64-bit server and 32-bit "classic" (6xx/7xx/7xxx) processors. At the
moment we subtract 1 from the count of how many decrementer ticks are
required before the next interrupt before putting it into the
decrementer, which is correct for server/classic processors, but could
possibly cause the interrupt to happen too early on Book E and 4xx if
the timebase/decrementer frequency is low.

This fixes the problem by making set_dec subtract 1 from the count for
server and classic processors, instead of having the callers subtract
1. Since set_dec already had a bunch of ifdefs to handle different
processor types, there is no net increase in ugliness. :)

Note that calling set_dec(0) may not generate an interrupt on some
processors. To make sure that decrementer_set_next_event always calls
set_dec with an interval of at least 1 tick, we set min_delta_ns of
the decrementer_clockevent to correspond to 2 ticks (2 rather than 1
to compensate for truncations in the conversions between ticks and
ns).

This also removes a redundant call to set the decrementer to
0x7fffffff - it was already set to that earlier in timer_interrupt.

Signed-off-by: Paul Mackerras <paulus@samba.org>


# 4603ac18 18-Oct-2007 Michael Neuling <mikey@neuling.org>

powerpc: add scaled time accounting

This adds POWERPC specific hooks for scaled time accounting.

POWER6 includes a SPURR register. The SPURR is based off the PURR register
but is scaled based on CPU frequency and issue rates. This gives a more
accurate account of the instructions used per task. The PURR and timebase
will be constant relative to the wall clock, irrespective of the CPU
frequency.

This implementation reads the SPURR register in account_system_vtime which
is only call called on context witch and hard and soft irq entry and exit.
The percentage of user and system time is then estimated using the ratio of
these accounted by the PURR. If the SPURR is not present, the PURR read.

An earlier implementation of this patch read the SPURR whenever the PURR
was read, which included the system call entry and exit path.
Unfortunately this showed a performance regression on lmbench runs, so was
re-implemented.

I've included the lmbench results here when run bare metal on POWER6. 1st
column is the unpatch results. 2nd column is the results using the below
patch and the 3rd is the % diff of these results from the base. 4th and
5th columns are the results and % differnce from the base using the older
patch (SPURR read in syscall entry/exit path).

Base Scaled-Acct SPURR-in-syscall
Result Result % diff Result % diff
Simple syscall: 0.3086 0.3086 0.0000 0.3452 11.8600
Simple read: 0.4591 0.4671 1.7425 0.5044 9.86713
Simple write: 0.4364 0.4366 0.0458 0.4731 8.40971
Simple stat: 2.0055 2.0295 1.1967 2.0669 3.06158
Simple fstat: 0.5962 0.5876 -1.442 0.6368 6.80979
Simple open/close: 3.1283 3.1009 -0.875 3.2088 2.57328
Select on 10 fd's: 0.8554 0.8457 -1.133 0.8667 1.32101
Select on 100 fd's: 3.5292 3.6329 2.9383 3.6664 3.88756
Select on 250 fd's: 7.9097 8.1881 3.5197 8.2242 3.97613
Select on 500 fd's: 15.2659 15.836 3.7357 15.873 3.97814
Select on 10 tcp fd's: 0.9576 0.9416 -1.670 0.9752 1.83792
Select on 100 tcp fd's: 7.248 7.2254 -0.311 7.2685 0.28283
Select on 250 tcp fd's: 17.7742 17.707 -0.375 17.749 -0.1406
Select on 500 tcp fd's: 35.4258 35.25 -0.496 35.286 -0.3929
Signal handler installation: 0.6131 0.6075 -0.913 0.647 5.52927
Signal handler overhead: 2.0919 2.1078 0.7600 2.1831 4.35967
Protection fault: 0.7345 0.7478 1.8107 0.8031 9.33968
Pipe latency: 33.006 16.398 -50.31 33.475 1.42368
AF_UNIX sock stream latency: 14.5093 30.910 113.03 30.715 111.692
Process fork+exit: 219.8 222.8 1.3648 229.37 4.35623
Process fork+execve: 876.14 873.28 -0.32 868.66 -0.8533
Process fork+/bin/sh -c: 2830 2876.5 1.6431 2958 4.52296
File /var/tmp/XXX write bw: 1193497 1195536 0.1708 118657 -0.5799
Pagefaults on /var/tmp/XXX: 3.1272 3.2117 2.7020 3.2521 3.99398

Also, kernel compile times show no difference with this patch applied.

[pbadari@us.ibm.com: Avoid unnecessary PURR reading]
Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1281c8be 14-Oct-2007 Anton Blanchard <anton@samba.org>

[POWERPC] Quieten clockevent printk

The clockevent bootup message only needs to be KERN_INFO.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# cdec12ae 11-Oct-2007 Paul Mackerras <paulus@samba.org>

[POWERPC] Make clockevents work on PPC601 processors

In testing the new clocksource and clockevent code on a PPC601
processor, I discovered that the clockevent multiplier value for the
decrementer clockevent was overflowing. Because the RTCL register in
the 601 effectively counts at 1GHz (it doesn't actually, but it
increases by 128 every 128ns), and the shift value was 32, that meant
the multiplier value had to be 2^32, which won't fit in an unsigned
long on 32-bit. The same problem would arise on any platform where
the timebase frequency was 1GHz or more (not that we actually have any
such machines today).

This fixes it by reducing the shift value to 16. Doing the
calculations with a resolution of 2^-16 nanoseconds (15 femtoseconds)
should be quite adequate. :)

Signed-off-by: Paul Mackerras <paulus@samba.org>


# d968014b 08-Oct-2007 Paul Mackerras <paulus@samba.org>

[POWERPC] Prevent decrementer clockevents from firing early

On old powermacs, we sometimes set the decrementer to 1 in order to
trigger a decrementer interrupt, which we use to handle an interrupt
that was pending at the time when it was re-enabled. This was causing
the decrementer clock event device to call the event function for the
next event early, which was causing problems when high-res timers were
not enabled.

This fixes the problem by recording the timebase value at which the
next event should occur, and checking the current timebase against the
recorded value in timer_interrupt. If it isn't time for the next
event, it just reprograms the decrementer and returns.

This also subtracts 1 from the value stored into the decrementer,
which is appropriate because the decrementer interrupts on the
transition from 0 to -1, not when the decrementer reaches 0.

Signed-off-by: Paul Mackerras <paulus@samba.org>


# d831d0b8 20-Sep-2007 Tony Breeds <tony@bakeyournoodle.com>

[POWERPC] Implement clockevents driver for powerpc

This registers a clock event structure for the decrementer and turns
on CONFIG_GENERIC_CLOCKEVENTS, which means that we now don't need
most of timer_interrupt(), since the work is done in generic code.
For secondary CPUs, their decrementer clockevent is registered when
the CPU comes up (the generic code automatically removes the
clockevent when the CPU goes down).

Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 4a4cfe38 21-Sep-2007 Tony Breeds <tony@bakeyournoodle.com>

[POWERPC] Implement generic time of day clocksource for powerpc

Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# aa3be5f3 20-Sep-2007 Tony Breeds <tony@bakeyournoodle.com>

[POWERPC] Implement {read,update}_persistent_clock

With these functions implemented we cooperate better with the generic
timekeeping code. This obsoletes the need for the timer sysdev as a bonus.

Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# c27da339 18-Sep-2007 Benjamin Herrenschmidt <benh@kernel.crashing.org>

[POWERPC] Fix timekeeping on PowerPC 601

Recent changes to the timekeeping code broke support for the PowerPC 601
processor which doesn't have the usual timebase facility but a slightly
different thing called (yuck) the RTC.

This fixes it, boot tested on an old 601 based PowerMac 7200.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# aab69292 20-Aug-2007 Josh Boyer <jwboyer@linux.vnet.ibm.com>

[POWERPC] 40x decrementer fixes

Allow generic_calibrate_decr to work for 40x platforms. Given that the hardware
behavior is identical, this also changes the set_dec function to reload the PIT
on 40x to match the behavior 44x currently has.

Signed-off-by: Josh Boyer <jwboyer@linux.vnet.ibm.com>


# 9420dc65 29-Jul-2007 Jesper Juhl <jesper.juhl@gmail.com>

[POWERPC] Clean out a bunch of duplicate includes

This removes several duplicate includes from arch/powerpc/.

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 1474855d 20-Jul-2007 Bob Nelson <rrnelson@linux.vnet.ibm.com>

[CELL] oprofile: add support to OProfile for profiling CELL BE SPUs

From: Maynard Johnson <mpjohn@us.ibm.com>

This patch updates the existing arch/powerpc/oprofile/op_model_cell.c
to add in the SPU profiling capabilities. In addition, a 'cell' subdirectory
was added to arch/powerpc/oprofile to hold Cell-specific SPU profiling code.
Exports spu_set_profile_private_kref and spu_get_profile_private_kref which
are used by OProfile to store private profile information in spufs data
structures.

Also incorporated several fixes from other patches (rrn). Check pointer
returned from kzalloc. Eliminated unnecessary cast. Better error
handling and cleanup in the related area. 64-bit unsigned long parameter
was being demoted to 32-bit unsigned int and eventually promoted back to
unsigned long.

Signed-off-by: Carl Love <carll@us.ibm.com>
Signed-off-by: Maynard Johnson <mpjohn@us.ibm.com>
Signed-off-by: Bob Nelson <rrnelson@us.ibm.com>
Signed-off-by: Arnd Bergmann <arnd.bergmann@de.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>


# fc9069fe 03-Jul-2007 Tony Breeds <tony@bakeyournoodle.com>

[POWERPC] Modify sched_clock() to make CONFIG_PRINTK_TIME more sane

When booting a current kernel with CONFIG_PRINTK_TIME enabled you'll
see messages like:

[ 0.000000] time_init: decrementer frequency = 188.044000 MHz
[ 0.000000] time_init: processor frequency = 1504.352000 MHz
[3712914.436297] Console: colour dummy device 80x25

This cause by the initialisation of tb_to_ns_scale in time_init(), suddenly the
multiplication in sched_clock() now does something :). This patch modifies
sched_clock() to report the offset since the machine booted so the same
printk's now look like:

[ 0.000000] time_init: decrementer frequency = 188.044000 MHz
[ 0.000000] time_init: processor frequency = 1504.352000 MHz
[ 0.000135] Console: colour dummy device 80x25

Effectivly including the uptime in printk()s.

This patch makes tb_to_ns_scale and tb_to_ns_shift static and
read_mostly for good measure.

Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 71712b45 22-Jun-2007 Tony Breeds <tony@bakeyournoodle.com>

[POWERPC] Move iSeries_tb_recal into its own late_initcall.

Currently iSeries will recalibrate the cputime_factors in the first
settimeofday() call.

It seems the reason for doing this is to ensure a resaonable time delta after
time_init(). On current kernels (with udev), this call is made 40-60 seconds
into the boot process, by moving it to a late initcall it is called
approximately 5 seconds after time_init() is called. This is sufficient to
recalibrate the timebase.

Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
CC: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 4cefebb1 07-Jun-2007 Michael Neuling <mikey@neuling.org>

[POWERPC] Fix stolen time for SMT without LPAR

For POWERPC, stolen time accounts for cycles lost to the hypervisor or
PURR cycles attributed to the other SMT thread. Hence, when a PURR is
available, we should still calculate stolen time, irrespective of being
virtualised.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# df211c8a 22-May-2007 Nathan Lynch <ntl@pobox.com>

[POWERPC] Remove spinlock from struct cpu_purr_data

cpu_purr_data is a per-cpu array used to account for stolen time on
partitioned systems. It used to be the case that cpus accessed each
others' cpu_purr_data, so each entry was protected by a spinlock.

However, the code was reworked ("Simplify stolen time calculation")
with the result that each cpu accesses its own cpu_purr_data and not
those of other cpus. This means we can get rid of the spinlock as
long as we're careful to disable interrupts when accessing
cpu_purr_data in process context.

Signed-off-by: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# e147ec8f 11-May-2007 will schmidt <will_schmidt@vnet.ibm.com>

[POWERPC] Simplify smp_space_timers

Greatly simplify the function smp_space_timers.

The stolen time calculation (per comment within the code) doesn't need the
half-jiffy stagger any more. There isn't an issue with bouncing off global
locks, so we really shouldn't need any sort of staggering at all.

However, the last_jiffy value still needs to be set. This removes the
extra stagger logic, and just sets the values.

This change should benefit applications that rely on barrier
synchronization, and will help cut down OS jitter.

Boot tested across the board (G5,power3,power4,power5,970mp blade).

Signed-off-by: Will Schmidt <will_schmidt@vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# e2eb6392 03-Apr-2007 Stephen Rothwell <sfr@canb.auug.org.au>

[POWERPC] Rename get_property to of_get_property: arch/powerpc

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 501b6d29 20-Nov-2006 Stephen Rothwell <sfr@canb.auug.org.au>

[POWERPC] iSeries: fix time.c for combined build

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# df9c2309 21-Nov-2006 Kim Phillips <kim.phillips@freescale.com>

[POWERPC] Revert "[POWERPC] Add powerpc get/set_rtc_time interface to new generic rtc class"

This reverts commit 7a69af63e788a324d162201a0b23df41bcf158dd.

As advised by David Brownell:

http://marc.theaimsgroup.com/?l=linux-kernel&m=116387226902131&w=2

Signed-off-by: Kim Phillips <kim.phillips@freescale.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# cbcdb93d 17-Oct-2006 Stephen Rothwell <sfr@canb.auug.org.au>

[POWERPC] Simplify stolen time calculation

In calculating stolen time, we were trying to actually account for time
spent in the hypervisor. We don't really have enough information to do
that accurately, so don't try. Instead, we now calculate stolen time as
time that the current cpu thread is not actually dispatching instructions.
On chips without a PURR, we cannot do this, so stolen time will always
be zero. On chips with a PURR, this is merely the difference between
the elapsed PURR values and the elapsed TB values.

This gives us much more sane vaules from tools such as mpstat, even if
they are still a bit strange e.g. 2 busy threads on one cpu will both
appear to have 50% user time and 50% stolen time while 1 busy thread on
a cpu will look like 100% user on one of them and 100% idle on the other.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 35a84c2f 07-Oct-2006 Olaf Hering <olaf@aepfle.de>

[POWERPC] Fix up after irq changes

Remove struct pt_regs * from all handlers.
Also remove the regs argument from get_irq() functions.
Compile tested with arch/powerpc/config/* and
arch/ppc/configs/prep_defconfig

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 7d12e780 05-Oct-2006 David Howells <dhowells@redhat.com>

IRQ: Maintain regs pointer globally rather than passing to IRQ handlers

Maintain a per-CPU global "struct pt_regs *" variable which can be used instead
of passing regs around manually through all ~1800 interrupt handlers in the
Linux kernel.

The regs pointer is used in few places, but it potentially costs both stack
space and code to pass it around. On the FRV arch, removing the regs parameter
from all the genirq function results in a 20% speed up of the IRQ exit path
(ie: from leaving timer_interrupt() to leaving do_IRQ()).

Where appropriate, an arch may override the generic storage facility and do
something different with the variable. On FRV, for instance, the address is
maintained in GR28 at all times inside the kernel as part of general exception
handling.

Having looked over the code, it appears that the parameter may be handed down
through up to twenty or so layers of functions. Consider a USB character
device attached to a USB hub, attached to a USB controller that posts its
interrupts through a cascaded auxiliary interrupt controller. A character
device driver may want to pass regs to the sysrq handler through the input
layer which adds another few layers of parameter passing.

I've build this code with allyesconfig for x86_64 and i386. I've runtested the
main part of the code on FRV and i386, though I can't test most of the drivers.
I've also done partial conversion for powerpc and MIPS - these at least compile
with minimal configurations.

This will affect all archs. Mostly the changes should be relatively easy.
Take do_IRQ(), store the regs pointer at the beginning, saving the old one:

struct pt_regs *old_regs = set_irq_regs(regs);

And put the old one back at the end:

set_irq_regs(old_regs);

Don't pass regs through to generic_handle_irq() or __do_IRQ().

In timer_interrupt(), this sort of change will be necessary:

- update_process_times(user_mode(regs));
- profile_tick(CPU_PROFILING, regs);
+ update_process_times(user_mode(get_irq_regs()));
+ profile_tick(CPU_PROFILING);

I'd like to move update_process_times()'s use of get_irq_regs() into itself,
except that i386, alone of the archs, uses something other than user_mode().

Some notes on the interrupt handling in the drivers:

(*) input_dev() is now gone entirely. The regs pointer is no longer stored in
the input_dev struct.

(*) finish_unlinks() in drivers/usb/host/ohci-q.c needs checking. It does
something different depending on whether it's been supplied with a regs
pointer or not.

(*) Various IRQ handler function pointers have been moved to type
irq_handler_t.

Signed-Off-By: David Howells <dhowells@redhat.com>
(cherry picked from 1b16e7ac850969f38b375e511e3fa2f474a33867 commit)


# 7a69af63 26-Sep-2006 Kim Phillips <kim.phillips@freescale.com>

[POWERPC] Add powerpc get/set_rtc_time interface to new generic rtc class

Add powerpc get/set_rtc_time interface to new generic rtc class. This
abstracts rtc chip specific code from the platform code for rtc-over-i2c
platforms. Specific RTC chip support is now configured under
Device Drivers -> Real Time Clock. Setting time of day from the RTC
on startup is also configurable.

this time without the potentially platform breaking initcall.

Signed-off-by: Kim Phillips <kim.phillips@freescale.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 8ef38609 01-Oct-2006 Atsushi Nemoto <anemo@mba.ocn.ne.jp>

[PATCH] kill wall_jiffies

With 2.6.18-rc4-mm2, now wall_jiffies will always be the same as jiffies.
So we can kill wall_jiffies completely.

This is just a cleanup and logically should not change any real behavior
except for one thing: RTC updating code in (old) ppc and xtensa use a
condition "jiffies - wall_jiffies == 1". This condition is never met so I
suppose it is just a bug. I just remove that condition only instead of
kill the whole "if" block.

[heiko.carstens@de.ibm.com: s390 build fix and cleanup]
Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Hirokazu Takata <takata.hirokazu@renesas.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 3171a030 29-Sep-2006 Atsushi Nemoto <anemo@mba.ocn.ne.jp>

[PATCH] simplify update_times (avoid jiffies/jiffies_64 aliasing problem)

Pass ticks to do_timer() and update_times(), and adjust x86_64 and s390
timer interrupt handler with this change.

Currently update_times() calculates ticks by "jiffies - wall_jiffies", but
callers of do_timer() should know how many ticks to update. Passing ticks
get rid of this redundant calculation. Also there are another redundancy
pointed out by Martin Schwidefsky.

This cleanup make a barrier added by
5aee405c662ca644980c184774277fc6d0769a84 needless. So this patch removes
it.

As a bonus, this cleanup make wall_jiffies can be removed easily, since now
wall_jiffies is always synced with jiffies. (This patch does not really
remove wall_jiffies. It would be another cleanup patch)

Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Acked-by: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Hirokazu Takata <takata.hirokazu@renesas.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Acked-by: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# a4dc7ff0 18-Sep-2006 Paul Mackerras <paulus@samba.org>

[POWERPC] Define of_read_ulong helper

There are various places where we want to extract an unsigned long
value from a device-tree property that can be 1 or 2 cells in length.
This replaces some open-coded calculations, and one place where we
assumed without checking that properties were the length we wanted,
with a little of_read_ulong() helper.

Signed-off-by: Paul Mackerras <paulus@samba.org>


# eb36c288 30-Aug-2006 Paul Mackerras <paulus@samba.org>

[PATCH] ppc32: fix last_jiffy time comparison

This fixes a hang on ppc32.

The problem was that I was comparing a 32-bit quantity with a 64-bit
quantity, and consequently time wasn't advancing. This makes us use a
64-bit quantity on all platforms, which ends up simplifying the code
since we can now get rid of the tb_last_stamp variable (which actually
fixes another bug that Ben H and I noticed while going carefully through
the code).

This works fine on my G4 tibook. Let me know how it goes on your
machines.

Acked-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# e0d872d5 29-Aug-2006 Paul Mackerras <paulus@samba.org>

[POWERPC] Fix problem with time not advancing on 32-bit platforms

This fixes a problem introduced in 5db9fa9593e2ff69f2b95f9d59229dc4faaa564d.
The last_jiffy per-cpu variable is only 32 bits on 32-bit machines, but it
was being compared with a 64-bit quantity (tb_next_jiffy), which resulted in
time not advancing.

This fixes it by changing last_jiffy to be 64 bits on all platforms. With
this, we no longer need tb_last_stamp as a 32-bit version of tb_last_jiffy,
so this gets rid of tb_last_stamp and we just use tb_last_jiffy instead.
This also fixes a bug when the boot cpu is not online, because using
tb_last_stamp could have caused the wrong timebase origin value to be used
when calculating the time of day.

Signed-off-by: Paul Mackerras <paulus@samba.org>


# 5db9fa95 22-Aug-2006 Nathan Lynch <ntl@pobox.com>

[POWERPC] Fix gettimeofday inaccuracies

There are two problems in the powerpc gettimeofday code which can
cause incorrect results to be returned.

The first is that there is a race between do_gettimeofday and the
timer interrupt:

1. do_gettimeofday does get_tb()

2. decrementer exception on boot cpu which runs timer_recalc_offset,
which also samples the timebase and updates the do_gtod structure
with a greater timebase value.

3. do_gettimeofday calls __do_gettimeofday, which leads to the
negative result from tb_val - temp_varp->tb_orig_stamp.

The second is caused by taking the boot cpu offline, which can cause
the value of tb_last_jiffy to be increased past the currently
available timebase, causing the same underflow as above.

[paulus@samba.org - define and use data_barrier() instead of mb().]

Signed-off-by: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# a7f67bdf 11-Jul-2006 Jeremy Kerr <jk@ozlabs.org>

[POWERPC] Constify & voidify get_property()

Now that get_property() returns a void *, there's no need to cast its
return value. Also, treat the return value as const, so we can
constify get_property later.

powerpc core changes.

Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 6ab3d562 30-Jun-2006 Jörn Engel <joern@wohnheim.fh-wedel.de>

Remove obsolete #include <linux/config.h>

Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>


# 19923c19 26-Jun-2006 Roman Zippel <zippel@linux-m68k.org>

[PATCH] fix and optimize clock source update

This fixes the clock source updates in update_wall_time() to correctly
track the time coming in via current_tick_length(). Optimize the fast
paths to be as short as possible to keep the overhead low.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Acked-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 260a4230 26-Jun-2006 John Stultz <johnstul@us.ibm.com>

[PATCH] Time: Let user request precision from current_tick_length()

Change the current_tick_length() function so it takes an argument which
specifies how much precision to return in shifted nanoseconds. This provides
a simple way to convert between NTPs internal nanoseconds shifted by
(SHIFT_SCALE - 10) to other shifted nanosecond units that are used by the
clocksource abstraction.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 0bb474a4 20-Jun-2006 Anton Blanchard <anton@samba.org>

[POWERPC] support ibm,extended-*-frequency properties

Support the ibm,extended-*-frequency properties found in recent POWER5
firmware:

cpus/PowerPC,POWER5@0/clock-frequency
59aa5880 (1504336000)
cpus/PowerPC,POWER5@0/ibm,extended-clock-frequency
00000000 59aa5880
cpus/PowerPC,POWER5@0/timebase-frequency
0b354b10 (188042000)
cpus/PowerPC,POWER5@0/ibm,extended-timebase-frequency
00000000 0b354b10

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 4bd174fe 18-Apr-2006 Olof Johansson <olof@lixom.net>

[PATCH] powerpc: Remove stale iseries global

Not even the iSeries maintainer seems to have access to this legendary
piranha simulator. It adds a bit of ugliness in the common time init
code, and if it's no longer used we might as well be done with it and
remove the bloat.

Signed-off-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 224ad80a 12-Apr-2006 Olof Johansson <olof@lixom.net>

[PATCH] powerpc: Quiet time init output

Move time_init console output to KERN_DEBUG prink level. No need to
print it at every boot.

Signed-off-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 0e551954 28-Mar-2006 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

[PATCH] for_each_possible_cpu: powerpc

for_each_cpu() actually iterates across all possible CPUs. We've had mistakes
in the past where people were using for_each_cpu() where they should have been
iterating across only online or present CPUs. This is inefficient and
possibly buggy.

We're renaming for_each_cpu() to for_each_possible_cpu() to avoid this in the
future.

This patch replaces for_each_cpu with for_each_possible_cpu.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 0a45d449 14-Mar-2006 Paul Mackerras <paulus@samba.org>

powerpc: Fix problem with time going backwards

The recent changes to keep gettimeofday in sync with xtime had the side
effect that it was occasionally possible for the time reported by
gettimeofday to go back by a microsecond. There were two reasons:
(1) when we recalculated the offsets used by gettimeofday every 2^31
timebase ticks, we lost an accumulated fractional microsecond, and
(2) because the update is done some time after the notional start of
jiffy, if ntp is slowing the clock, it is possible to see time go backwards
when the timebase factor gets reduced.

This fixes it by (a) slowing the gettimeofday clock by about 1us in
2^31 timebase ticks (a factor of less than 1 in 3.7 million), and (b)
adjusting the timebase offsets in the rare case that the gettimeofday
result could possibly go backwards (i.e. when ntp is slowing the clock
and the timer interrupt is late). In this case the adjustment will
reduce to zero eventually because of (a).

Signed-off-by: Paul Mackerras <paulus@samba.org>


# 2cf82c02 26-Feb-2006 Paul Mackerras <paulus@samba.org>

powerpc: Export variables used in conversions to/from cputime_t

The inline cputime_to_foo and foo_to_cputime conversion functions in
include/asm-powerpc/cputime.h refer to 5 variables, which need to be
exported if those functions are to be usable from modules.

Signed-off-by: Paul Mackerras <paulus@samba.org>


# c6622f63 23-Feb-2006 Paul Mackerras <paulus@samba.org>

powerpc: Implement accurate task and CPU time accounting

This implements accurate task and cpu time accounting for 64-bit
powerpc kernels. Instead of accounting a whole jiffy of time to a
task on a timer interrupt because that task happened to be running at
the time, we now account time in units of timebase ticks according to
the actual time spent by the task in user mode and kernel mode. We
also count the time spent processing hardware and software interrupts
accurately. This is conditional on CONFIG_VIRT_CPU_ACCOUNTING. If
that is not set, we do tick-based approximate accounting as before.

To get this accurate information, we read either the PURR (processor
utilization of resources register) on POWER5 machines, or the timebase
on other machines on

* each entry to the kernel from usermode
* each exit to usermode
* transitions between process context, hard irq context and soft irq
context in kernel mode
* context switches.

On POWER5 systems with shared-processor logical partitioning we also
read both the PURR and the timebase at each timer interrupt and
context switch in order to determine how much time has been taken by
the hypervisor to run other partitions ("steal" time). Unfortunately,
since we need values of the PURR on both threads at the same time to
accurately calculate the steal time, and since we can only calculate
steal time on a per-core basis, the apportioning of the steal time
between idle time (time which we ceded to the hypervisor in the idle
loop) and actual stolen time is somewhat approximate at the moment.

This is all based quite heavily on what s390 does, and it uses the
generic interfaces that were added by the s390 developers,
i.e. account_system_time(), account_user_time(), etc.

This patch doesn't add any new interfaces between the kernel and
userspace, and doesn't change the units in which time is reported to
userspace by things such as /proc/stat, /proc/<pid>/stat, getrusage(),
times(), etc. Internally the various task and cpu times are stored in
timebase units, but they are converted to USER_HZ units (1/100th of a
second) when reported to userspace. Some precision is therefore lost
but there should not be any accumulating error, since the internal
accumulation is at full precision.

Signed-off-by: Paul Mackerras <paulus@samba.org>


# 092b8f34 19-Feb-2006 Paul Mackerras <paulus@samba.org>

powerpc: Keep xtime and gettimeofday in sync

This fixes a regression which was introduced by moving ppc32 to use
the same sort of lockless gettimeofday as ppc64 has been using for
some time. This involves getting the timebase and performing some
simple arithmetic to convert it to seconds and microseconds. However,
the factor and offset used there weren't being updated when NTP
varied the tick length using adjtimex. 64-bit didn't notice the
problem because it had a hook in the 32-bit adjtimex compat routine
that attempted to work out what the generic timekeeping code would
do and alter the factor and offset to match. However, that code
was very complex and it wasn't clear that it still matched what the
generic code would do.

Now we use the generic current_tick_length() routine that was recently
added to check that the current tick will be as long as we expect; if
not we recompute the factor and offset. This keeps gettimeofday and
xtime in sync. In addition we check that gettimeofday hasn't got ahead
of xtime on each timer interrupt; if it has, we resync.

Signed-off-by: Paul Mackerras <paulus@samba.org>


# d8a8188d 04-Feb-2006 Olaf Hering <olh@suse.de>

[PATCH] powerpc: remove pointer/integer confusion in generic_calibrate_decr

remove pointer/integer confusion

Signed-off-by: Olaf Hering <olh@suse.de>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 3356bb9f7 12-Jan-2006 David Gibson <david@gibson.dropbear.id.au>

[PATCH] powerpc: Remove lppaca structure from the PACA

At present the lppaca - the structure shared with the iSeries
hypervisor and phyp - is contained within the PACA, our own low-level
per-cpu structure. This doesn't have to be so, the patch below
removes it, making a separate array of lppaca structures.

This saves approximately 500*NR_CPUS bytes of image size and kernel
memory, because we don't need aligning gap between the Linux and
hypervisor portions of every PACA. On the other hand it means an
extra level of dereference in many accesses to the lppaca.

The patch also gets rid of several places where we assign the paca
address to a local variable for no particular reason.

Signed-off-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 404849bb 23-Nov-2005 David Gibson <david@gibson.dropbear.id.au>

[PATCH] powerpc: Remove some unneeded fields from the paca

This patch removes several unnecessary fields from the paca:

- next_jiffy_update_tb was simply unused. Remove trivially.

- The exdsi exception save area was not used. There were plans to use
it, but they never seem to have gone anywhere. If they ever do, we
can put it back. Remove from the paca, and from asm-offsets.c

- The default_decr field was used from asm, but was only ever assigned
the value of tb_ticks_per_jiffy. Just access tb_ticks_per_jiffy from
asm directly instead.

Built and booted on POWER5 LPAR and iSeries RS64.

Signed-off-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 6defa38b 17-Nov-2005 Paul Mackerras <paulus@samba.org>

powerpc: Fix delay functions for 601 processors

My earlier merge of delay.h introduced a timebase-based udelay for
32-bit machines but also broke the 601, which doesn't have the
timebase register. This fixes it by using the 601's RTC register on
the 601, and also moves __delay() and udelay() to be out-of-line in
arch/powerpc/kernel/time.c. These functions aren't really performance
critical, after all.

Signed-off-by: Paul Mackerras <paulus@samba.org>


# a7f290da 11-Nov-2005 Benjamin Herrenschmidt <benh@kernel.crashing.org>

[PATCH] powerpc: Merge vdso's and add vdso support to 32 bits kernel

This patch moves the vdso's to arch/powerpc, adds support for the 32
bits vdso to the 32 bits kernel, rename systemcfg (finally !), and adds
some new (still untested) routines to both vdso's: clock_gettime() with
support for CLOCK_REALTIME and CLOCK_MONOTONIC, clock_getres() (same
clocks) and get_tbfreq() for glibc to retreive the timebase frequency.

Tom,Steve: The implementation of get_tbfreq() I've done for 32 bits
returns a long long (r3, r4) not a long. This is such that if we ever
add support for >4Ghz timebases on ppc32, the userland interface won't
have to change.

I have tested gettimeofday() using some glibc patches in both ppc32 and
ppc64 kernels using 32 bits userland (I haven't had a chance to test a
64 bits userland yet, but the implementation didn't change and was
tested earlier). I haven't tested yet the new functions.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# cbe62e2b 09-Nov-2005 Paul Mackerras <paulus@samba.org>

powerpc: Fix SMP time initialization problem

We were getting the last_jiffy per-cpu variable set ahead of the current
timebase in smp_space_timers on SMP machines. This caused the loop in
timer_interrupt to loop virtually forever, since tb_ticks_since assumes
that it will never be called with the timebase behind the last_jiffy
value.

Signed-off-by: Paul Mackerras <paulus@samba.org>


# 799d6046 09-Nov-2005 Paul Mackerras <paulus@samba.org>

[PATCH] powerpc: merge code values for identifying platforms

This patch merges platform codes. systemcfg->platform is no longer used,
systemcfg use in general is deprecated as much as possible (and renamed
_systemcfg before it gets completely moved elsewhere in a future patch),
_machine is now used on ppc64 along as ppc32. Platform codes aren't gone
yet but we are getting a step closer. A bunch of asm code in head[_64].S
is also turned into C code.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 732ee21f 07-Nov-2005 Olof Johansson <olof@lixom.net>

[PATCH] POWERPC/PPC64: Fix CONFIG_SMP=n build for ppc64

Two CONFIG_SMP=n build fixes due to missing <asm/smp.h> includes.

Signed-off-by: Olof Johansson <olof@lixom.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 2249ca9d 06-Nov-2005 Paul Mackerras <paulus@samba.org>

powerpc: Various UP build fixes

Mostly this involves adding #include <asm/smp.h>, since that defines
things like boot_cpuid[_phys] and [gs]et_hard_smp_processor_id, which
are SMP-related but still needed on UP. This incorporates fixes
posted by Olof Johansson and Heikki Lindholm.

Signed-off-by: Paul Mackerras <paulus@samba.org>


# 8875ccfb 01-Nov-2005 Kelly Daly <kelly@au.ibm.com>

merge filename and modify references to iseries/it_lp_queue.h

Signed-off-by: Kelly Daly <kelly@au.ibm.com>


# 8021b8a7 01-Nov-2005 Kelly Daly <kelly@au.ibm.com>

merge filename and modify references to iseries/hv_call_xm.h

Signed-off-by: Kelly Daly <kelly@au.ibm.com>


# 734d6524 30-Oct-2005 Paul Mackerras <paulus@samba.org>

powerpc: apply recent changes to merged code

Signed-off-by: Paul Mackerras <paulus@samba.org>


# 5f6b5b97 30-Oct-2005 Paul Mackerras <paulus@samba.org>

powerpc: Fix time setting bug on 32-bit

This fixes a bug where settimeofday would set the wrong parameters
in do_gtod, resulting in gettimeofday returning a value about 4
hours after the correct time. The bug was that we divided a
negative 64-bit value with do_div, which treated it as unsigned
and gave us a result that was approximately 1.8e10 too large
(since the divisor was 1e9).

Signed-off-by: Paul Mackerras <paulus@samba.org>


# 0fd6f717 25-Oct-2005 Kumar Gala <galak@freescale.com>

[PATCH] powerpc: Add support for Book-E timer config to generic_calibrate_decr

We need to initialize some control SPRS for timers on Book-E before
we start taking decrementer interrupts.

Signed-off-by: Kumar K. Gala <kumar.gala@freescale.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# d2e61512 20-Oct-2005 Kumar Gala <galak@freescale.com>

[PATCH] powerpc: Make sure we have an RTC before trying to adjust it

Its valid for ppc_md.set_rtc_time to be NULL. We need to check
that its non-NULL before trying to update the RTC.

Signed-off-by: Kumar K. Gala <kumar.gala@freescale.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>


# 96c44507 23-Oct-2005 Paul Mackerras <paulus@samba.org>

powerpc: Fix time code for 601 processors

The 601 doesn't have the timebase register; instead it has an RTCL
register that counts nanoseconds and wraps at 1000000000, and an
RTCU register that counts seconds. This makes the necessary changes
for the merged time code to use the RTCL/U registers when the kernel
is running on a 601.

Signed-off-by: Paul Mackerras <paulus@samba.org>


# a5b518ed 21-Oct-2005 Paul Mackerras <paulus@samba.org>

ppc64/powerpc: Fix time initialization on SMP systems

This moves smp_space_timers from arch/ppc64/kernel/smp.c to
arch/powerpc/kernel/time.c and makes it initialize last_jiffy[]
instead of paca[].next_jiffy_update_tb, since last_jiffy[] is
now what the time code uses. It also declares smp_space_timers
in include/asm-powerpc/time.h and gets rid of an ifdef in
div128_by_32.

Signed-off-by: Paul Mackerras <paulus@samba.org>


# 5d14a18d 20-Oct-2005 Paul Mackerras <paulus@samba.org>

powerpc: Fix some bugs in the new merged time code

I had the sense of the test for when to use the old 601-style RTC
registers inverted. pmac_calibrate_decr and via_calibrate_decr
weren't setting ppc_tb_freq, on which all the further calculations
depended. Lastly, update_gtod was losing the top 32 bits of
the new tb_to_xs value.

Signed-off-by: Paul Mackerras <paulus@samba.org>


# 374e99d4 20-Oct-2005 Paul Mackerras <paulus@samba.org>

powerpc: Move some calculations from xxx_calibrate_decr to time_init

Previously the individual xxx_calibrate_decr functions would each
print the timebase and cpu frequency and calculate several values
such as tb_to_us and tb_to_xs. This moves those printks and
calculations into time_init just after the call to the platform's
calibrate_decr function.

Signed-off-by: Paul Mackerras <paulus@samba.org>


# f2783c15 19-Oct-2005 Paul Mackerras <paulus@samba.org>

powerpc: Merge time.c and asm/time.h.

We now use the merged time.c for both 32-bit and 64-bit compilation
with ARCH=powerpc, and for ARCH=ppc64, but not for ARCH=ppc32.
This removes setup_default_decr (folds its function into time_init)
and moves wakeup_decrementer into time.c. This also makes an
asm-powerpc/rtc.h.

Signed-off-by: Paul Mackerras <paulus@samba.org>