History log of /linux-master/arch/powerpc/lib/memcpy_power7.S
# 8488cdcb 29-Feb-2024 Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s: Move dcbt/dcbtst sequence into a macro

There's an almost identical code sequence to specify load/store access
hints in __copy_tofrom_user_power7(), copypage_power7() and
memcpy_power7().

Move the sequence into a common macro, which is passed the registers to
use as they differ slightly.

There also needs to be a copy in the selftests; it could be shared in
future if the headers are cleaned up / refactored.
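
For illustration, the shared sequence looks roughly like this once wrapped
in a GAS macro (a sketch only: the macro name here is hypothetical, and the
encodings follow the existing power7 copy loops, where the registers carry
each stream's start address and its length/depth descriptor):

	.macro SETUP_STREAMS from, flen, to, tlen, go
	dcbt	0,\from,0b01000		/* start of read stream */
	dcbt	0,\flen,0b01010		/* length and depth of read stream */
	dcbtst	0,\to,0b01000		/* start of write stream */
	dcbtst	0,\tlen,0b01010		/* length and depth of write stream */
	eieio
	dcbt	0,\go,0b01010		/* all streams GO */
	.endm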

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20240229122521.762431-1-mpe@ellerman.id.au


# 4e991e3c 07-Apr-2023 Nicholas Piggin <npiggin@gmail.com>

powerpc: add CFUNC assembly label annotation

This macro is to be used in assembly where C functions are called.
pcrel addressing mode requires branches to functions with a
localentry value of 1 to have either a trailing nop or @notoc.
This macro permits the latter without changing callers.
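
A minimal sketch of the idea, assuming CONFIG_PPC_KERNEL_PCREL as the gate
(the callee name below is a stand-in):

	#ifdef CONFIG_PPC_KERNEL_PCREL
	#define CFUNC(name)	name@notoc
	#else
	#define CFUNC(name)	name
	#endif

	/* call sites keep their existing shape: */
	bl	CFUNC(some_c_function)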

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Add dummy definitions to fix selftests build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230408021752.862660-5-npiggin@gmail.com


# 1a59d1b8 27-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 156

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version this program is distributed in the
hope that it will be useful but without any warranty without even
the implied warranty of merchantability or fitness for a particular
purpose see the gnu general public license for more details you
should have received a copy of the gnu general public license along
with this program if not write to the free software foundation inc
59 temple place suite 330 boston ma 02111 1307 usa

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 1334 file(s).
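
For an assembler file such as this one, the whole block quoted above
collapses to a single first-line tag:

	/* SPDX-License-Identifier: GPL-2.0-or-later */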

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070033.113240726@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 98c45f51 03-Aug-2018 Paul Mackerras <paulus@ozlabs.org>

selftests/powerpc/64: Test all paths through copy routines

The hand-coded assembler 64-bit copy routines include feature sections
that select one code path or another depending on which CPU we are
executing on. The self-tests for these copy routines end up testing
just one path. This adds a mechanism for selecting any desired code
path at compile time, and makes 2 or 3 versions of each test, each
using a different code path, so as to cover all the possible paths.
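
Roughly, the selection works by building each test several times with a
selector macro forced to a different value, and having the selftest headers
turn the feature-section macros into plain assembler conditionals (a sketch;
the exact macro spellings may differ):

	/* in the .S file, built once per -DSELFTEST_CASE=<n>: */
	#ifndef SELFTEST_CASE
	#define SELFTEST_CASE	0	/* kernel build: run-time fixups apply */
	#endif
	test_feature = SELFTEST_CASE

	/* in the selftest headers: */
	#define BEGIN_FTR_SECTION		.if test_feature
	#define FTR_SECTION_ELSE		.else
	#define ALT_FTR_SECTION_END_IFCLR(x)	.endif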

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
[mpe: Add -mcpu=power4 to CFLAGS for older compilers]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# d58badfb 06-Jun-2018 Simon Guo <wei.guo.simon@gmail.com>

powerpc/64: enhance memcmp() with VMX instructions for long byte comparisons

This patch adds VMX primitives to do memcmp() when the compare size
is equal to or greater than 4K bytes. The KSM feature can benefit from this.

Test results with the following test program:
------
# cat tools/testing/selftests/powerpc/stringloops/memcmp.c
#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include "utils.h"

#define SIZE (1024 * 1024 * 900)
#define ITERATIONS 40

int test_memcmp(const void *s1, const void *s2, size_t n);

static int testcase(void)
{
	char *s1;
	char *s2;
	unsigned long i;

	s1 = memalign(128, SIZE);
	if (!s1) {
		perror("memalign");
		exit(1);
	}

	s2 = memalign(128, SIZE);
	if (!s2) {
		perror("memalign");
		exit(1);
	}

	/* identical buffers, so memcmp() must scan the full length */
	for (i = 0; i < SIZE; i++) {
		s1[i] = i & 0xff;
		s2[i] = i & 0xff;
	}

	for (i = 0; i < ITERATIONS; i++) {
		int ret = test_memcmp(s1, s2, SIZE);

		if (ret) {
			printf("return %d at [%lu]! should have returned zero\n", ret, i);
			abort();
		}
	}

	return 0;
}

int main(void)
{
	return test_harness(testcase, "memcmp");
}
------
Without this patch (but with the first patch "powerpc/64: Align bytes
before fall back to .Lshort in powerpc64 memcmp()." in the series):
4.726728762 seconds time elapsed ( +- 3.54%)
With VMX patch:
4.234335473 seconds time elapsed ( +- 2.63%)
This is a ~10% improvement.

Testing with unaligned buffers at varying offsets (shifting s1 and s2 by a
random offset within 16 bytes) achieves an improvement of more than 10%.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 15a3204d 20-Feb-2018 Nicholas Piggin <npiggin@gmail.com>

powerpc/64s: Set assembler machine type to POWER4

Rather than override the machine type in .S code (which can hide wrong
or ambiguous code generation for the target), set the type to power4
for all assembly.

This also means we need to be careful not to build power4-only code
when we're not building for Book3S, such as the "power7" versions of
copyuser/page/memcpy.
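
Schematically, this replaces per-file overrides of the form

	.machine push
	.machine "power4"
	/* ...instructions beyond the compiler's baseline... */
	.machine pop

with one assembler flag set for all assembly, along the lines of
(a sketch, not the exact Makefile hunk):

	KBUILD_AFLAGS += -Wa,-mpower4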

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fix Book3E build, don't build the "power7" variants for non-Book3S]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# 8a583c0a 05-Aug-2017 Andreas Schwab <schwab@linux-m68k.org>

powerpc: Fix invalid use of register expressions

binutils >= 2.26 now warns about misuse of register expressions in
assembler operands that are actually literals, for example:

arch/powerpc/kernel/entry_64.S:535: Warning: invalid register expression

In practice these are almost all uses of r0 that should just be a
literal 0.
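
In this file the warnings come from indexed loads/stores and cache hints,
where an RA operand of zero means "no base register" and must be written as
a literal. The fixes are of this shape (illustrative, not the exact hunks):

	-	lvx	v1,r0,r4
	+	lvx	v1,0,r4

	-	dcbt	r0,r4,0b01010
	+	dcbt	0,r4,0b01010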

Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
[mpe: Mention r0 is almost always the culprit, fold in purgatory change]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# c2ce6f9f 09-Feb-2015 Anton Blanchard <anton@samba.org>

powerpc: Change vrX register defines to vX to match gcc and glibc

As our various loops (copy, string, crypto etc) get more complicated,
we want to share implementations between userspace (eg glibc) and
the kernel. We also want to write userspace test harnesses to put
in tools/testing/selftests.

One gratuitous difference between userspace and the kernel is the
VMX register definitions - the kernel uses vrX whereas both gcc and
glibc use vX.

Change the kernel to match userspace.
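
In this file that is a mechanical rename, e.g. (illustrative):

	-	lvx	vr1,r0,r4
	+	lvx	v1,r0,r4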

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


# c2522dcd 26-Sep-2014 Paul Bolle <pebolle@tiscali.nl>

powerpc: Fix comment typos 'CONFiG_ALTIVEC'

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>


# 752a6422 14-Feb-2014 Ulrich Weigand <ulrich.weigand@de.ibm.com>

powerpc: Fix unsafe accesses to parameter area in ELFv2

Some of the assembler files in lib/ make use of the fact that in the
ELFv1 ABI, the caller guarantees to provide stack space to save the
parameter registers r3 ... r10. This guarantee is no longer present
in ELFv2 for functions that have no variable argument list and no
more than 8 arguments.

Change the affected routines to temporarily store registers in the
red zone and/or the top of their own stack frame (in the space
provided to save r31 .. r29, which is actually not used in these
routines).
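
In the copy loops the pattern looks like this (a sketch; the exact
registers vary by routine):

	/* ELFv1 only: the caller's parameter save area */
	std	r3,STK_PARAM(R3)(r1)

	/* ELFv2-safe: below the stack pointer, i.e. in the red zone, where
	 * r31's save slot would sit if the frame were extended */
	std	r3,-STACKFRAMESIZE+STK_REG(R31)(r1)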

In opal_query_takeover, simply always allocate a stack frame;
the routine is not performance critical.

Signed-off-by: Ulrich Weigand <ulrich.weigand@de.ibm.com>
Signed-off-by: Anton Blanchard <anton@samba.org>


# b37c10d1 03-Feb-2014 Anton Blanchard <anton@samba.org>

powerpc: Fix ABIv2 issues with stack offsets in assembly code

Fix STK_PARAM and use it instead of hardcoding ABIv1 offsets.
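
The offset differs because the two ABIs reserve different fixed header
sizes at the bottom of a frame; schematically (constants from the ELF ABI
documents; the config test shown is the one used in that era):

	#if defined(_CALL_ELF) && _CALL_ELF == 2
	#define STK_PARAM(i)	(32 + ((i)-3)*8)	/* ELFv2 header: 32 bytes */
	#else
	#define STK_PARAM(i)	(48 + ((i)-3)*8)	/* ELFv1 header: 48 bytes */
	#endif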

Signed-off-by: Anton Blanchard <anton@samba.org>


# b1576fec 03-Feb-2014 Anton Blanchard <anton@samba.org>

powerpc: No need to use dot symbols when branching to a function

binutils is smart enough to know that a branch to a function
descriptor is actually a branch to the function's text address.
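
So call sites simply lose the dot, e.g. (illustrative):

	-	bl	.memcpy
	+	bl	memcpy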

Alan tells me that binutils has been doing this for 9 years.

Signed-off-by: Anton Blanchard <anton@samba.org>


# 22d651dc 20-Jan-2014 Michael Ellerman <mpe@ellerman.id.au>

selftests/powerpc: Import Anton's memcpy / copy_tofrom_user tests

Turn Anton's memcpy / copy_tofrom_user test into something that can
live in tools/testing/selftests.

It requires one turd in arch/powerpc/lib/memcpy_64.S, but it's pretty
harmless IMHO.

We are sailing very close to the wind with the feature macros. We define
them to nothing, which currently means we get a few extra nops and
include the unaligned calls.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 32ee1e18 22-Sep-2013 Anton Blanchard <anton@samba.org>

powerpc: Fix endian issues in VMX copy loops

Fix the permute loops for little endian.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# c8adfecc 01-Oct-2012 Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

powerpc: Fix VMX fix for memcpy case

In 2fae7cdb60240e2e2d9b378afbf6d9fcce8a3890 ("powerpc: Fix VMX in
interrupt check in POWER7 copy loops"), Anton inadvertently
introduced a regression for memcpy on POWER7 machines. copyuser and
memcpy diverge slightly in their use of cr1 (copyuser doesn't use it,
but memcpy does) and the fix ends up clobbering that register.
That results in (taken from an FC18 kernel):

[ 18.824604] Unrecoverable VMX/Altivec Unavailable Exception f20 at c000000000052f40
[ 18.824618] Oops: Unrecoverable VMX/Altivec Unavailable Exception, sig: 6 [#1]
[ 18.824623] SMP NR_CPUS=1024 NUMA pSeries
[ 18.824633] Modules linked in: tg3(+) be2net(+) cxgb4(+) ipr(+) sunrpc xts lrw gf128mul dm_crypt dm_round_robin dm_multipath linear raid10 raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua squashfs cramfs
[ 18.824705] NIP: c000000000052f40 LR: c00000000020b874 CTR: 0000000000000512
[ 18.824709] REGS: c000001f1fef7790 TRAP: 0f20 Not tainted (3.6.0-0.rc6.git0.2.fc18.ppc64)
[ 18.824713] MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI> CR: 4802802e XER: 20000010
[ 18.824726] SOFTE: 0
[ 18.824728] CFAR: 0000000000000f20
[ 18.824731] TASK = c000000fa7128400[0] 'swapper/24' THREAD: c000000fa7480000 CPU: 24
GPR00: 00000000ffffffc0 c000001f1fef7a10 c00000000164edc0 c000000f9b9a8120
GPR04: c000000f9b9a8124 0000000000001438 0000000000000060 03ffffff064657ee
GPR08: 0000000080000000 0000000000000010 0000000000000020 0000000000000030
GPR12: 0000000028028022 c00000000ff25400 0000000000000001 0000000000000000
GPR16: 0000000000000000 7fffffffffffffff c0000000016b2180 c00000000156a500
GPR20: c000000f968c7a90 c0000000131c31d8 c000001f1fef4000 c000000001561d00
GPR24: 000000000000000a 0000000000000000 0000000000000001 0000000000000012
GPR28: c000000fa5c04f80 00000000000008bc c0000000015c0a28 000000000000022e
[ 18.824792] NIP [c000000000052f40] .memcpy_power7+0x5a0/0x7c4
[ 18.824797] LR [c00000000020b874] .pcpu_free_area+0x174/0x2d0
[ 18.824800] Call Trace:
[ 18.824803] [c000001f1fef7a10] [c000000000052c14] .memcpy_power7+0x274/0x7c4 (unreliable)
[ 18.824809] [c000001f1fef7b10] [c00000000020b874] .pcpu_free_area+0x174/0x2d0
[ 18.824813] [c000001f1fef7bb0] [c00000000020ba88] .free_percpu+0xb8/0x1b0
[ 18.824819] [c000001f1fef7c50] [c00000000043d144] .throtl_pd_exit+0x94/0xd0
[ 18.824824] [c000001f1fef7cf0] [c00000000043acf8] .blkg_free+0x88/0xe0
[ 18.824829] [c000001f1fef7d90] [c00000000018c048] .rcu_process_callbacks+0x2e8/0x8a0
[ 18.824835] [c000001f1fef7e90] [c0000000000a8ce8] .__do_softirq+0x158/0x4d0
[ 18.824840] [c000001f1fef7f90] [c000000000025ecc] .call_do_softirq+0x14/0x24
[ 18.824845] [c000000fa7483650] [c000000000010e80] .do_softirq+0x160/0x1a0
[ 18.824850] [c000000fa74836f0] [c0000000000a94a4] .irq_exit+0xf4/0x120
[ 18.824854] [c000000fa7483780] [c000000000020c44] .timer_interrupt+0x154/0x4d0
[ 18.824859] [c000000fa7483830] [c000000000003be0] decrementer_common+0x160/0x180
[ 18.824866] --- Exception: 901 at .plpar_hcall_norets+0x84/0xd4
[ 18.824866] LR = .check_and_cede_processor+0x48/0x80
[ 18.824871] [c000000fa7483b20] [c00000000007f018] .check_and_cede_processor+0x18/0x80 (unreliable)
[ 18.824877] [c000000fa7483b90] [c00000000007f104] .dedicated_cede_loop+0x84/0x150
[ 18.824883] [c000000fa7483c50] [c0000000006bc030] .cpuidle_enter+0x30/0x50
[ 18.824887] [c000000fa7483cc0] [c0000000006bc9f4] .cpuidle_idle_call+0x104/0x720
[ 18.824892] [c000000fa7483d80] [c000000000070af8] .pSeries_idle+0x18/0x40
[ 18.824897] [c000000fa7483df0] [c000000000019084] .cpu_idle+0x1a4/0x380
[ 18.824902] [c000000fa7483ec0] [c0000000008a4c18] .start_secondary+0x520/0x528
[ 18.824907] [c000000fa7483f90] [c0000000000093f0] .start_secondary_prolog+0x10/0x14
[ 18.824911] Instruction dump:
[ 18.824914] 38840008 90030000 90e30004 38630008 7ca62850 7cc300d0 78c7e102 7cf01120
[ 18.824923] 78c60660 39200010 39400020 39600030 <7e00200c> 7c0020ce 38840010 409f001c
[ 18.824935] ---[ end trace 0bb95124affaaa45 ]---
[ 18.825046] Unrecoverable VMX/Altivec Unavailable Exception f20 at c000000000052d08

I believe the right fix is to make memcpy match usercopy and not use
cr1.

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: <stable@kernel.org> [v3.6]


# 2fae7cdb 07-Aug-2012 Anton Blanchard <anton@samba.org>

powerpc: Fix VMX in interrupt check in POWER7 copy loops

The enhanced prefetch hint patches corrupt the condition register
that was used to check if we are in interrupt. Fix this by using cr1.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# 44ce6a5e 25-Jun-2012 Michael Neuling <mikey@neuling.org>

powerpc: Merge STK_REG/PARAM/FRAMESIZE

Merge the defines of STACKFRAMESIZE, STK_REG, STK_PARAM from different
places.
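
After the merge a single copy lives in ppc_asm.h, roughly (values as used
by the 64-bit copy loops):

	#define STACKFRAMESIZE	256
	#define STK_REG(i)	(112 + ((i)-14)*8)	/* save slots for r14..r31 */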

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# c75df6f9 25-Jun-2012 Michael Neuling <mikey@neuling.org>

powerpc: Fix usage of register macros getting ready for %r0 change

Anything that uses a constructed instruction (i.e. from ppc-opcode.h)
needs to use the new R0 macro, as %r0 is not going to work.

Also convert usages of macros where we are just determining an offset
(usually for a load/store), like:

	std	r14,STK_REG(r14)(r1)

We can't use STK_REG(r14) here, as %r14 doesn't work in the STK_REG macro
since it's just calculating an offset.
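
With the new uppercase macros, which expand to plain numbers, the offset
case becomes (illustrative):

	std	r14,STK_REG(R14)(r1)	/* R14 expands to the number 14 */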

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


# b3f271e8 30-May-2012 Anton Blanchard <anton@samba.org>

powerpc: POWER7 optimised memcpy using VMX and enhanced prefetch

Implement a POWER7 optimised memcpy using VMX and enhanced prefetch
instructions.

This is a copy of the POWER7 optimised copy_to_user/copy_from_user
loop. Detailed implementation and performance details can be found in
commit a66086b8197d (powerpc: POWER7 optimised
copy_to_user/copy_from_user using VMX).

I noticed memcpy issues when profiling a RAID6 workload:

.memcpy
.async_memcpy
.async_copy_data
.__raid_run_ops
.handle_stripe
.raid5d
.md_thread

I created a simplified testcase by building a RAID6 array with 4 1GB
ramdisks (booting with brd.rd_size=1048576):

# mdadm -CR -e 1.2 /dev/md0 --level=6 -n4 /dev/ram[0-3]

I then timed how long it took to write to the entire array:

# dd if=/dev/zero of=/dev/md0 bs=1M

Before: 892 MB/s
After: 999 MB/s

A 12% improvement.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>