History log of /linux-master/tools/perf/bench/Build
Revision Date Author Comments
# 2df27071 02-Jun-2023 Arnaldo Carvalho de Melo <acme@redhat.com>

perf bench uprobe: Add benchmark to test uprobe overhead

This just adds the initial "workload", a call to libc's usleep(1000us)
function:

$ perf stat --null perf bench uprobe all
# Running uprobe/baseline benchmark...
# Executed 1000 usleep(1000) calls
Total time: 1053533 usecs

1053.533 usecs/op

Performance counter stats for 'perf bench uprobe all':

1.061042896 seconds time elapsed

0.001079000 seconds user
0.006499000 seconds sys

$

More entries will be added using a BPF skel to add various uprobes to
the usleep() function.

Acked-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andre Fredette <anfredet@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Dave Tucker <datucker@redhat.com>
Cc: Derek Barbosa <debarbos@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/lkml/20230719204910.539044-2-acme@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>


# 7d5cb68a 08-Mar-2023 Andrei Vagin <avagin@google.com>

perf/benchmark: add a new benchmark for seccom_unotify

The benchmark is similar to the pipe benchmark. It creates two processes,
one is calling syscalls, and another process is handling them via seccomp
user notifications. It measures the time required to run a specified number
of interations.

$ ./perf bench sched seccomp-notify --sync-mode --loop 1000000
# Running 'sched/seccomp-notify' benchmark:
# Executed 1000000 system calls

Total time: 2.769 [sec]

2.769629 usecs/op
361059 ops/sec

$ ./perf bench sched seccomp-notify
# Running 'sched/seccomp-notify' benchmark:
# Executed 1000000 system calls

Total time: 8.571 [sec]

8.571119 usecs/op
116670 ops/sec

Signed-off-by: Andrei Vagin <avagin@google.com>
Acked-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230308073201.3102738-7-avagin@google.com
Link: https://lore.kernel.org/r/20230630051953.454638-1-avagin@gmail.com
[kees: Added PRIu64 format string]
Signed-off-by: Kees Cook <keescook@chromium.org>


# f6a7bbbf 31-Mar-2023 Namhyung Kim <namhyung@kernel.org>

perf bench: Add pmu-scan benchmark

The pmu-scan benchmark will repeatedly scan the sysfs to get the
available PMU information.

$ ./perf bench internals pmu-scan
# Running 'internals/pmu-scan' benchmark:
Computing performance of sysfs PMU event scan for 100 times
Average PMU scanning took: 6850.990 usec (+- 48.445 usec)

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Ian Rogers <irogers@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230331202949.810326-2-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>


# 68a6772f 05-May-2022 Dmitry Vyukov <dvyukov@google.com>

perf bench: Add breakpoint benchmarks

Add 2 benchmarks:

1. Performance of thread creation/exiting in presence of breakpoints.
2. Performance of breakpoint modification in presence of threads.

The benchmarks capture use cases that we are interested in:
using inheritable breakpoints in large highly-threaded applications.

The benchmarks show significant slowdown imposed by breakpoints
(even when they don't fire).

Testing on Intel 8173M with 112 HW threads show:

perf bench --repeat=56 breakpoint thread --breakpoints=0 --parallelism=56 --threads=20
78.675000 usecs/op
perf bench --repeat=56 breakpoint thread --breakpoints=4 --parallelism=56 --threads=20
12967.135714 usecs/op

That's 165x slowdown due to presence of the breakpoints.

perf bench --repeat=20000 breakpoint enable --passive=0 --active=0
1.433250 usecs/op
perf bench --repeat=20000 breakpoint enable --passive=224 --active=0
585.318400 usecs/op
perf bench --repeat=20000 breakpoint enable --passive=0 --active=111
635.953000 usecs/op

That's 408x and 444x slowdown due to presence of threads.

Profiles show some overhead in toggle_bp_slot,
but also very high contention:

90.83% breakpoint-thre [kernel.kallsyms] [k] osq_lock
4.69% breakpoint-thre [kernel.kallsyms] [k] mutex_spin_on_owner
2.06% breakpoint-thre [kernel.kallsyms] [k] __reserve_bp_slot
2.04% breakpoint-thre [kernel.kallsyms] [k] toggle_bp_slot

79.01% breakpoint-enab [kernel.kallsyms] [k] smp_call_function_single
9.94% breakpoint-enab [kernel.kallsyms] [k] llist_add_batch
5.70% breakpoint-enab [kernel.kallsyms] [k] _raw_spin_lock_irq
1.84% breakpoint-enab [kernel.kallsyms] [k] event_function_call
1.12% breakpoint-enab [kernel.kallsyms] [k] send_call_function_single_ipi
0.37% breakpoint-enab [kernel.kallsyms] [k] generic_exec_single
0.24% breakpoint-enab [kernel.kallsyms] [k] __perf_event_disable
0.20% breakpoint-enab [kernel.kallsyms] [k] _perf_event_enable
0.18% breakpoint-enab [kernel.kallsyms] [k] toggle_bp_slot

Committer notes:

Fixup struct init for older compilers:

3 32.90 alpine:3.5 : FAIL clang version 3.8.1 (tags/RELEASE_381/final)
bench/breakpoint.c:49:34: error: missing field 'size' initializer [-Werror,-Wmissing-field-initializers]
struct perf_event_attr attr = {0};
^
1 error generated.
7 37.31 alpine:3.9 : FAIL gcc version 8.3.0 (Alpine 8.3.0)
bench/breakpoint.c:49:34: error: missing field 'size' initializer [-Werror,-Wmissing-field-initializers]
struct perf_event_attr attr = {0};
^
1 error generated.

Signed-off-by: Dmitriy Vyukov <dvyukov@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Ian Rogers <irogers@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220505155745.1690906-1-dvyukov@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>


# 4241eabf 09-Aug-2021 Riccardo Mancini <rickyman7@gmail.com>

perf bench: Add benchmark for evlist open/close operations

This new benchmark finds the total time that is taken to open, mmap,
enable, disable, munmap, close an evlist (time taken for new,
create_maps, config, delete is not counted in).

The evlist can be configured as in perf-record using the
-a,-C,-e,-u,--per-thread,-t,-p options.

The events can be duplicated in the evlist to quickly test performance
with many events using the -n options.

Furthermore, also the number of iterations used to calculate the
statistics is customizable.

Examples:
- Open one dummy event system-wide:

$ sudo ./perf bench internals evlist-open-close
Number of cpus: 4
Number of threads: 1
Number of events: 1 (4 fds)
Number of iterations: 100
Average open-close took: 613.870 usec (+- 32.852 usec)

- Open the group '{cs,cycles}' on CPU 0

$ sudo ./perf bench internals evlist-open-close -e '{cs,cycles}' -C 0
Number of cpus: 1
Number of threads: 1
Number of events: 2 (2 fds)
Number of iterations: 100
Average open-close took: 8503.220 usec (+- 252.652 usec)

- Open 10 'cycles' events for user 0, calculate average over 100 runs

$ sudo ./perf bench internals evlist-open-close -e cycles -n 10 -u 0 -i 100
Number of cpus: 4
Number of threads: 328
Number of events: 10 (13120 fds)
Number of iterations: 100
Average open-close took: 180043.140 usec (+- 2295.889 usec)

Committer notes:

Replaced a deprecated bzero() call with designated initialized zeroing.

Added some missing evlist allocation checks, one noted by Riccardo on
the mailing list.

Minor cosmetic changes (sent in private).

Signed-off-by: Riccardo Mancini <rickyman7@gmail.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/20210809201101.277594-1-rickyman7@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>


# 0bf02a0d 12-Oct-2020 Namhyung Kim <namhyung@kernel.org>

perf bench: Add build-id injection benchmark

Sometimes I can see that 'perf record' piped with 'perf inject' take a
long time processing build-ids.

So introduce a inject-build-id benchmark to the internals benchmark
suite to measure its overhead regularly.

It runs the 'perf inject' command internally and feeds the given number
of synthesized events (MMAP2 + SAMPLE basically).

Usage: perf bench internals inject-build-id <options>

-i, --iterations <n> Number of iterations used to compute average (default: 100)
-m, --nr-mmaps <n> Number of mmap events for each iteration (default: 100)
-n, --nr-samples <n> Number of sample events per mmap event (default: 100)
-v, --verbose be more verbose (show iteration count, DSO name, etc)

By default, it measures average processing time of 100 MMAP2 events
and 10000 SAMPLE events. Below is a result on my laptop.

$ perf bench internals inject-build-id
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 25.789 msec (+- 0.202 msec)
Average time per event: 2.528 usec (+- 0.020 usec)
Average memory usage: 8411 KB (+- 7 KB)

Committer testing:

$ perf bench
Usage:
perf bench [<common options>] <collection> <benchmark> [<options>]

# List of all available benchmark collections:

sched: Scheduler and IPC benchmarks
syscall: System call benchmarks
mem: Memory access benchmarks
numa: NUMA scheduling and MM benchmarks
futex: Futex stressing benchmarks
epoll: Epoll stressing benchmarks
internals: Perf-internals benchmarks
all: All benchmarks

$ perf bench internals

# List of available benchmarks for collection 'internals':

synthesize: Benchmark perf event synthesis
kallsyms-parse: Benchmark kallsyms parsing
inject-build-id: Benchmark build-id injection

$ perf bench internals inject-build-id
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 14.202 msec (+- 0.059 msec)
Average time per event: 1.392 usec (+- 0.006 usec)
Average memory usage: 12650 KB (+- 10 KB)
Average build-id-all injection took: 12.831 msec (+- 0.071 msec)
Average time per event: 1.258 usec (+- 0.007 usec)
Average memory usage: 11895 KB (+- 10 KB)
$

$ perf stat -r5 perf bench internals inject-build-id
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 14.380 msec (+- 0.056 msec)
Average time per event: 1.410 usec (+- 0.006 usec)
Average memory usage: 12608 KB (+- 11 KB)
Average build-id-all injection took: 11.889 msec (+- 0.064 msec)
Average time per event: 1.166 usec (+- 0.006 usec)
Average memory usage: 11838 KB (+- 10 KB)
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 14.246 msec (+- 0.065 msec)
Average time per event: 1.397 usec (+- 0.006 usec)
Average memory usage: 12744 KB (+- 10 KB)
Average build-id-all injection took: 12.019 msec (+- 0.066 msec)
Average time per event: 1.178 usec (+- 0.006 usec)
Average memory usage: 11963 KB (+- 10 KB)
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 14.321 msec (+- 0.067 msec)
Average time per event: 1.404 usec (+- 0.007 usec)
Average memory usage: 12690 KB (+- 10 KB)
Average build-id-all injection took: 11.909 msec (+- 0.041 msec)
Average time per event: 1.168 usec (+- 0.004 usec)
Average memory usage: 11938 KB (+- 10 KB)
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 14.287 msec (+- 0.059 msec)
Average time per event: 1.401 usec (+- 0.006 usec)
Average memory usage: 12864 KB (+- 10 KB)
Average build-id-all injection took: 11.862 msec (+- 0.058 msec)
Average time per event: 1.163 usec (+- 0.006 usec)
Average memory usage: 12103 KB (+- 10 KB)
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 14.402 msec (+- 0.053 msec)
Average time per event: 1.412 usec (+- 0.005 usec)
Average memory usage: 12876 KB (+- 10 KB)
Average build-id-all injection took: 11.826 msec (+- 0.061 msec)
Average time per event: 1.159 usec (+- 0.006 usec)
Average memory usage: 12111 KB (+- 10 KB)

Performance counter stats for 'perf bench internals inject-build-id' (5 runs):

4,267.48 msec task-clock:u # 1.502 CPUs utilized ( +- 0.14% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
102,092 page-faults:u # 0.024 M/sec ( +- 0.08% )
3,894,589,578 cycles:u # 0.913 GHz ( +- 0.19% ) (83.49%)
140,078,421 stalled-cycles-frontend:u # 3.60% frontend cycles idle ( +- 0.77% ) (83.34%)
948,581,189 stalled-cycles-backend:u # 24.36% backend cycles idle ( +- 0.46% ) (83.25%)
5,835,587,719 instructions:u # 1.50 insn per cycle
# 0.16 stalled cycles per insn ( +- 0.21% ) (83.24%)
1,267,423,636 branches:u # 296.996 M/sec ( +- 0.22% ) (83.12%)
17,484,290 branch-misses:u # 1.38% of all branches ( +- 0.12% ) (83.55%)

2.84176 +- 0.00222 seconds time elapsed ( +- 0.08% )

$

Acked-by: Jiri Olsa <jolsa@redhat.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20201012070214.2074921-2-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>


# ec6347bb 05-Oct-2020 Dan Williams <dan.j.williams@intel.com>

x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()

In reaction to a proposal to introduce a memcpy_mcsafe_fast()
implementation Linus points out that memcpy_mcsafe() is poorly named
relative to communicating the scope of the interface. Specifically what
addresses are valid to pass as source, destination, and what faults /
exceptions are handled.

Of particular concern is that even though x86 might be able to handle
the semantics of copy_mc_to_user() with its common copy_user_generic()
implementation other archs likely need / want an explicit path for this
case:

On Fri, May 1, 2020 at 11:28 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Thu, Apr 30, 2020 at 6:21 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > However now I see that copy_user_generic() works for the wrong reason.
> > It works because the exception on the source address due to poison
> > looks no different than a write fault on the user address to the
> > caller, it's still just a short copy. So it makes copy_to_user() work
> > for the wrong reason relative to the name.
>
> Right.
>
> And it won't work that way on other architectures. On x86, we have a
> generic function that can take faults on either side, and we use it
> for both cases (and for the "in_user" case too), but that's an
> artifact of the architecture oddity.
>
> In fact, it's probably wrong even on x86 - because it can hide bugs -
> but writing those things is painful enough that everybody prefers
> having just one function.

Replace a single top-level memcpy_mcsafe() with either
copy_mc_to_user(), or copy_mc_to_kernel().

Introduce an x86 copy_mc_fragile() name as the rename for the
low-level x86 implementation formerly named memcpy_mcsafe(). It is used
as the slow / careful backend that is supplanted by a fast
copy_mc_generic() in a follow-on patch.

One side-effect of this reorganization is that separating copy_mc_64.S
to its own file means that perf no longer needs to track dependencies
for its memcpy_64.S benchmarks.

[ bp: Massage a bit. ]

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: <stable@vger.kernel.org>
Link: http://lore.kernel.org/r/CAHk-=wjSqtXAqfUJxFtWNwmguFASTgB0dz1dT3V-78Quiezqbg@mail.gmail.com
Link: https://lkml.kernel.org/r/160195561680.2163339.11574962055305783722.stgit@dwillia2-desk3.amr.corp.intel.com


# 7c43b0c1 29-Jul-2020 Ian Rogers <irogers@google.com>

perf bench: Add benchmark of find_next_bit

for_each_set_bit, or similar functions like for_each_cpu, may be hot
within the kernel. If many bits were set then one could imagine on Intel
a "bt" instruction with every bit may be faster than the function call
and word length find_next_bit logic. Add a benchmark to measure this.

This benchmark on AMD rome and Intel skylakex shows "bt" is not a good
option except for very small bitmaps.

Committer testing:

# perf bench
Usage:
perf bench [<common options>] <collection> <benchmark> [<options>]

# List of all available benchmark collections:

sched: Scheduler and IPC benchmarks
syscall: System call benchmarks
mem: Memory access benchmarks
numa: NUMA scheduling and MM benchmarks
futex: Futex stressing benchmarks
epoll: Epoll stressing benchmarks
internals: Perf-internals benchmarks
all: All benchmarks

# perf bench mem

# List of available benchmarks for collection 'mem':

memcpy: Benchmark for memcpy() functions
memset: Benchmark for memset() functions
find_bit: Benchmark for find_bit() functions
all: Run all memory access benchmarks

# perf bench mem find_bit
# Running 'mem/find_bit' benchmark:
100000 operations 1 bits set of 1 bits
Average for_each_set_bit took: 730.200 usec (+- 6.468 usec)
Average test_bit loop took: 366.200 usec (+- 4.652 usec)
100000 operations 1 bits set of 2 bits
Average for_each_set_bit took: 781.000 usec (+- 24.247 usec)
Average test_bit loop took: 550.200 usec (+- 4.152 usec)
100000 operations 2 bits set of 2 bits
Average for_each_set_bit took: 1113.400 usec (+- 112.340 usec)
Average test_bit loop took: 1098.500 usec (+- 182.834 usec)
100000 operations 1 bits set of 4 bits
Average for_each_set_bit took: 843.800 usec (+- 8.772 usec)
Average test_bit loop took: 948.800 usec (+- 10.278 usec)
100000 operations 2 bits set of 4 bits
Average for_each_set_bit took: 1185.800 usec (+- 114.345 usec)
Average test_bit loop took: 1473.200 usec (+- 175.498 usec)
100000 operations 4 bits set of 4 bits
Average for_each_set_bit took: 1769.667 usec (+- 233.177 usec)
Average test_bit loop took: 1864.933 usec (+- 187.470 usec)
100000 operations 1 bits set of 8 bits
Average for_each_set_bit took: 898.000 usec (+- 21.755 usec)
Average test_bit loop took: 1768.400 usec (+- 23.672 usec)
100000 operations 2 bits set of 8 bits
Average for_each_set_bit took: 1244.900 usec (+- 116.396 usec)
Average test_bit loop took: 2201.800 usec (+- 145.398 usec)
100000 operations 4 bits set of 8 bits
Average for_each_set_bit took: 1822.533 usec (+- 231.554 usec)
Average test_bit loop took: 2569.467 usec (+- 168.453 usec)
100000 operations 8 bits set of 8 bits
Average for_each_set_bit took: 2845.100 usec (+- 441.365 usec)
Average test_bit loop took: 3023.300 usec (+- 219.575 usec)
100000 operations 1 bits set of 16 bits
Average for_each_set_bit took: 923.400 usec (+- 17.560 usec)
Average test_bit loop took: 3240.000 usec (+- 16.492 usec)
100000 operations 2 bits set of 16 bits
Average for_each_set_bit took: 1264.300 usec (+- 114.034 usec)
Average test_bit loop took: 3714.400 usec (+- 158.898 usec)
100000 operations 4 bits set of 16 bits
Average for_each_set_bit took: 1817.867 usec (+- 222.199 usec)
Average test_bit loop took: 4015.333 usec (+- 154.162 usec)
100000 operations 8 bits set of 16 bits
Average for_each_set_bit took: 2826.350 usec (+- 433.457 usec)
Average test_bit loop took: 4460.350 usec (+- 210.762 usec)
100000 operations 16 bits set of 16 bits
Average for_each_set_bit took: 4615.600 usec (+- 809.350 usec)
Average test_bit loop took: 5129.960 usec (+- 320.821 usec)
100000 operations 1 bits set of 32 bits
Average for_each_set_bit took: 904.400 usec (+- 14.250 usec)
Average test_bit loop took: 6194.000 usec (+- 29.254 usec)
100000 operations 2 bits set of 32 bits
Average for_each_set_bit took: 1252.700 usec (+- 116.432 usec)
Average test_bit loop took: 6652.400 usec (+- 154.352 usec)
100000 operations 4 bits set of 32 bits
Average for_each_set_bit took: 1824.200 usec (+- 229.133 usec)
Average test_bit loop took: 6961.733 usec (+- 154.682 usec)
100000 operations 8 bits set of 32 bits
Average for_each_set_bit took: 2823.950 usec (+- 432.296 usec)
Average test_bit loop took: 7351.900 usec (+- 193.626 usec)
100000 operations 16 bits set of 32 bits
Average for_each_set_bit took: 4552.560 usec (+- 785.141 usec)
Average test_bit loop took: 7998.360 usec (+- 305.629 usec)
100000 operations 32 bits set of 32 bits
Average for_each_set_bit took: 7557.067 usec (+- 1407.702 usec)
Average test_bit loop took: 9072.400 usec (+- 513.209 usec)
100000 operations 1 bits set of 64 bits
Average for_each_set_bit took: 896.800 usec (+- 14.389 usec)
Average test_bit loop took: 11927.200 usec (+- 68.862 usec)
100000 operations 2 bits set of 64 bits
Average for_each_set_bit took: 1230.400 usec (+- 111.731 usec)
Average test_bit loop took: 12478.600 usec (+- 189.382 usec)
100000 operations 4 bits set of 64 bits
Average for_each_set_bit took: 1844.733 usec (+- 244.826 usec)
Average test_bit loop took: 12911.467 usec (+- 206.246 usec)
100000 operations 8 bits set of 64 bits
Average for_each_set_bit took: 2779.300 usec (+- 413.612 usec)
Average test_bit loop took: 13372.650 usec (+- 239.623 usec)
100000 operations 16 bits set of 64 bits
Average for_each_set_bit took: 4423.920 usec (+- 748.240 usec)
Average test_bit loop took: 13995.800 usec (+- 318.427 usec)
100000 operations 32 bits set of 64 bits
Average for_each_set_bit took: 7580.600 usec (+- 1462.407 usec)
Average test_bit loop took: 15063.067 usec (+- 516.477 usec)
100000 operations 64 bits set of 64 bits
Average for_each_set_bit took: 13391.514 usec (+- 2765.371 usec)
Average test_bit loop took: 16974.914 usec (+- 916.936 usec)
100000 operations 1 bits set of 128 bits
Average for_each_set_bit took: 1153.800 usec (+- 124.245 usec)
Average test_bit loop took: 26959.000 usec (+- 714.047 usec)
100000 operations 2 bits set of 128 bits
Average for_each_set_bit took: 1445.200 usec (+- 113.587 usec)
Average test_bit loop took: 25798.800 usec (+- 512.908 usec)
100000 operations 4 bits set of 128 bits
Average for_each_set_bit took: 1990.933 usec (+- 219.362 usec)
Average test_bit loop took: 25589.400 usec (+- 348.288 usec)
100000 operations 8 bits set of 128 bits
Average for_each_set_bit took: 2963.000 usec (+- 419.487 usec)
Average test_bit loop took: 25690.050 usec (+- 262.025 usec)
100000 operations 16 bits set of 128 bits
Average for_each_set_bit took: 4585.200 usec (+- 741.734 usec)
Average test_bit loop took: 26125.040 usec (+- 274.127 usec)
100000 operations 32 bits set of 128 bits
Average for_each_set_bit took: 7626.200 usec (+- 1404.950 usec)
Average test_bit loop took: 27038.867 usec (+- 442.554 usec)
100000 operations 64 bits set of 128 bits
Average for_each_set_bit took: 13343.371 usec (+- 2686.460 usec)
Average test_bit loop took: 28936.543 usec (+- 883.257 usec)
100000 operations 128 bits set of 128 bits
Average for_each_set_bit took: 23442.950 usec (+- 4880.541 usec)
Average test_bit loop took: 32484.125 usec (+- 1691.931 usec)
100000 operations 1 bits set of 256 bits
Average for_each_set_bit took: 1183.000 usec (+- 32.073 usec)
Average test_bit loop took: 50114.600 usec (+- 198.880 usec)
100000 operations 2 bits set of 256 bits
Average for_each_set_bit took: 1550.000 usec (+- 124.550 usec)
Average test_bit loop took: 50334.200 usec (+- 128.425 usec)
100000 operations 4 bits set of 256 bits
Average for_each_set_bit took: 2164.333 usec (+- 246.359 usec)
Average test_bit loop took: 49959.867 usec (+- 188.035 usec)
100000 operations 8 bits set of 256 bits
Average for_each_set_bit took: 3211.200 usec (+- 454.829 usec)
Average test_bit loop took: 50140.850 usec (+- 176.046 usec)
100000 operations 16 bits set of 256 bits
Average for_each_set_bit took: 5181.640 usec (+- 882.726 usec)
Average test_bit loop took: 51003.160 usec (+- 419.601 usec)
100000 operations 32 bits set of 256 bits
Average for_each_set_bit took: 8369.333 usec (+- 1513.150 usec)
Average test_bit loop took: 52096.700 usec (+- 573.022 usec)
100000 operations 64 bits set of 256 bits
Average for_each_set_bit took: 13866.857 usec (+- 2649.393 usec)
Average test_bit loop took: 53989.600 usec (+- 938.808 usec)
100000 operations 128 bits set of 256 bits
Average for_each_set_bit took: 23588.350 usec (+- 4724.222 usec)
Average test_bit loop took: 57300.625 usec (+- 1625.962 usec)
100000 operations 256 bits set of 256 bits
Average for_each_set_bit took: 42752.200 usec (+- 9202.084 usec)
Average test_bit loop took: 64426.933 usec (+- 3402.326 usec)
100000 operations 1 bits set of 512 bits
Average for_each_set_bit took: 1632.000 usec (+- 229.954 usec)
Average test_bit loop took: 98090.000 usec (+- 1120.435 usec)
100000 operations 2 bits set of 512 bits
Average for_each_set_bit took: 1937.700 usec (+- 148.902 usec)
Average test_bit loop took: 100364.100 usec (+- 1433.219 usec)
100000 operations 4 bits set of 512 bits
Average for_each_set_bit took: 2528.000 usec (+- 243.654 usec)
Average test_bit loop took: 99932.067 usec (+- 955.868 usec)
100000 operations 8 bits set of 512 bits
Average for_each_set_bit took: 3734.100 usec (+- 512.359 usec)
Average test_bit loop took: 98944.750 usec (+- 812.070 usec)
100000 operations 16 bits set of 512 bits
Average for_each_set_bit took: 5551.400 usec (+- 846.605 usec)
Average test_bit loop took: 98691.600 usec (+- 654.753 usec)
100000 operations 32 bits set of 512 bits
Average for_each_set_bit took: 8594.500 usec (+- 1446.072 usec)
Average test_bit loop took: 99176.867 usec (+- 579.990 usec)
100000 operations 64 bits set of 512 bits
Average for_each_set_bit took: 13840.743 usec (+- 2527.055 usec)
Average test_bit loop took: 100758.743 usec (+- 833.865 usec)
100000 operations 128 bits set of 512 bits
Average for_each_set_bit took: 23185.925 usec (+- 4532.910 usec)
Average test_bit loop took: 103786.700 usec (+- 1475.276 usec)
100000 operations 256 bits set of 512 bits
Average for_each_set_bit took: 40322.400 usec (+- 8341.802 usec)
Average test_bit loop took: 109433.378 usec (+- 2742.615 usec)
100000 operations 512 bits set of 512 bits
Average for_each_set_bit took: 71804.540 usec (+- 15436.546 usec)
Average test_bit loop took: 120255.440 usec (+- 5252.777 usec)
100000 operations 1 bits set of 1024 bits
Average for_each_set_bit took: 1859.600 usec (+- 27.969 usec)
Average test_bit loop took: 187676.000 usec (+- 1337.770 usec)
100000 operations 2 bits set of 1024 bits
Average for_each_set_bit took: 2273.600 usec (+- 139.420 usec)
Average test_bit loop took: 188176.000 usec (+- 684.357 usec)
100000 operations 4 bits set of 1024 bits
Average for_each_set_bit took: 2940.400 usec (+- 268.213 usec)
Average test_bit loop took: 189172.600 usec (+- 593.295 usec)
100000 operations 8 bits set of 1024 bits
Average for_each_set_bit took: 4224.200 usec (+- 547.933 usec)
Average test_bit loop took: 190257.250 usec (+- 621.021 usec)
100000 operations 16 bits set of 1024 bits
Average for_each_set_bit took: 6090.560 usec (+- 877.975 usec)
Average test_bit loop took: 190143.880 usec (+- 503.753 usec)
100000 operations 32 bits set of 1024 bits
Average for_each_set_bit took: 9178.800 usec (+- 1475.136 usec)
Average test_bit loop took: 190757.100 usec (+- 494.757 usec)
100000 operations 64 bits set of 1024 bits
Average for_each_set_bit took: 14441.457 usec (+- 2545.497 usec)
Average test_bit loop took: 192299.486 usec (+- 795.251 usec)
100000 operations 128 bits set of 1024 bits
Average for_each_set_bit took: 23623.825 usec (+- 4481.182 usec)
Average test_bit loop took: 194885.550 usec (+- 1300.817 usec)
100000 operations 256 bits set of 1024 bits
Average for_each_set_bit took: 40194.956 usec (+- 8109.056 usec)
Average test_bit loop took: 200259.311 usec (+- 2566.085 usec)
100000 operations 512 bits set of 1024 bits
Average for_each_set_bit took: 70983.560 usec (+- 15074.982 usec)
Average test_bit loop took: 210527.460 usec (+- 4968.980 usec)
100000 operations 1024 bits set of 1024 bits
Average for_each_set_bit took: 136530.345 usec (+- 31584.400 usec)
Average test_bit loop took: 233329.691 usec (+- 10814.036 usec)
100000 operations 1 bits set of 2048 bits
Average for_each_set_bit took: 3077.600 usec (+- 76.376 usec)
Average test_bit loop took: 402154.400 usec (+- 518.571 usec)
100000 operations 2 bits set of 2048 bits
Average for_each_set_bit took: 3508.600 usec (+- 148.350 usec)
Average test_bit loop took: 403814.500 usec (+- 1133.027 usec)
100000 operations 4 bits set of 2048 bits
Average for_each_set_bit took: 4219.333 usec (+- 285.844 usec)
Average test_bit loop took: 404312.533 usec (+- 985.751 usec)
100000 operations 8 bits set of 2048 bits
Average for_each_set_bit took: 5670.550 usec (+- 615.238 usec)
Average test_bit loop took: 405321.800 usec (+- 1038.487 usec)
100000 operations 16 bits set of 2048 bits
Average for_each_set_bit took: 7785.080 usec (+- 992.522 usec)
Average test_bit loop took: 406746.160 usec (+- 1015.478 usec)
100000 operations 32 bits set of 2048 bits
Average for_each_set_bit took: 11163.800 usec (+- 1627.320 usec)
Average test_bit loop took: 406124.267 usec (+- 898.785 usec)
100000 operations 64 bits set of 2048 bits
Average for_each_set_bit took: 16964.629 usec (+- 2806.130 usec)
Average test_bit loop took: 406618.514 usec (+- 798.356 usec)
100000 operations 128 bits set of 2048 bits
Average for_each_set_bit took: 27219.625 usec (+- 4988.458 usec)
Average test_bit loop took: 410149.325 usec (+- 1705.641 usec)
100000 operations 256 bits set of 2048 bits
Average for_each_set_bit took: 45138.578 usec (+- 8831.021 usec)
Average test_bit loop took: 415462.467 usec (+- 2725.418 usec)
100000 operations 512 bits set of 2048 bits
Average for_each_set_bit took: 77450.540 usec (+- 15962.238 usec)
Average test_bit loop took: 426089.180 usec (+- 5171.788 usec)
100000 operations 1024 bits set of 2048 bits
Average for_each_set_bit took: 138023.636 usec (+- 29826.959 usec)
Average test_bit loop took: 446346.636 usec (+- 9904.417 usec)
100000 operations 2048 bits set of 2048 bits
Average for_each_set_bit took: 251072.600 usec (+- 55947.692 usec)
Average test_bit loop took: 484855.983 usec (+- 18970.431 usec)
#

Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lore.kernel.org/lkml/20200729220034.1337168-1-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>


# c2a08203 08-Mar-2019 Davidlohr Bueso <dave@stgolabs.net>

perf bench: Add basic syscall benchmark

The usefulness of having a standard way of testing syscall performance
has come up from time to time[0]. Furthermore, some of our testing
machinery (such as 'mmtests') already makes use of a simplified version
of the microbenchmark. This patch mainly takes the same idea to measure
syscall throughput compatible with 'perf-bench' via getppid(2), yet
without any of the additional template stuff from Ingo's version (based
on numa.c). The code is identical to what mmtests uses.

[0] https://lore.kernel.org/lkml/20160201074156.GA27156@gmail.com/

Committer notes:

Add mising stdlib.h and unistd.h to get the prototypes for exit() and
getppid().

Committer testing:

$ perf bench
Usage:
perf bench [<common options>] <collection> <benchmark> [<options>]

# List of all available benchmark collections:

sched: Scheduler and IPC benchmarks
syscall: System call benchmarks
mem: Memory access benchmarks
numa: NUMA scheduling and MM benchmarks
futex: Futex stressing benchmarks
epoll: Epoll stressing benchmarks
internals: Perf-internals benchmarks
all: All benchmarks

$
$ perf bench syscall

# List of available benchmarks for collection 'syscall':

basic: Benchmark for basic getppid(2) calls
all: Run all syscall benchmarks

$ perf bench syscall basic
# Running 'syscall/basic' benchmark:
# Executed 10000000 getppid() calls
Total time: 3.679 [sec]

0.367957 usecs/op
2717708 ops/sec
$ perf bench syscall all
# Running syscall/basic benchmark...
# Executed 10000000 getppid() calls
Total time: 3.644 [sec]

0.364456 usecs/op
2743815 ops/sec

$

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lore.kernel.org/lkml/20190308181747.l36zqz2avtivrr3c@linux-r8p5
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>


# 51876bd4 01-May-2020 Ian Rogers <irogers@google.com>

perf bench: Add kallsyms parsing

Add a benchmark for kallsyms parsing. Example output:

Running 'internals/kallsyms-parse' benchmark:
Average kallsyms__parse took: 103.971 ms (+- 0.121 ms)

Committer testing:

Test Machine: AMD Ryzen 5 3600X 6-Core Processor

[root@five ~]# perf bench internals kallsyms-parse
# Running 'internals/kallsyms-parse' benchmark:
Average kallsyms__parse took: 79.692 ms (+- 0.101 ms)
[root@five ~]# perf stat -r5 perf bench internals kallsyms-parse
# Running 'internals/kallsyms-parse' benchmark:
Average kallsyms__parse took: 80.563 ms (+- 0.079 ms)
# Running 'internals/kallsyms-parse' benchmark:
Average kallsyms__parse took: 81.046 ms (+- 0.155 ms)
# Running 'internals/kallsyms-parse' benchmark:
Average kallsyms__parse took: 80.874 ms (+- 0.104 ms)
# Running 'internals/kallsyms-parse' benchmark:
Average kallsyms__parse took: 81.173 ms (+- 0.133 ms)
# Running 'internals/kallsyms-parse' benchmark:
Average kallsyms__parse took: 81.169 ms (+- 0.074 ms)

Performance counter stats for 'perf bench internals kallsyms-parse' (5 runs):

8,093.54 msec task-clock # 0.999 CPUs utilized ( +- 0.14% )
3,165 context-switches # 0.391 K/sec ( +- 0.18% )
10 cpu-migrations # 0.001 K/sec ( +- 23.13% )
744 page-faults # 0.092 K/sec ( +- 0.21% )
34,551,564,954 cycles # 4.269 GHz ( +- 0.05% ) (83.33%)
1,160,584,308 stalled-cycles-frontend # 3.36% frontend cycles idle ( +- 1.60% ) (83.33%)
14,974,323,985 stalled-cycles-backend # 43.34% backend cycles idle ( +- 0.24% ) (83.33%)
58,712,905,705 instructions # 1.70 insn per cycle
# 0.26 stalled cycles per insn ( +- 0.01% ) (83.34%)
14,136,433,778 branches # 1746.632 M/sec ( +- 0.01% ) (83.33%)
141,943,217 branch-misses # 1.00% of all branches ( +- 0.04% ) (83.33%)

8.1040 +- 0.0115 seconds time elapsed ( +- 0.14% )

[root@five ~]#

Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lore.kernel.org/lkml/20200501221315.54715-2-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>


# 2a4b5166 02-Apr-2020 Ian Rogers <irogers@google.com>

perf bench: Add event synthesis benchmark

Event synthesis may occur at the start or end (tail) of a perf command.
In system-wide mode it can scan every process in /proc, which may add
seconds of latency before event recording. Add a new benchmark that
times how long event synthesis takes with and without data synthesis.

An example execution looks like:

$ perf bench internals synthesize
# Running 'internals/synthesize' benchmark:
Average synthesis took: 168.253800 usec
Average data synthesis took: 208.104700 usec

Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andrey Zhizhikin <andrey.z@gmail.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lore.kernel.org/lkml/20200402154357.107873-2-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>


# 231457ec 06-Nov-2018 Davidlohr Bueso <dave@stgolabs.net>

perf bench: Add epoll_ctl(2) benchmark

Benchmark the various operations allowed for epoll_ctl(2). The idea is
to concurrently stress a single epoll instance doing add/mod/del
operations.

Committer testing:

# perf bench epoll ctl
# Running 'epoll/ctl' benchmark:
Run summary [PID 20344]: 4 threads doing epoll_ctl ops 64 file-descriptors for 8 secs.

[thread 0] fdmap: 0x21a46b0 ... 0x21a47ac [ add: 1680960 ops; mod: 1680960 ops; del: 1680960 ops ]
[thread 1] fdmap: 0x21a4960 ... 0x21a4a5c [ add: 1685440 ops; mod: 1685440 ops; del: 1685440 ops ]
[thread 2] fdmap: 0x21a4c10 ... 0x21a4d0c [ add: 1674368 ops; mod: 1674368 ops; del: 1674368 ops ]
[thread 3] fdmap: 0x21a4ec0 ... 0x21a4fbc [ add: 1677568 ops; mod: 1677568 ops; del: 1677568 ops ]

Averaged 1679584 ADD operations (+- 0.14%)
Averaged 1679584 MOD operations (+- 0.14%)
Averaged 1679584 DEL operations (+- 0.14%)
#

Lets measure those calls with 'perf trace' to get a glympse at what this
benchmark is doing in terms of syscalls:

# perf trace -m32768 -s perf bench epoll ctl
# Running 'epoll/ctl' benchmark:
Run summary [PID 20405]: 4 threads doing epoll_ctl ops 64 file-descriptors for 8 secs.

[thread 0] fdmap: 0x21764e0 ... 0x21765dc [ add: 1100480 ops; mod: 1100480 ops; del: 1100480 ops ]
[thread 1] fdmap: 0x2176790 ... 0x217688c [ add: 1250176 ops; mod: 1250176 ops; del: 1250176 ops ]
[thread 2] fdmap: 0x2176a40 ... 0x2176b3c [ add: 1022464 ops; mod: 1022464 ops; del: 1022464 ops ]
[thread 3] fdmap: 0x2176cf0 ... 0x2176dec [ add: 705472 ops; mod: 705472 ops; del: 705472 ops ]

Averaged 1019648 ADD operations (+- 11.27%)
Averaged 1019648 MOD operations (+- 11.27%)
Averaged 1019648 DEL operations (+- 11.27%)

Summary of events:

epoll-ctl (20405), 1264 events, 0.0%

syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
eventfd2 256 9.514 0.001 0.037 5.243 68.00%
clone 4 1.245 0.204 0.311 0.531 24.13%
mprotect 66 0.345 0.002 0.005 0.021 7.43%
openat 45 0.313 0.004 0.007 0.073 21.93%
mmap 88 0.302 0.002 0.003 0.013 5.02%
futex 4 0.160 0.002 0.040 0.140 83.43%
sched_setaffinity 4 0.124 0.005 0.031 0.070 49.39%
read 44 0.103 0.001 0.002 0.013 15.54%
fstat 40 0.052 0.001 0.001 0.003 5.43%
close 39 0.039 0.001 0.001 0.001 1.48%
stat 9 0.034 0.003 0.004 0.006 7.30%
access 3 0.023 0.007 0.008 0.008 4.25%
open 2 0.021 0.008 0.011 0.013 22.60%
getdents 4 0.019 0.001 0.005 0.009 37.15%
write 2 0.013 0.004 0.007 0.009 38.48%
munmap 1 0.010 0.010 0.010 0.010 0.00%
brk 3 0.006 0.001 0.002 0.003 26.34%
rt_sigprocmask 2 0.004 0.001 0.002 0.003 43.95%
rt_sigaction 3 0.004 0.001 0.001 0.002 16.07%
prlimit64 3 0.004 0.001 0.001 0.001 5.39%
prctl 1 0.003 0.003 0.003 0.003 0.00%
epoll_create 1 0.003 0.003 0.003 0.003 0.00%
lseek 2 0.002 0.001 0.001 0.001 11.42%
sched_getaffinity 1 0.002 0.002 0.002 0.002 0.00%
arch_prctl 1 0.002 0.002 0.002 0.002 0.00%
set_tid_address 1 0.001 0.001 0.001 0.001 0.00%
getpid 1 0.001 0.001 0.001 0.001 0.00%
set_robust_list 1 0.001 0.001 0.001 0.001 0.00%
execve 1 0.000 0.000 0.000 0.000 0.00%

epoll-ctl (20406), 1245480 events, 14.6%

syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
epoll_ctl 619511 1034.927 0.001 0.002 6.691 0.67%
nanosleep 3226 616.114 0.006 0.191 10.376 7.57%
futex 2 11.336 0.002 5.668 11.334 99.97%
set_robust_list 1 0.001 0.001 0.001 0.001 0.00%
clone 1 0.000 0.000 0.000 0.000 0.00%

epoll-ctl (20407), 1243151 events, 14.5%

syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
epoll_ctl 618350 1042.181 0.001 0.002 2.512 0.40%
nanosleep 3220 366.261 0.012 0.114 18.162 9.59%
futex 4 5.463 0.001 1.366 5.427 99.12%
set_robust_list 1 0.002 0.002 0.002 0.002 0.00%

epoll-ctl (20408), 1801690 events, 21.1%

syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
epoll_ctl 896174 1540.581 0.001 0.002 6.987 0.74%
nanosleep 4667 783.393 0.006 0.168 10.419 7.10%
futex 2 4.682 0.002 2.341 4.681 99.93%
set_robust_list 1 0.002 0.002 0.002 0.002 0.00%
clone 1 0.000 0.000 0.000 0.000 0.00%

epoll-ctl (20409), 4254890 events, 49.8%

syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
epoll_ctl 2116416 3768.097 0.001 0.002 9.956 0.41%
nanosleep 11023 1141.778 0.006 0.104 9.447 4.95%
futex 3 0.037 0.002 0.012 0.029 70.50%
set_robust_list 1 0.008 0.008 0.008 0.008 0.00%
madvise 1 0.005 0.005 0.005 0.005 0.00%
clone 1 0.000 0.000 0.000 0.000 0.00%
#

Committer notes:

Fix build on fedora:24-x-ARC-uClibc, debian:experimental-x-mips,
debian:experimental-x-mipsel, ubuntu:16.04-x-arm and ubuntu:16.04-x-powerpc

CC /tmp/build/perf/bench/epoll-ctl.o
bench/epoll-ctl.c: In function 'init_fdmaps':
bench/epoll-ctl.c:214:16: error: comparison between signed and unsigned integer expressions [-Werror=sign-compare]
for (i = 0; i < nfds; i+=inc) {
^
bench/epoll-ctl.c: In function 'bench_epoll_ctl':
bench/epoll-ctl.c:377:16: error: comparison between signed and unsigned integer expressions [-Werror=sign-compare]
for (i = 0; i < nthreads; i++) {
^
bench/epoll-ctl.c:388:16: error: comparison between signed and unsigned integer expressions [-Werror=sign-compare]
for (i = 0; i < nthreads; i++) {
^
cc1: all warnings being treated as errors

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Link: http://lkml.kernel.org/r/20181106152226.20883-3-dave@stgolabs.net
[ Use inttypes.h to print rlim_t fields, fixing the build on Alpine Linux / musl libc ]
[ Check if eventfd() is available, i.e. if HAVE_EVENTFD is defined ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>


# 121dd9ea 06-Nov-2018 Davidlohr Bueso <dave@stgolabs.net>

perf bench: Add epoll parallel epoll_wait benchmark

This program benchmarks concurrent epoll_wait(2) for file descriptors
that are monitored with with EPOLLIN along various semantics, by a
single epoll instance. Such conditions can be found when using
single/combined or multiple queuing when load balancing.

Each thread has a number of private, nonblocking file descriptors,
referred to as fdmap. A writer thread will constantly be writing to the
fdmaps of all threads, minimizing each threads's chances of epoll_wait
not finding any ready read events and blocking as this is not what we
want to stress. Full details in the start of the C file.

Committer testing:

# perf bench
Usage:
perf bench [<common options>] <collection> <benchmark> [<options>]

# List of all available benchmark collections:

sched: Scheduler and IPC benchmarks
mem: Memory access benchmarks
numa: NUMA scheduling and MM benchmarks
futex: Futex stressing benchmarks
epoll: Epoll stressing benchmarks
all: All benchmarks

# perf bench epoll

# List of available benchmarks for collection 'epoll':

wait: Benchmark epoll concurrent epoll_waits
all: Run all futex benchmarks

# perf bench epoll wait
# Running 'epoll/wait' benchmark:
Run summary [PID 19295]: 3 threads monitoring on 64 file-descriptors for 8 secs.

[thread 0] fdmap: 0xdaa650 ... 0xdaa74c [ 328241 ops/sec ]
[thread 1] fdmap: 0xdaa900 ... 0xdaa9fc [ 351695 ops/sec ]
[thread 2] fdmap: 0xdaabb0 ... 0xdaacac [ 381423 ops/sec ]

Averaged 353786 operations/sec (+- 4.35%), total secs = 8
#

Committer notes:

Fix the build on debian:experimental-x-mips, debian:experimental-x-mipsel
and others:

CC /tmp/build/perf/bench/epoll-wait.o
bench/epoll-wait.c: In function 'writerfn':
bench/epoll-wait.c:399:12: error: format '%ld' expects argument of type 'long int', but argument 2 has type 'size_t' {aka 'unsigned int'} [-Werror=format=]
printinfo("exiting writer-thread (total full-loops: %ld)\n", iter);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~
bench/epoll-wait.c:86:31: note: in definition of macro 'printinfo'
do { if (__verbose) { printf(fmt, ## arg); fflush(stdout); } } while (0)
^~~
cc1: all warnings being treated as errors

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com> <jbaron@akamai.com>
Link: http://lkml.kernel.org/r/20181106152226.20883-2-dave@stgolabs.net
Link: http://lkml.kernel.org/r/20181106182349.thdkpvshkna5vd7o@linux-r8p5>
[ Applied above fixup as per Davidlohr's request ]
[ Use inttypes.h to print rlim_t fields, fixing the build on Alpine Linux / musl libc ]
[ Check if eventfd() is available, i.e. if HAVE_EVENTFD is defined ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>


# 1f27a050 29-Jul-2018 Arnaldo Carvalho de Melo <acme@redhat.com>

tools arch: Update arch/x86/lib/memcpy_64.S copy used in 'perf bench mem memcpy'

To cope with the changes in:

12c89130a56a ("x86/asm/memcpy_mcsafe: Add write-protection-fault handling")
60622d68227d ("x86/asm/memcpy_mcsafe: Return bytes remaining")
bd131544aa7e ("x86/asm/memcpy_mcsafe: Add labels for __memcpy_mcsafe() write fault handling")
da7bc9c57eb0 ("x86/asm/memcpy_mcsafe: Remove loop unrolling")

This needed introducing a file with a copy of the mcsafe_handle_tail()
function, that is used in the new memcpy_64.S file, as well as a dummy
mcsafe_test.h header.

Testing it:

$ nm ~/bin/perf | grep mcsafe
0000000000484130 T mcsafe_handle_tail
0000000000484300 T __memcpy_mcsafe
$
$ perf bench mem memcpy
# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 1MB bytes ...

44.389205 GB/sec
# function 'x86-64-unrolled' (unrolled memcpy() in arch/x86/lib/memcpy_64.S)
# Copying 1MB bytes ...

22.710756 GB/sec
# function 'x86-64-movsq' (movsq-based memcpy() in arch/x86/lib/memcpy_64.S)
# Copying 1MB bytes ...

42.459239 GB/sec
# function 'x86-64-movsb' (movsb-based memcpy() in arch/x86/lib/memcpy_64.S)
# Copying 1MB bytes ...

42.459239 GB/sec
$

This silences this perf tools build warning:

Warning: Kernel ABI header at 'tools/arch/x86/lib/memcpy_64.S' differs from latest version at 'arch/x86/lib/memcpy_64.S'

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mika Penttilä <mika.penttila@nextfour.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-igdpciheradk3gb3qqal52d0@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>


# 9b2fa7f3 19-Oct-2015 Ingo Molnar <mingo@kernel.org>

perf bench: Rename 'mem-memcpy.c' => 'mem-functions.c'

So mem-memcpy.c started out as a simple memcpy() benchmark, then it grew
memset() functionality and now I plan to add string copy benchmarks as
well.

This makes the file name a misnomer: rename it to the more generic
mem-functions.c name.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: David Ahern <dsahern@gmail.com>
Cc: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1445241870-24854-5-git-send-email-mingo@kernel.org
[ The "rename" was introducing __unused, wasn't removing the old file,
and didn't update tools/perf/bench/Build, fix it ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>


# d2f3f5d2 07-Jul-2015 Davidlohr Bueso <dave@stgolabs.net>

perf bench futex: Add lock_pi stresser

Allows a way of measuring low level kernel implementation of FUTEX_LOCK_PI and
FUTEX_UNLOCK_PI.

The program comes in two flavors:

(i) single futex (default), all threads contend on the same uaddr. For the
sake of the benchmark, we call into kernel space even when the lock is
uncontended. The kernel will set it to TID, any waters that come in and
contend for the pi futex will be handled respectively by the kernel.

(ii) -M option for multiple futexes, each thread deals with its own futex. This
is a trivial scenario and only measures kernel handling of 0->TID transition.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Mel Gorman <mgorman@suse.de>
Link: http://lkml.kernel.org/r/1436259353.12255.78.camel@stgolabs.net
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>


# d65817b4 08-May-2015 Davidlohr Bueso <dave@stgolabs.net>

perf bench futex: Support parallel waker threads

The futex-wake benchmark only measures wakeups done within a single
process. While this has value in its own, it does not really generate
any hb->lock contention.

A new benchmark 'wake-parallel' is added, by extending the futex-wake
code such that we can measure parallel waker threads. The program output
shows the avg per-thread latency in order to complete its share of
wakeups:

Run summary [PID 13474]: blocking on 512 threads (at [private] futex 0xa88668), 8 threads waking up 64 at a time.

[Run 1]: Avg per-thread latency (waking 64/512 threads) in 0.6230 ms (+-15.31%)
[Run 2]: Avg per-thread latency (waking 64/512 threads) in 0.5175 ms (+-29.95%)
[Run 3]: Avg per-thread latency (waking 64/512 threads) in 0.7578 ms (+-18.03%)
[Run 4]: Avg per-thread latency (waking 64/512 threads) in 0.8944 ms (+-12.54%)
[Run 5]: Avg per-thread latency (waking 64/512 threads) in 1.1204 ms (+-23.85%)
Avg per-thread latency (waking 64/512 threads) in 0.7826 ms (+-9.91%)

Naturally, different combinations of numbers of blocking and waker
threads will exhibit different information.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Link: http://lkml.kernel.org/r/1431110280-20231-1-git-send-email-dave@stgolabs.net
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>


# 72965b87 29-Dec-2014 Jiri Olsa <jolsa@kernel.org>

perf build: Add bench objects building

Move bench objects building under build framework and enable perf-in.o
rule.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Tested-by: Will Deacon <will.deacon@arm.com>
Cc: Alexis Berlemont <alexis.berlemont@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/n/tip-b0gxubmn3qjabaq0lune53y3@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>