History log of /linux-master/kernel/trace/Makefile
Revision Date Author Comments
# ebeed8d4 22-Aug-2023 Masami Hiramatsu (Google) <mhiramat@kernel.org>

tracing/probes: Move finding func-proto API and getting func-param API to trace_btf

Move generic function-proto find API and getting function parameter API
to BTF library code from trace_probe.c. This will avoid redundant efforts
on different feature.

Link: https://lore.kernel.org/all/169272155255.160970.719426926348706349.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>


# 334e5519 06-Jun-2023 Masami Hiramatsu (Google) <mhiramat@kernel.org>

tracing/probes: Add fprobe events for tracing function entry and exit.

Add fprobe events for tracing function entry and exit instead of kprobe
events. With this change, we can continue to trace function entry/exit
even if the CONFIG_KPROBES_ON_FTRACE is not available. Since
CONFIG_KPROBES_ON_FTRACE requires the CONFIG_DYNAMIC_FTRACE_WITH_REGS,
it is not available if the architecture only supports
CONFIG_DYNAMIC_FTRACE_WITH_ARGS. And that means kprobe events can not
probe function entry/exit effectively on such architecture.
But this can be solved if the dynamic events supports fprobe events.

The fprobe event is a new dynamic events which is only for the function
(symbol) entry and exit. This event accepts non register fetch arguments
so that user can trace the function arguments and return values.

The fprobe events syntax is here;

f[:[GRP/][EVENT]] FUNCTION [FETCHARGS]
f[MAXACTIVE][:[GRP/][EVENT]] FUNCTION%return [FETCHARGS]

E.g.

# echo 'f vfs_read $arg1' >> dynamic_events
# echo 'f vfs_read%return $retval' >> dynamic_events
# cat dynamic_events
f:fprobes/vfs_read__entry vfs_read arg1=$arg1
f:fprobes/vfs_read__exit vfs_read%return arg1=$retval
# echo 1 > events/fprobes/enable
# head -n 20 trace | tail
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
sh-142 [005] ...1. 448.386420: vfs_read__entry: (vfs_read+0x4/0x340) arg1=0xffff888007f7c540
sh-142 [005] ..... 448.386436: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=0x1
sh-142 [005] ...1. 448.386451: vfs_read__entry: (vfs_read+0x4/0x340) arg1=0xffff888007f7c540
sh-142 [005] ..... 448.386458: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=0x1
sh-142 [005] ...1. 448.386469: vfs_read__entry: (vfs_read+0x4/0x340) arg1=0xffff888007f7c540
sh-142 [005] ..... 448.386476: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=0x1
sh-142 [005] ...1. 448.602073: vfs_read__entry: (vfs_read+0x4/0x340) arg1=0xffff888007f7c540
sh-142 [005] ..... 448.602089: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=0x1

Link: https://lore.kernel.org/all/168507469754.913472.6112857614708350210.stgit@mhiramat.roam.corp.google.com/

Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/all/202302011530.7vm4O8Ro-lkp@intel.com/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>


# 102227b9 29-Jul-2022 Daniel Bristot de Oliveira <bristot@kernel.org>

rv: Add Runtime Verification (RV) interface

RV is a lightweight (yet rigorous) method that complements classical
exhaustive verification techniques (such as model checking and
theorem proving) with a more practical approach to complex systems.

RV works by analyzing the trace of the system's actual execution,
comparing it against a formal specification of the system behavior.
RV can give precise information on the runtime behavior of the
monitored system while enabling the reaction for unexpected
events, avoiding, for example, the propagation of a failure on
safety-critical systems.

The development of this interface roots in the development of the
paper:

De Oliveira, Daniel Bristot; Cucinotta, Tommaso; De Oliveira, Romulo
Silva. Efficient formal verification for the Linux kernel. In:
International Conference on Software Engineering and Formal Methods.
Springer, Cham, 2019. p. 315-332.

And:

De Oliveira, Daniel Bristot. Automata-based formal analysis
and verification of the real-time Linux kernel. PhD Thesis, 2020.

The RV interface resembles the tracing/ interface on purpose. The current
path for the RV interface is /sys/kernel/tracing/rv/.

It presents these files:

"available_monitors"
- List the available monitors, one per line.

For example:
# cat available_monitors
wip
wwnr

"enabled_monitors"
- Lists the enabled monitors, one per line;
- Writing to it enables a given monitor;
- Writing a monitor name with a '!' prefix disables it;
- Truncating the file disables all enabled monitors.

For example:
# cat enabled_monitors
# echo wip > enabled_monitors
# echo wwnr >> enabled_monitors
# cat enabled_monitors
wip
wwnr
# echo '!wip' >> enabled_monitors
# cat enabled_monitors
wwnr
# echo > enabled_monitors
# cat enabled_monitors
#

Note that more than one monitor can be enabled concurrently.

"monitoring_on"
- It is an on/off general switcher for monitoring. Note
that it does not disable enabled monitors or detach events,
but stop the per-entity monitors of monitoring the events
received from the system. It resembles the "tracing_on" switcher.

"monitors/"
Each monitor will have its one directory inside "monitors/". There
the monitor specific files will be presented.
The "monitors/" directory resembles the "events" directory on
tracefs.

For example:
# cd monitors/wip/
# ls
desc enable
# cat desc
wakeup in preemptive per-cpu testing monitor.
# cat enable
0

For further information, see the comments in the header of
kernel/trace/rv/rv.c from this patch.

Link: https://lkml.kernel.org/r/a4bfe038f50cb047bfb343ad0e12b0e646ab308b.1659052063.git.bristot@kernel.org

Cc: Wim Van Sebroeck <wim@linux-watchdog.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Gabriele Paoloni <gpaoloni@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Tao Zhou <tao.zhou@linux.dev>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-trace-devel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>


# bb5eb8f3 23-May-2022 Congyu Liu <liu3101@purdue.edu>

tracing: Disable kcov on trace_preemptirq.c

Functions in trace_preemptirq.c could be invoked from early interrupt
code that bypasses kcov trace function's in_task() check. Disable kcov
on this file to reduce random code coverage.

Link: https://lkml.kernel.org/r/20220523063033.1778974-1-liu3101@purdue.edu

Acked-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Congyu Liu <liu3101@purdue.edu>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>


# 54ecbe6f 15-Mar-2022 Masami Hiramatsu <mhiramat@kernel.org>

rethook: Add a generic return hook

Add a return hook framework which hooks the function return. Most of the
logic came from the kretprobe, but this is independent from kretprobe.

Note that this is expected to be used with other function entry hooking
feature, like ftrace, fprobe, adn kprobes. Eventually this will replace
the kretprobe (e.g. kprobe + rethook = kretprobe), but at this moment,
this is just an additional hook.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/164735285066.1084943.9259661137330166643.stgit@devnote2


# cad9931f 15-Mar-2022 Masami Hiramatsu <mhiramat@kernel.org>

fprobe: Add ftrace based probe APIs

The fprobe is a wrapper API for ftrace function tracer.
Unlike kprobes, this probes only supports the function entry, but this
can probe multiple functions by one fprobe. The usage is similar, user
will set their callback to fprobe::entry_handler and call
register_fprobe*() with probed functions.
There are 3 registration interfaces,

- register_fprobe() takes filtering patterns of the functin names.
- register_fprobe_ips() takes an array of ftrace-location addresses.
- register_fprobe_syms() takes an array of function names.

The registered fprobes can be unregistered with unregister_fprobe().
e.g.

struct fprobe fp = { .entry_handler = user_handler };
const char *targets[] = { "func1", "func2", "func3"};
...

ret = register_fprobe_syms(&fp, targets, ARRAY_SIZE(targets));

...

unregister_fprobe(&fp);

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/164735283857.1084943.1154436951479395551.stgit@devnote2


# 7f5a08c7 18-Jan-2022 Beau Belgrave <beaub@linux.microsoft.com>

user_events: Add minimal support for trace_event into ftrace

Minimal support for interacting with dynamic events, trace_event and
ftrace. Core outline of flow between user process, ioctl and trace_event
APIs.

User mode processes that wish to use trace events to get data into
ftrace, perf, eBPF, etc are limited to uprobes today. The user events
features enables an ABI for user mode processes to create and write to
trace events that are isolated from kernel level trace events. This
enables a faster path for tracing from user mode data as well as opens
managed code to participate in trace events, where stub locations are
dynamic.

User processes often want to trace only when it's useful. To enable this
a set of pages are mapped into the user process space that indicate the
current state of the user events that have been registered. User
processes can check if their event is hooked to a trace/probe, and if it
is, emit the event data out via the write() syscall.

Two new files are introduced into tracefs to accomplish this:
user_events_status - This file is mmap'd into participating user mode
processes to indicate event status.

user_events_data - This file is opened and register/delete ioctl's are
issued to create/open/delete trace events that can be used for tracing.

The typical scenario is on process start to mmap user_events_status. Processes
then register the events they plan to use via the REG ioctl. The ioctl reads
and updates the passed in user_reg struct. The status_index of the struct is
used to know the byte in the status page to check for that event. The
write_index of the struct is used to describe that event when writing out to
the fd that was used for the ioctl call. The data must always include this
index first when writing out data for an event. Data can be written either by
write() or by writev().

For example, in memory:
int index;
char data[];

Psuedo code example of typical usage:
struct user_reg reg;

int page_fd = open("user_events_status", O_RDWR);
char *page_data = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, page_fd, 0);
close(page_fd);

int data_fd = open("user_events_data", O_RDWR);

reg.size = sizeof(reg);
reg.name_args = (__u64)"test";

ioctl(data_fd, DIAG_IOCSREG, &reg);
int status_id = reg.status_index;
int write_id = reg.write_index;

struct iovec io[2];
io[0].iov_base = &write_id;
io[0].iov_len = sizeof(write_id);
io[1].iov_base = payload;
io[1].iov_len = sizeof(payload);

if (page_data[status_id])
writev(data_fd, io, 2);

User events are also exposed via the dynamic_events tracefs file for
both create and delete. Current status is exposed via the user_events_status
tracefs file.

Simple example to register a user event via dynamic_events:
echo u:test >> dynamic_events
cat dynamic_events
u:test

If an event is hooked to a probe, the probe hooked shows up:
echo 1 > events/user_events/test/enable
cat user_events_status
1:test # Used by ftrace

Active: 1
Busy: 1
Max: 4096

If an event is not hooked to a probe, no probe status shows up:
echo 0 > events/user_events/test/enable
cat user_events_status
1:test

Active: 1
Busy: 0
Max: 4096

Users can describe the trace event format via the following format:
name[:FLAG1[,FLAG2...] [field1[;field2...]]

Each field has the following format:
type name

Example for char array with a size of 20 named msg:
echo 'u:detailed char[20] msg' >> dynamic_events
cat dynamic_events
u:detailed char[20] msg

Data offsets are based on the data written out via write() and will be
updated to reflect the correct offset in the trace_event fields. For dynamic
data it is recommended to use the new __rel_loc data type. This type will be
the same as __data_loc, but the offset is relative to this entry. This allows
user_events to not worry about what common fields are being inserted before
the data.

The above format is valid for both the ioctl and the dynamic_events file.

Link: https://lkml.kernel.org/r/20220118204326.2169-2-beaub@linux.microsoft.com

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>


# 6954e415 23-Sep-2021 Steven Rostedt (VMware) <rostedt@goodmis.org>

tracing: Place trace_pid_list logic into abstract functions

Instead of having the logic that does trace_pid_list open coded, wrap it in
abstract functions. This will allow a rewrite of the logic that implements
the trace_pid_list without affecting the users.

Note, this causes a change in behavior. Every time a pid is written into
the set_*_pid file, it creates a new list and uses RCU to update it. If
pid_max is lowered, but there was a pid currently in the list that was
higher than pid_max, those pids will now be removed on updating the list.
The old behavior kept that from happening.

The rewrite of the pid_list logic will no longer depend on pid_max,
and will return the old behavior.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# 7491e2c4 19-Aug-2021 Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com>

tracing: Add a probe that attaches to trace events

A new dynamic event is introduced: event probe. The event is attached
to an existing tracepoint and uses its fields as arguments. The user
can specify custom format string of the new event, select what tracepoint
arguments will be printed and how to print them.
An event probe is created by writing configuration string in
'dynamic_events' ftrace file:
e[:[SNAME/]ENAME] SYSTEM/EVENT [FETCHARGS] - Set an event probe
-:SNAME/ENAME - Delete an event probe

Where:
SNAME - System name, if omitted 'eprobes' is used.
ENAME - Name of the new event in SNAME, if omitted the SYSTEM_EVENT is used.
SYSTEM - Name of the system, where the tracepoint is defined, mandatory.
EVENT - Name of the tracepoint event in SYSTEM, mandatory.
FETCHARGS - Arguments:
<name>=$<field>[:TYPE] - Fetch given filed of the tracepoint and print
it as given TYPE with given name. Supported
types are:
(u8/u16/u32/u64/s8/s16/s32/s64), basic type
(x8/x16/x32/x64), hexadecimal types
"string", "ustring" and bitfield.

Example, attach an event probe on openat system call and print name of the
file that will be opened:
echo "e:esys/eopen syscalls/sys_enter_openat file=\$filename:string" >> dynamic_events
A new dynamic event is created in events/esys/eopen/ directory. It
can be deleted with:
echo "-:esys/eopen" >> dynamic_events

Filters, triggers and histograms can be attached to the new event, it can
be matched in synthetic events. There is one limitation - an event probe
can not be attached to kprobe, uprobe or another event probe.

Link: https://lkml.kernel.org/r/20210812145805.2292326-1-tz.stoyanov@gmail.com
Link: https://lkml.kernel.org/r/20210819152825.142428383@goodmis.org

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Co-developed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# bce29ac9 22-Jun-2021 Daniel Bristot de Oliveira <bristot@redhat.com>

trace: Add osnoise tracer

In the context of high-performance computing (HPC), the Operating System
Noise (*osnoise*) refers to the interference experienced by an application
due to activities inside the operating system. In the context of Linux,
NMIs, IRQs, SoftIRQs, and any other system thread can cause noise to the
system. Moreover, hardware-related jobs can also cause noise, for example,
via SMIs.

The osnoise tracer leverages the hwlat_detector by running a similar
loop with preemption, SoftIRQs and IRQs enabled, thus allowing all
the sources of *osnoise* during its execution. Using the same approach
of hwlat, osnoise takes note of the entry and exit point of any
source of interferences, increasing a per-cpu interference counter. The
osnoise tracer also saves an interference counter for each source of
interference. The interference counter for NMI, IRQs, SoftIRQs, and
threads is increased anytime the tool observes these interferences' entry
events. When a noise happens without any interference from the operating
system level, the hardware noise counter increases, pointing to a
hardware-related noise. In this way, osnoise can account for any
source of interference. At the end of the period, the osnoise tracer
prints the sum of all noise, the max single noise, the percentage of CPU
available for the thread, and the counters for the noise sources.

Usage

Write the ASCII text "osnoise" into the current_tracer file of the
tracing system (generally mounted at /sys/kernel/tracing).

For example::

[root@f32 ~]# cd /sys/kernel/tracing/
[root@f32 tracing]# echo osnoise > current_tracer

It is possible to follow the trace by reading the trace trace file::

[root@f32 tracing]# cat trace
# tracer: osnoise
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth MAX
# || / SINGLE Interference counters:
# |||| RUNTIME NOISE % OF CPU NOISE +-----------------------------+
# TASK-PID CPU# |||| TIMESTAMP IN US IN US AVAILABLE IN US HW NMI IRQ SIRQ THREAD
# | | | |||| | | | | | | | | | |
<...>-859 [000] .... 81.637220: 1000000 190 99.98100 9 18 0 1007 18 1
<...>-860 [001] .... 81.638154: 1000000 656 99.93440 74 23 0 1006 16 3
<...>-861 [002] .... 81.638193: 1000000 5675 99.43250 202 6 0 1013 25 21
<...>-862 [003] .... 81.638242: 1000000 125 99.98750 45 1 0 1011 23 0
<...>-863 [004] .... 81.638260: 1000000 1721 99.82790 168 7 0 1002 49 41
<...>-864 [005] .... 81.638286: 1000000 263 99.97370 57 6 0 1006 26 2
<...>-865 [006] .... 81.638302: 1000000 109 99.98910 21 3 0 1006 18 1
<...>-866 [007] .... 81.638326: 1000000 7816 99.21840 107 8 0 1016 39 19

In addition to the regular trace fields (from TASK-PID to TIMESTAMP), the
tracer prints a message at the end of each period for each CPU that is
running an osnoise/CPU thread. The osnoise specific fields report:

- The RUNTIME IN USE reports the amount of time in microseconds that
the osnoise thread kept looping reading the time.
- The NOISE IN US reports the sum of noise in microseconds observed
by the osnoise tracer during the associated runtime.
- The % OF CPU AVAILABLE reports the percentage of CPU available for
the osnoise thread during the runtime window.
- The MAX SINGLE NOISE IN US reports the maximum single noise observed
during the runtime window.
- The Interference counters display how many each of the respective
interference happened during the runtime window.

Note that the example above shows a high number of HW noise samples.
The reason being is that this sample was taken on a virtual machine,
and the host interference is detected as a hardware interference.

Tracer options

The tracer has a set of options inside the osnoise directory, they are:

- osnoise/cpus: CPUs at which a osnoise thread will execute.
- osnoise/period_us: the period of the osnoise thread.
- osnoise/runtime_us: how long an osnoise thread will look for noise.
- osnoise/stop_tracing_us: stop the system tracing if a single noise
higher than the configured value happens. Writing 0 disables this
option.
- osnoise/stop_tracing_total_us: stop the system tracing if total noise
higher than the configured value happens. Writing 0 disables this
option.
- tracing_threshold: the minimum delta between two time() reads to be
considered as noise, in us. When set to 0, the default value will
be used, which is currently 5 us.

Additional Tracing

In addition to the tracer, a set of tracepoints were added to
facilitate the identification of the osnoise source.

- osnoise:sample_threshold: printed anytime a noise is higher than
the configurable tolerance_ns.
- osnoise:nmi_noise: noise from NMI, including the duration.
- osnoise:irq_noise: noise from an IRQ, including the duration.
- osnoise:softirq_noise: noise from a SoftIRQ, including the
duration.
- osnoise:thread_noise: noise from a thread, including the duration.

Note that all the values are *net values*. For example, if while osnoise
is running, another thread preempts the osnoise thread, it will start a
thread_noise duration at the start. Then, an IRQ takes place, preempting
the thread_noise, starting a irq_noise. When the IRQ ends its execution,
it will compute its duration, and this duration will be subtracted from
the thread_noise, in such a way as to avoid the double accounting of the
IRQ execution. This logic is valid for all sources of noise.

Here is one example of the usage of these tracepoints::

osnoise/8-961 [008] d.h. 5789.857532: irq_noise: local_timer:236 start 5789.857529929 duration 1845 ns
osnoise/8-961 [008] dNh. 5789.858408: irq_noise: local_timer:236 start 5789.858404871 duration 2848 ns
migration/8-54 [008] d... 5789.858413: thread_noise: migration/8:54 start 5789.858409300 duration 3068 ns
osnoise/8-961 [008] .... 5789.858413: sample_threshold: start 5789.858404555 duration 8723 ns interferences 2

In this example, a noise sample of 8 microseconds was reported in the last
line, pointing to two interferences. Looking backward in the trace, the
two previous entries were about the migration thread running after a
timer IRQ execution. The first event is not part of the noise because
it took place one millisecond before.

It is worth noticing that the sum of the duration reported in the
tracepoints is smaller than eight us reported in the sample_threshold.
The reason roots in the overhead of the entry and exit code that happens
before and after any interference execution. This justifies the dual
approach: measuring thread and tracing.

Link: https://lkml.kernel.org/r/e649467042d60e7b62714c9c6751a56299d15119.1624372313.git.bristot@redhat.com

Cc: Phil Auld <pauld@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Kate Carcia <kcarcia@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
Cc: Clark Willaims <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
[
Made the following functions static:
trace_irqentry_callback()
trace_irqexit_callback()
trace_intel_irqentry_callback()
trace_intel_irqexit_callback()

Added to include/trace.h:
osnoise_arch_register()
osnoise_arch_unregister()

Fixed define logic for LATENCY_FS_NOTIFY

Reported-by: kernel test robot <lkp@intel.com>
]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# 9c0dee54 25-Feb-2021 Alexander Potapenko <glider@google.com>

tracing: add error_report_end trace point

Patch series "Add error_report_end tracepoint to KFENCE and KASAN", v3.

This patchset adds a tracepoint, error_repor_end, that is to be used by
KFENCE, KASAN, and potentially other bug detection tools, when they print
an error report. One of the possible use cases is userspace collection of
kernel error reports: interested parties can subscribe to the tracing
event via tracefs, and get notified when an error report occurs.

This patch (of 3):

Introduce error_report_end tracepoint. It can be used in debugging tools
like KASAN, KFENCE, etc. to provide extensions to the error reporting
mechanisms (e.g. allow tests hook into error reporting, ease error report
collection from production kernels). Another benefit would be making use
of ftrace for debugging or benchmarking the tools themselves.

Should we need it, the tracepoint name leaves us with the possibility to
introduce a complementary error_report_start tracepoint in the future.

Link: https://lkml.kernel.org/r/20210121131915.1331302-1-glider@google.com
Link: https://lkml.kernel.org/r/20210121131915.1331302-2-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Suggested-by: Marco Elver <elver@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 773c1670 05-Nov-2020 Steven Rostedt (VMware) <rostedt@goodmis.org>

ftrace: Add recording of functions that caused recursion

This adds CONFIG_FTRACE_RECORD_RECURSION that will record to a file
"recursed_functions" all the functions that caused recursion while a
callback to the function tracer was running.

Link: https://lkml.kernel.org/r/20201106023548.102375687@goodmis.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Guo Ren <guoren@kernel.org>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-csky@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: live-patching@vger.kernel.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# 15d5761a 07-Jul-2020 Masahiro Yamada <masahiroy@kernel.org>

kbuild: introduce ccflags-remove-y and asflags-remove-y

CFLAGS_REMOVE_<file>.o filters out flags when compiling a particular
object, but there is no convenient way to do that for every object in
a directory.

Add ccflags-remove-y and asflags-remove-y to make it easily.

Use ccflags-remove-y to clean up some Makefiles.

The add/remove order works as follows:

[1] KBUILD_CFLAGS specifies compiler flags used globally

[2] ccflags-y adds compiler flags for all objects in the
current Makefile

[3] ccflags-remove-y removes compiler flags for all objects in the
current Makefile (New feature)

[4] CFLAGS_<file> adds compiler flags per file.

[5] CFLAGS_REMOVE_<file> removes compiler flags per file.

Having [3] before [4] allows us to remove flags from most (but not all)
objects in the current Makefile.

For example, kernel/trace/Makefile removes $(CC_FLAGS_FTRACE)
from all objects in the directory, then adds it back to
trace_selftest_dynamic.o and CFLAGS_trace_kprobe_selftest.o

The same applies to lib/livepatch/Makefile.

Please note ccflags-remove-y has no effect to the sub-directories.
In contrast, the previous notation got rid of compiler flags also from
all the sub-directories.

The following are not affected because they have no sub-directories:

arch/arm/boot/compressed/
arch/powerpc/xmon/
arch/sh/
kernel/trace/

However, lib/ has several sub-directories.

To keep the behavior, I added ccflags-remove-y to all Makefiles
in subdirectories of lib/, except the following:

lib/vdso/Makefile - Kbuild does not descend into this Makefile
lib/raid/test/Makefile - This is not used for the kernel build

I think commit 2464a609ded0 ("ftrace: do not trace library functions")
excluded too much. In the next commit, I will remove ccflags-remove-y
from the sub-directories of lib/.

Suggested-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Brendan Higgins <brendanhiggins@google.com> (KUnit)
Tested-by: Anders Roxell <anders.roxell@linaro.org>


# ac5a72ea 12-Jul-2020 Alan Maguire <alan.maguire@oracle.com>

bpf: Use dedicated bpf_trace_printk event instead of trace_printk()

The bpf helper bpf_trace_printk() uses trace_printk() under the hood.
This leads to an alarming warning message originating from trace
buffer allocation which occurs the first time a program using
bpf_trace_printk() is loaded.

We can instead create a trace event for bpf_trace_printk() and enable
it in-kernel when/if we encounter a program using the
bpf_trace_printk() helper. With this approach, trace_printk()
is not used directly and no warning message appears.

This work was started by Steven (see Link) and finished by Alan; added
Steven's Signed-off-by with his permission.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/r/20200628194334.6238b933@oasis.local.home
Link: https://lore.kernel.org/bpf/1594641154-18897-2-git-send-email-alan.maguire@oracle.com


# 726721a5 28-May-2020 Tom Zanussi <zanussi@kernel.org>

tracing: Move synthetic events to a separate file

With the addition of the in-kernel synthetic event API, synthetic
events are no longer specifically tied to the histogram triggers.

The synthetic event code is also making trace_event_hist.c very
bloated, so for those reasons, move it to a separate file,
trace_events_synth.c, along with a new trace_synth.h header file.

Because synthetic events are now independent from hist triggers, add a
new CONFIG_SYNTH_EVENTS config option, and have CONFIG_HIST_TRIGGERS
select it, and have CONFIG_SYNTH_EVENT_GEN_TEST depend on it.

Link: http://lkml.kernel.org/r/4d1fa1f85ed5982706ac44844ac92451dcb04715.1590693308.git.zanussi@kernel.org

Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# f5d2313b 14-Feb-2020 Marco Elver <elver@google.com>

kcsan, trace: Make KCSAN compatible with tracing

Previously the system would lock up if ftrace was enabled together with
KCSAN. This is due to recursion on reporting if the tracer code is
instrumented with KCSAN.

To avoid this for all types of tracing, disable KCSAN instrumentation
for all of kernel/trace.

Furthermore, since KCSAN relies on udelay() to introduce delay, we have
to disable ftrace for udelay() (currently done for x86) in case KCSAN is
used together with lockdep and ftrace. The reason is that it may corrupt
lockdep IRQ flags tracing state due to a peculiar case of recursion
(details in Makefile comment).

Reported-by: Qian Cai <cai@lca.pw>
Tested-by: Qian Cai <cai@lca.pw>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 64836248 28-Jan-2020 Tom Zanussi <zanussi@kernel.org>

tracing: Add kprobe event command generation test module

Add a test module that checks the basic functionality of the in-kernel
kprobe event command generation API by creating kprobe events from a
module.

Link: http://lkml.kernel.org/r/97e502b204f9dba948e3fa3a4315448298218787.1580323897.git.zanussi@kernel.org

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# 9fe41efa 28-Jan-2020 Tom Zanussi <zanussi@kernel.org>

tracing: Add synth event generation test module

Add a test module that checks the basic functionality of the in-kernel
synthetic event generation API by generating and tracing synthetic
events from a module.

Link: http://lkml.kernel.org/r/fcb4dd9eb9eefb70ab20538d3529d51642389664.1580323897.git.zanussi@kernel.org

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# 9c5b9d3d 10-Jan-2020 Masami Hiramatsu <mhiramat@kernel.org>

tracing/boot: Add boot-time tracing

Setup tracing options via extra boot config in addition to kernel
command line.

This adds following commands support. These are applied to
the global trace instance.

- ftrace.options = OPT1[,OPT2...]
Enable given ftrace options.

- ftrace.trace_clock = CLOCK
Set given CLOCK to ftrace's trace_clock.

- ftrace.buffer_size = SIZE
Configure ftrace buffer size to SIZE. You can use "KB" or "MB"
for that SIZE.

- ftrace.events = EVENT[, EVENT2...]
Enable given events on boot. You can use a wild card in EVENT.

- ftrace.tracer = TRACER
Set TRACER to current tracer on boot. (e.g. function)

Note that this is NOT replacing the kernel parameters, because
this boot config based setting is later than that. If you want to
trace earlier boot events, you still need kernel parameters.

Link: http://lkml.kernel.org/r/157867237723.17873.17494943526320587488.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# 6c3edaf9 29-Nov-2019 Cong Wang <xiyou.wangcong@gmail.com>

tracing: Introduce trace event injection

We have been trying to use rasdaemon to monitor hardware errors like
correctable memory errors. rasdaemon uses trace events to monitor
various hardware errors. In order to test it, we have to inject some
hardware errors, unfortunately not all of them provide error
injections. MCE does provide a way to inject MCE errors, but errors
like PCI error and devlink error don't, it is not easy to add error
injection to each of them. Instead, it is relatively easier to just
allow users to inject trace events in a generic way so that all trace
events can be injected.

This patch introduces trace event injection, where a new 'inject' is
added to each tracepoint directory. Users could write into this file
with key=value pairs to specify the value of each fields of the trace
event, all unspecified fields are set to zero values by default.

For example, for the net/net_dev_queue tracepoint, we can inject:

INJECT=/sys/kernel/debug/tracing/events/net/net_dev_queue/inject
echo "" > $INJECT
echo "name='test'" > $INJECT
echo "name='test' len=1024" > $INJECT
cat /sys/kernel/debug/tracing/trace
...
<...>-614 [000] .... 36.571483: net_dev_queue: dev= skbaddr=00000000fbf338c2 len=0
<...>-614 [001] .... 136.588252: net_dev_queue: dev=test skbaddr=00000000fbf338c2 len=0
<...>-614 [001] .N.. 208.431878: net_dev_queue: dev=test skbaddr=00000000fbf338c2 len=1024

Triggers could be triggered as usual too:

echo "stacktrace if len == 1025" > /sys/kernel/debug/tracing/events/net/net_dev_queue/trigger
echo "len=1025" > $INJECT
cat /sys/kernel/debug/tracing/trace
...
bash-614 [000] .... 36.571483: net_dev_queue: dev= skbaddr=00000000fbf338c2 len=0
bash-614 [001] .... 136.588252: net_dev_queue: dev=test skbaddr=00000000fbf338c2 len=0
bash-614 [001] .N.. 208.431878: net_dev_queue: dev=test skbaddr=00000000fbf338c2 len=1024
bash-614 [001] .N.1 284.236349: <stack trace>
=> event_inject_write
=> vfs_write
=> ksys_write
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe

The only thing that can't be injected is string pointers as they
require constant string pointers, this can't be done at run time.

Link: http://lkml.kernel.org/r/20191130045218.18979-1-xiyou.wangcong@gmail.com

Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# 5448d44c 05-Nov-2018 Masami Hiramatsu <mhiramat@kernel.org>

tracing: Add unified dynamic event framework

Add unified dynamic event framework for ftrace kprobes, uprobes
and synthetic events. Those dynamic events can be co-exist on
same file because those syntax doesn't overlap.

This introduces a framework part which provides a unified tracefs
interface and operations.

Link: http://lkml.kernel.org/r/154140852824.17322.12250362185969352095.stgit@devbox

Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# d864a3ca 12-Nov-2018 Steven Rostedt (VMware) <rostedt@goodmis.org>

fgraph: Create a fgraph.c file to store function graph infrastructure

As the function graph infrastructure can be used by thing other than
tracing, moving the code to its own file out of the trace_functions_graph.c
code makes more sense.

The fgraph.c file will only contain the infrastructure required to hook into
functions and their return code.

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# 6b7dca40 21-Aug-2018 Masami Hiramatsu <mhiramat@kernel.org>

tracing: Allow gcov profiling on only ftrace subsystem

Add GCOV_PROFILE_FTRACE to allow gcov profiling on only files in ftrace
subsystem. This config option will be used for checking kselftest/ftrace
coverage.

Link: http://lkml.kernel.org/r/153483647755.32472.4746349899604275441.stgit@devbox

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# c3bc8fd6 30-Jul-2018 Joel Fernandes (Google) <joel@joelfernandes.org>

tracing: Centralize preemptirq tracepoints and unify their usage

This patch detaches the preemptirq tracepoints from the tracers and
keeps it separate.

Advantages:
* Lockdep and irqsoff event can now run in parallel since they no longer
have their own calls.

* This unifies the usecase of adding hooks to an irqsoff and irqson
event, and a preemptoff and preempton event.
3 users of the events exist:
- Lockdep
- irqsoff and preemptoff tracers
- irqs and preempt trace events

The unification cleans up several ifdefs and makes the code in preempt
tracer and irqsoff tracers simpler. It gets rid of all the horrific
ifdeferry around PROVE_LOCKING and makes configuration of the different
users of the tracepoints more easy and understandable. It also gets rid
of the time_* function calls from the lockdep hooks used to call into
the preemptirq tracer which is not needed anymore. The negative delta in
lines of code in this patch is quite large too.

In the patch we introduce a new CONFIG option PREEMPTIRQ_TRACEPOINTS
as a single point for registering probes onto the tracepoints. With
this,
the web of config options for preempt/irq toggle tracepoints and its
users becomes:

PREEMPT_TRACER PREEMPTIRQ_EVENTS IRQSOFF_TRACER PROVE_LOCKING
| | \ | |
\ (selects) / \ \ (selects) /
TRACE_PREEMPT_TOGGLE ----> TRACE_IRQFLAGS
\ /
\ (depends on) /
PREEMPTIRQ_TRACEPOINTS

Other than the performance tests mentioned in the previous patch, I also
ran the locking API test suite. I verified that all tests cases are
passing.

I also injected issues by not registering lockdep probes onto the
tracepoints and I see failures to confirm that the probes are indeed
working.

This series + lockdep probes not registered (just to inject errors):
[ 0.000000] hard-irqs-on + irq-safe-A/21: ok | ok | ok |
[ 0.000000] soft-irqs-on + irq-safe-A/21: ok | ok | ok |
[ 0.000000] sirq-safe-A => hirqs-on/12:FAILED|FAILED| ok |
[ 0.000000] sirq-safe-A => hirqs-on/21:FAILED|FAILED| ok |
[ 0.000000] hard-safe-A + irqs-on/12:FAILED|FAILED| ok |
[ 0.000000] soft-safe-A + irqs-on/12:FAILED|FAILED| ok |
[ 0.000000] hard-safe-A + irqs-on/21:FAILED|FAILED| ok |
[ 0.000000] soft-safe-A + irqs-on/21:FAILED|FAILED| ok |
[ 0.000000] hard-safe-A + unsafe-B #1/123: ok | ok | ok |
[ 0.000000] soft-safe-A + unsafe-B #1/123: ok | ok | ok |

With this series + lockdep probes registered, all locking tests pass:

[ 0.000000] hard-irqs-on + irq-safe-A/21: ok | ok | ok |
[ 0.000000] soft-irqs-on + irq-safe-A/21: ok | ok | ok |
[ 0.000000] sirq-safe-A => hirqs-on/12: ok | ok | ok |
[ 0.000000] sirq-safe-A => hirqs-on/21: ok | ok | ok |
[ 0.000000] hard-safe-A + irqs-on/12: ok | ok | ok |
[ 0.000000] soft-safe-A + irqs-on/12: ok | ok | ok |
[ 0.000000] hard-safe-A + irqs-on/21: ok | ok | ok |
[ 0.000000] soft-safe-A + irqs-on/21: ok | ok | ok |
[ 0.000000] hard-safe-A + unsafe-B #1/123: ok | ok | ok |
[ 0.000000] soft-safe-A + unsafe-B #1/123: ok | ok | ok |

Link: http://lkml.kernel.org/r/20180730222423.196630-4-joel@joelfernandes.org

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# d899926f 30-Jul-2018 Francis Deslauriers <francis.deslauriers@efficios.com>

selftest/ftrace: Move kprobe selftest function to separate compile unit

Move selftest function to its own compile unit so it can be compiled
with the ftrace cflags (CC_FLAGS_FTRACE) allowing it to be probed
during the ftrace startup tests.

Link: http://lkml.kernel.org/r/153294604271.32740.16490677128630177030.stgit@devbox

Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# f96e8577 12-Jul-2018 Joel Fernandes (Google) <joel@joelfernandes.org>

lib: Add module for testing preemptoff/irqsoff latency tracers

Here we introduce a test module for introducing a long preempt or irq
disable delay in the kernel which the preemptoff or irqsoff tracers can
detect. This module is to be used only for test purposes and is default
disabled.

Following is the expected output (only briefly shown) that can be parsed
to verify that the tracers are working correctly. We will use this from
the kselftests in future patches.

For the preemptoff tracer:

echo preemptoff > /d/tracing/current_tracer
sleep 1
insmod ./preemptirq_delay_test.ko test_mode=preempt delay=500000
sleep 1
bash-4.3# cat /d/tracing/trace
preempt -1066 2...2 0us@: preemptirq_delay_run <-preemptirq_delay_run
preempt -1066 2...2 500002us : preemptirq_delay_run <-preemptirq_delay_run
preempt -1066 2...2 500004us : tracer_preempt_on <-preemptirq_delay_run
preempt -1066 2...2 500012us : <stack trace>
=> kthread
=> ret_from_fork

For the irqsoff tracer:

echo irqsoff > /d/tracing/current_tracer
sleep 1
insmod ./preemptirq_delay_test.ko test_mode=irq delay=500000
sleep 1
bash-4.3# cat /d/tracing/trace
irq dis -1069 1d..1 0us@: preemptirq_delay_run
irq dis -1069 1d..1 500001us : preemptirq_delay_run
irq dis -1069 1d..1 500002us : tracer_hardirqs_on <-preemptirq_delay_run
irq dis -1069 1d..1 500005us : <stack trace>
=> ret_from_fork

Link: http://lkml.kernel.org/r/20180712213611.GA8743@joelaf.mtv.corp.google.com

Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Julia Cartwright <julia@ni.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Glexiner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
[ Erick is a co-developer of this commit ]
Signed-off-by: Erick Reyes <erickreyes@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# b2441318 01-Nov-2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# d5915816 10-Oct-2017 Joel Fernandes <joelaf@google.com>

tracing: Add support for preempt and irq enable/disable events

Preempt and irq trace events can be used for tracing the start and
end of an atomic section which can be used by a trace viewer like
systrace to graphically view the start and end of an atomic section and
correlate them with latencies and scheduling issues.

This also serves as a prelude to using synthetic events or probes to
rewrite the preempt and irqsoff tracers, along with numerous benefits of
using trace events features for these events.
Link: http://lkml.kernel.org/r/20171006005432.14244-3-joelaf@google.com
Link: http://lkml.kernel.org/r/20171010225137.17370-1-joelaf@google.com

Cc: Peter Zilstra <peterz@infradead.org>
Cc: kernel-team@android.com
Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# 6b0b7551 15-Feb-2017 Anton Blanchard <anton@samba.org>

perf/core: Rename CONFIG_[UK]PROBE_EVENT to CONFIG_[UK]PROBE_EVENTS

We have uses of CONFIG_UPROBE_EVENT and CONFIG_KPROBE_EVENT as
well as CONFIG_UPROBE_EVENTS and CONFIG_KPROBE_EVENTS.

Consistently use the plurals.

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: alexander.shishkin@linux.intel.com
Cc: davem@davemloft.net
Cc: sparclinux@vger.kernel.org
Link: http://lkml.kernel.org/r/20170216060050.20866-1-anton@ozlabs.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# ef6000b4 12-Oct-2016 Linus Torvalds <torvalds@linux-foundation.org>

Disable the __builtin_return_address() warning globally after all

This affectively reverts commit 377ccbb48373 ("Makefile: Mute warning
for __builtin_return_address(>0) for tracing only") because it turns out
that it really isn't tracing only - it's all over the tree.

We already also had the warning disabled separately for mm/usercopy.c
(which this commit also removes), and it turns out that we will also
want to disable it for get_lock_parent_ip(), that is used for at least
TRACE_IRQFLAGS. Which (when enabled) ends up being all over the tree.

Steven Rostedt had a patch that tried to limit it to just the config
options that actually triggered this, but quite frankly, the extra
complexity and abstraction just isn't worth it. We have never actually
had a case where the warning is actually useful, so let's just disable
it globally and not worry about it.

Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Anvin <hpa@zytor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e7c15cd8 22-Jun-2016 Steven Rostedt (Red Hat) <rostedt@goodmis.org>

tracing: Added hardware latency tracer

The hardware latency tracer has been in the PREEMPT_RT patch for some time.
It is used to detect possible SMIs or any other hardware interruptions that
the kernel is unaware of. Note, NMIs may also be detected, but that may be
good to note as well.

The logic is pretty simple. It simply creates a thread that spins on a
single CPU for a specified amount of time (width) within a periodic window
(window). These numbers may be adjusted by their cooresponding names in

/sys/kernel/tracing/hwlat_detector/

The defaults are window = 1000000 us (1 second)
width = 500000 us (1/2 second)

The loop consists of:

t1 = trace_clock_local();
t2 = trace_clock_local();

Where trace_clock_local() is a variant of sched_clock().

The difference of t2 - t1 is recorded as the "inner" timestamp and also the
timestamp t1 - prev_t2 is recorded as the "outer" timestamp. If either of
these differences are greater than the time denoted in
/sys/kernel/tracing/tracing_thresh then it records the event.

When this tracer is started, and tracing_thresh is zero, it changes to the
default threshold of 10 us.

The hwlat tracer in the PREEMPT_RT patch was originally written by
Jon Masters. I have modified it quite a bit and turned it into a
tracer.

Based-on-code-by: Jon Masters <jcm@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


# 377ccbb4 28-Jul-2016 Steven Rostedt <rostedt@goodmis.org>

Makefile: Mute warning for __builtin_return_address(>0) for tracing only

With the latest gcc compilers, they give a warning if
__builtin_return_address() parameter is greater than 0. That is because if
it is used by a function called by a top level function (or in the case of
the kernel, by assembly), it can try to access stack frames outside the
stack and crash the system.

The tracing system uses __builtin_return_address() of up to 2! But it is
well aware of the dangers that it may have, and has even added precautions
to protect against it (see the thunk code in arch/x86/entry/thunk*.S)

Linus originally added KBUILD_CFLAGS that would suppress the warning for the
entire kernel, as simply adding KBUILD_CFLAGS to the tracing directory
wouldn't work. The tracing directory plays a bit with the CFLAGS and
requires a little more logic.

This adds that special logic to only suppress the warning for the tracing
directory. If it is used anywhere else outside of tracing, the warning will
still be triggered.

Link: http://lkml.kernel.org/r/20160728223043.51996267@grimm.local.home

Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


# 7ef224d1 02-Mar-2016 Tom Zanussi <tom.zanussi@linux.intel.com>

tracing: Add 'hist' event trigger command

'hist' triggers allow users to continually aggregate trace events,
which can then be viewed afterwards by simply reading a 'hist' file
containing the aggregation in a human-readable format.

The basic idea is very simple and boils down to a mechanism whereby
trace events, rather than being exhaustively dumped in raw form and
viewed directly, are automatically 'compressed' into meaningful tables
completely defined by the user.

This is done strictly via single-line command-line commands and
without the aid of any kind of programming language or interpreter.

A surprising number of typical use cases can be accomplished by users
via this simple mechanism. In fact, a large number of the tasks that
users typically do using the more complicated script-based tracing
tools, at least during the initial stages of an investigation, can be
accomplished by simply specifying a set of keys and values to be used
in the creation of a hash table.

The Linux kernel trace event subsystem happens to provide an extensive
list of keys and values ready-made for such a purpose in the form of
the event format files associated with each trace event. By simply
consulting the format file for field names of interest and by plugging
them into the hist trigger command, users can create an endless number
of useful aggregations to help with investigating various properties
of the system. See Documentation/trace/events.txt for examples.

hist triggers are implemented on top of the existing event trigger
infrastructure, and as such are consistent with the existing triggers
from a user's perspective as well.

The basic syntax follows the existing trigger syntax. Users start an
aggregation by writing a 'hist' trigger to the event of interest's
trigger file:

# echo hist:keys=xxx [ if filter] > event/trigger

Once a hist trigger has been set up, by default it continually
aggregates every matching event into a hash table using the event key
and a value field named 'hitcount'.

To view the aggregation at any point in time, simply read the 'hist'
file in the same directory as the 'trigger' file:

# cat event/hist

The detailed syntax provides additional options for user control, and
is described exhaustively in Documentation/trace/events.txt and in the
virtual tracing/README file in the tracing subsystem.

Link: http://lkml.kernel.org/r/72d263b5e1853fe9c314953b65833c3aa75479f2.1457029949.git.tom.zanussi@linux.intel.com

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


# 08d43a5f 09-Dec-2015 Tom Zanussi <tom.zanussi@linux.intel.com>

tracing: Add lock-free tracing_map

Add tracing_map, a special-purpose lock-free map for tracing.

tracing_map is designed to aggregate or 'sum' one or more values
associated with a specific object of type tracing_map_elt, which
is associated by the map to a given key.

It provides various hooks allowing per-tracer customization and is
separated out into a separate file in order to allow it to be shared
between multiple tracers, but isn't meant to be generally used outside
of that context.

The tracing_map implementation was inspired by lock-free map
algorithms originated by Dr. Cliff Click:

http://www.azulsystems.com/blog/cliff/2007-03-26-non-blocking-hashtable
http://www.azulsystems.com/events/javaone_2007/2007_LockFreeHash.pdf

Link: http://lkml.kernel.org/r/b43d68d1add33582a396f553c8ef705a33a6a748.1449767187.git.tom.zanussi@linux.intel.com

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


# e1abf2cc 02-Apr-2015 Ingo Molnar <mingo@kernel.org>

bpf: Fix the build on BPF_SYSCALL=y && !CONFIG_TRACING kernels, make it more configurable

So bpf_tracing.o depends on CONFIG_BPF_SYSCALL - but that's not its only
dependency, it also depends on the tracing infrastructure and on kprobes,
without which it will fail to build with:

In file included from kernel/trace/bpf_trace.c:14:0:
kernel/trace/trace.h: In function ‘trace_test_and_set_recursion’:
kernel/trace/trace.h:491:28: error: ‘struct task_struct’ has no member named ‘trace_recursion’
unsigned int val = current->trace_recursion;
[...]

It took quite some time to trigger this build failure, because right now
BPF_SYSCALL is very obscure, depends on CONFIG_EXPERT. So also make BPF_SYSCALL
more configurable, not just under CONFIG_EXPERT.

If BPF_SYSCALL, tracing and kprobes are enabled then enable the bpf_tracing
gateway as well.

We might want to make this an interactive option later on, although
I'd not complicate it unnecessarily: enabling BPF_SYSCALL is enough of
an indicator that the user wants BPF support.

Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 2541517c 25-Mar-2015 Alexei Starovoitov <ast@kernel.org>

tracing, perf: Implement BPF programs attached to kprobes

BPF programs, attached to kprobes, provide a safe way to execute
user-defined BPF byte-code programs without being able to crash or
hang the kernel in any way. The BPF engine makes sure that such
programs have a finite execution time and that they cannot break
out of their sandbox.

The user interface is to attach to a kprobe via the perf syscall:

struct perf_event_attr attr = {
.type = PERF_TYPE_TRACEPOINT,
.config = event_id,
...
};

event_fd = perf_event_open(&attr,...);
ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);

'prog_fd' is a file descriptor associated with BPF program
previously loaded.

'event_id' is an ID of the kprobe created.

Closing 'event_fd':

close(event_fd);

... automatically detaches BPF program from it.

BPF programs can call in-kernel helper functions to:

- lookup/update/delete elements in maps

- probe_read - wraper of probe_kernel_read() used to access any
kernel data structures

BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is
architecture dependent) and return 0 to ignore the event and 1 to store
kprobe event into the ring buffer.

Note, kprobes are a fundamentally _not_ a stable kernel ABI,
so BPF programs attached to kprobes must be recompiled for
every kernel version and user must supply correct LINUX_VERSION_CODE
in attr.kern_version during bpf_prog_load() call.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# c0a80c0c 09-Jan-2015 Heiko Carstens <hca@linux.ibm.com>

ftrace: allow architectures to specify ftrace compile options

If the kernel is compiled with function tracer support the -pg compile option
is passed to gcc to generate extra code into the prologue of each function.

This patch replaces the "open-coded" -pg compile flag with a CC_FLAGS_FTRACE
makefile variable which architectures can override if a different option
should be used for code generation.

Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>


# 798bc6d8 12-Dec-2014 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

tracing / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM

After commit b2b49ccbdd54 (PM: Kconfig: Set PM_RUNTIME if PM_SLEEP is
selected) PM_RUNTIME is always set if PM is set, so files that are
build conditionally if CONFIG_PM_RUNTIME is set may now be build
if CONFIG_PM is set.

Replace CONFIG_PM_RUNTIME with CONFIG_PM in kernel/trace/Makefile
for this reason.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org.


# 8d58e99a 19-Jun-2014 Steven Rostedt (Red Hat) <rostedt@goodmis.org>

seq_buf: Move the seq_buf code to lib/

The seq_buf functions are rather useful outside of tracing. Instead
of having it be dependent on CONFIG_TRACING, move the code into lib/
and allow other users to have access to it even when tracing is not
configured.

The seq_buf utility is similar to the seq_file utility, but instead of
writing sending data back up to userland, it writes it into a buffer
defined at seq_buf_init(). This allows us to send a descriptor around
that writes printf() formatted strings into it that can be retrieved
later.

It is currently used by the tracing facility for such things like trace
events to convert its binary saved data in the ring buffer into an
ASCII human readable context to be displayed in /sys/kernel/debug/trace.

It can also be used for doing NMI prints safely from NMI context into
the seq_buf and retrieved later and dumped to printk() safely. Doing
printk() from an NMI context is dangerous because an NMI can preempt
a current printk() and deadlock on it.

Link: http://lkml.kernel.org/p/20140619213952.058255809@goodmis.org

Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


# 3a161d99 25-Jun-2014 Steven Rostedt (Red Hat) <rostedt@goodmis.org>

tracing: Create seq_buf layer in trace_seq

Create a seq_buf layer that trace_seq sits on. The seq_buf will not
be limited to page size. This will allow other usages of seq_buf
instead of a hard set PAGE_SIZE one that trace_seq has.

Link: http://lkml.kernel.org/r/20141104160221.864997179@goodmis.org
Link: http://lkml.kernel.org/r/20141114011412.170377300@goodmis.org

Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


# 12306276 20-Jun-2014 Steven Rostedt (Red Hat) <rostedt@goodmis.org>

tracing: Move the trace_seq_* functions into its own trace_seq.c file

The trace_seq_*() functions are a nice utility that allows users to manipulate
buffers with printf() like formats. It has its own trace_seq.h header in
include/linux and should be in its own file. Being tied with trace_output.c
is rather awkward.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


# 81dc9f0e 29-May-2014 Steven Rostedt (Red Hat) <rostedt@goodmis.org>

tracing: Add tracepoint benchmark tracepoint

In order to help benchmark the time tracepoints take, a new config
option is added called CONFIG_TRACEPOINT_BENCHMARK. When this option
is set a tracepoint is created called "benchmark:benchmark_event".
When the tracepoint is enabled, it kicks off a kernel thread that
goes into an infinite loop (calling cond_sched() to let other tasks
run), and calls the tracepoint. Each iteration will record the time
it took to write to the tracepoint and the next iteration that
data will be passed to the tracepoint itself. That is, the tracepoint
will report the time it took to do the previous tracepoint.
The string written to the tracepoint is a static string of 128 bytes
to keep the time the same. The initial string is simply a write of
"START". The second string records the cold cache time of the first
write which is not added to the rest of the calculations.

As it is a tight loop, it benchmarks as hot cache. That's fine because
we care most about hot paths that are probably in cache already.

An example of the output:

START
first=3672 [COLD CACHED]
last=632 first=3672 max=632 min=632 avg=316 std=446 std^2=199712
last=278 first=3672 max=632 min=278 avg=303 std=316 std^2=100337
last=277 first=3672 max=632 min=277 avg=296 std=258 std^2=67064
last=273 first=3672 max=632 min=273 avg=292 std=224 std^2=50411
last=273 first=3672 max=632 min=273 avg=288 std=200 std^2=40389
last=281 first=3672 max=632 min=273 avg=287 std=183 std^2=33666

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


# 85f2b082 24-Oct-2013 Tom Zanussi <tom.zanussi@linux.intel.com>

tracing: Add basic event trigger framework

Add a 'trigger' file for each trace event, enabling 'trace event
triggers' to be set for trace events.

'trace event triggers' are patterned after the existing 'ftrace
function triggers' implementation except that triggers are written to
per-event 'trigger' files instead of to a single file such as the
'set_ftrace_filter' used for ftrace function triggers.

The implementation is meant to be entirely separate from ftrace
function triggers, in order to keep the respective implementations
relatively simple and to allow them to diverge.

The event trigger functionality is built on top of SOFT_DISABLE
functionality. It adds a TRIGGER_MODE bit to the ftrace_event_file
flags which is checked when any trace event fires. Triggers set for a
particular event need to be checked regardless of whether that event
is actually enabled or not - getting an event to fire even if it's not
enabled is what's already implemented by SOFT_DISABLE mode, so trigger
mode directly reuses that. Event trigger essentially inherit the soft
disable logic in __ftrace_event_enable_disable() while adding a bit of
logic and trigger reference counting via tm_ref on top of that in a
new trace_event_trigger_enable_disable() function. Because the base
__ftrace_event_enable_disable() code now needs to be invoked from
outside trace_events.c, a wrapper is also added for those usages.

The triggers for an event are actually invoked via a new function,
event_triggers_call(), and code is also added to invoke them for
ftrace_raw_event calls as well as syscall events.

The main part of the patch creates a new trace_events_trigger.c file
to contain the trace event triggers implementation.

The standard open, read, and release file operations are implemented
here.

The open() implementation sets up for the various open modes of the
'trigger' file. It creates and attaches the trigger iterator and sets
up the command parser. If opened for reading set up the trigger
seq_ops.

The read() implementation parses the event trigger written to the
'trigger' file, looks up the trigger command, and passes it along to
that event_command's func() implementation for command-specific
processing.

The release() implementation does whatever cleanup is needed to
release the 'trigger' file, like releasing the parser and trigger
iterator, etc.

A couple of functions for event command registration and
unregistration are added, along with a list to add them to and a mutex
to protect them, as well as an (initially empty) registration function
to add the set of commands that will be added by future commits, and
call to it from the trace event initialization code.

also added are a couple trigger-specific data structures needed for
these implementations such as a trigger iterator and a struct for
trigger-specific data.

A couple structs consisting mostly of function meant to be implemented
in command-specific ways, event_command and event_trigger_ops, are
used by the generic event trigger command implementations. They're
being put into trace.h alongside the other trace_event data structures
and functions, in the expectation that they'll be needed in several
trace_event-related files such as trace_events_trigger.c and
trace_events.c.

The event_command.func() function is meant to be called by the trigger
parsing code in order to add a trigger instance to the corresponding
event. It essentially coordinates adding a live trigger instance to
the event, and arming the triggering the event.

Every event_command func() implementation essentially does the
same thing for any command:

- choose ops - use the value of param to choose either a number or
count version of event_trigger_ops specific to the command
- do the register or unregister of those ops
- associate a filter, if specified, with the triggering event

The reg() and unreg() ops allow command-specific implementations for
event_trigger_op registration and unregistration, and the
get_trigger_ops() op allows command-specific event_trigger_ops
selection to be parameterized. When a trigger instance is added, the
reg() op essentially adds that trigger to the triggering event and
arms it, while unreg() does the opposite. The set_filter() function
is used to associate a filter with the trigger - if the command
doesn't specify a set_filter() implementation, the command will ignore
filters.

Each command has an associated trigger_type, which serves double duty,
both as a unique identifier for the command as well as a value that
can be used for setting a trigger mode bit during trigger invocation.

The signature of func() adds a pointer to the event_command struct,
used to invoke those functions, along with a command_data param that
can be passed to the reg/unreg functions. This allows func()
implementations to use command-specific blobs and supports code
re-use.

The event_trigger_ops.func() command corrsponds to the trigger 'probe'
function that gets called when the triggering event is actually
invoked. The other functions are used to list the trigger when
needed, along with a couple mundane book-keeping functions.

This also moves event_file_data() into trace.h so it can be used
outside of trace_events.c.

Link: http://lkml.kernel.org/r/316d95061accdee070aac8e5750afba0192fa5b9.1382622043.git.tom.zanussi@linux.intel.com

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Idea-by: Steve Rostedt <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


# ea632e9f 02-Sep-2012 Josh Triplett <josh@joshtriplett.org>

trace: Stop compiling in trace_clock unconditionally

Commit 56449f437 "tracing: make the trace clocks available generally",
in April 2009, made trace_clock available unconditionally, since
CONFIG_X86_DS used it too.

Commit faa4602e47 "x86, perf, bts, mm: Delete the never used BTS-ptrace code",
in March 2010, removed CONFIG_X86_DS, and now only CONFIG_RING_BUFFER (split
out from CONFIG_TRACING for general use) has a dependency on trace_clock. So,
only compile in trace_clock with CONFIG_RING_BUFFER or CONFIG_TRACING
enabled.

Link: http://lkml.kernel.org/r/20120903024513.GA19583@leaf

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


# 47239c4d 20-Jul-2012 Steven Rostedt <srostedt@redhat.com>

ftrace: Only compile ftrace selftest if selftests are enabled

No need to compile in the ftrace selftest helper file if selftests are
not being executed.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


# f3f096cf 11-Apr-2012 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

tracing: Provide trace events interface for uprobes

Implements trace_event support for uprobes. In its current form
it can be used to put probes at a specified offset in a file and
dump the required registers when the code flow reaches the
probed address.

The following example shows how to dump the instruction pointer
and %ax a register at the probed text address. Here we are
trying to probe zfree in /bin/zsh:

# cd /sys/kernel/debug/tracing/
# cat /proc/`pgrep zsh`/maps | grep /bin/zsh | grep r-xp
00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh
# objdump -T /bin/zsh | grep -w zfree
0000000000446420 g DF .text 0000000000000012 Base
zfree # echo 'p /bin/zsh:0x46420 %ip %ax' > uprobe_events
# cat uprobe_events
p:uprobes/p_zsh_0x46420 /bin/zsh:0x0000000000046420
# echo 1 > events/uprobes/enable
# sleep 20
# echo 0 > events/uprobes/enable
# cat trace
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
zsh-24842 [006] 258544.995456: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
zsh-24842 [007] 258545.000270: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
zsh-24842 [002] 258545.043929: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
zsh-24842 [004] 258547.046129: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Linux-mm <linux-mm@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120411103043.GB29437@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 8ab83f56 09-Apr-2012 Srikar Dronamraju <srikar@linux.vnet.ibm.com>

tracing: Extract out common code for kprobes/uprobes trace events

Move parts of trace_kprobe.c that can be shared with upcoming
trace_uprobe.c. Common code to kernel/trace/trace_probe.h and
kernel/trace/trace_probe.c. There are no functional changes.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Linux-mm <linux-mm@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120409091144.8343.76218.sendpatchset@srdronam.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# d3283fb4 09-Apr-2012 Stephen Boyd <sboyd@codeaurora.org>

trace: Remove unused workqueue tracer

This tracer was temporarily removed in 6416669 (workqueue:
temporarily remove workqueue tracing, 2010-06-29) but never
reinstated after concurrency managed workqueues were completed.
For almost two years it hasn't been compilable so it seems nobody
is using it. Delete it.

Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>


# 2a5306cc 29-Sep-2011 Ming Lei <ming.lei@canonical.com>

PM / Tracing: build rpm-traces.c only if CONFIG_PM_RUNTIME is set

Do not build kernel/trace/rpm-traces.c if CONFIG_PM_RUNTIME is not
set, which avoids a build failure.

[rjw: Added the changelog and modified the subject slightly.]

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>


# 53b615cc 27-Sep-2011 Ming Lei <ming.lei@canonical.com>

PM / Runtime: Introduce trace points for tracing rpm_* functions

This patch introduces 3 trace points to prepare for tracing
rpm_idle/rpm_suspend/rpm_resume functions, so we can use these
trace points to replace the current dev_dbg().

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>


# 1d0e78e3 11-Aug-2011 Jiri Olsa <jolsa@redhat.com>

tracing/filter: Add startup tests for events filter

Adding automated tests running as late_initcall. Tests are
compiled in with CONFIG_FTRACE_STARTUP_TEST option.

Adding test event "ftrace_test_filter" used to simulate
filter processing during event occurance.

String filters are compiled and tested against several
test events with different values.

Also testing that evaluation of explicit predicates is ommited
due to the lazy filter evaluation.

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1313072754-4620-11-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


# 870915e0 28-Oct-2010 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

tracing: Fix TRACE_EVENT power tracepoint creation

DEFINE_TRACE should also exist when CONFIG_EVENT_TRACING=n. Otherwise, setting
only TRACEPOINTS=y is broken.

Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
LKML-Reference: <20101028153117.GA4051@Krystal>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


# 955b61e5 05-Aug-2010 Jason Wessel <jason.wessel@windriver.com>

ftrace,kdb: Extend kdb to be able to dump the ftrace buffer

Add in a helper function to allow the kdb shell to dump the ftrace
buffer.

Modify trace.c to expose the capability to iterate over the ftrace
buffer in a read only capacity.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>


# f376bf5f 15-Jul-2010 Frederic Weisbecker <fweisbec@gmail.com>

tracing: Remove sysprof ftrace plugin

The sysprof ftrace plugin doesn't seem to be seriously used
somewhere. There is a branch in the sysprof tree that makes
an interface to it, but the real sysprof tool uses either its
own module or perf events.

Drop the sysprof ftrace plugin then, as it's mostly useless.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Soeren Sandmann <sandmann@daimi.au.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>


# 5d550467 15-Jul-2010 Frederic Weisbecker <fweisbec@gmail.com>

tracing: Remove ksym tracer

The ksym (breakpoint) ftrace plugin has been superseded by perf
tools that are much more poweful to use the cpu breakpoints.
This tracer doesn't bring more feature. It has been deprecated
for a while now, lets remove it.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>


# 039ca4e7 26-May-2010 Li Zefan <lizf@cn.fujitsu.com>

tracing: Remove kmemtrace ftrace plugin

We have been resisting new ftrace plugins and removing existing
ones, and kmemtrace has been superseded by kmem trace events
and perf-kmem, so we remove it.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
[ remove kmemtrace from the makefile, handle slob too ]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>


# 30dbb20e 26-May-2010 Américo Wang <xiyou.wangcong@gmail.com>

tracing: Remove boot tracer

The boot tracer is useless. It simply logs the initcalls
but in fact these initcalls are also logged through printk
while using the initcall_debug kernel parameter.

Nobody seem to be using it so far. Then just remove it.

Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Chase Douglas <chase.douglas@canonical.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <20100526105753.GA5677@cr0.nay.redhat.com>
[ remove the hooks in main.c, and the headers ]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>


# faa4602e 25-Mar-2010 Peter Zijlstra <a.p.zijlstra@chello.nl>

x86, perf, bts, mm: Delete the never used BTS-ptrace code

Support for the PMU's BTS features has been upstreamed in
v2.6.32, but we still have the old and disabled ptrace-BTS,
as Linus noticed it not so long ago.

It's buggy: TIF_DEBUGCTLMSR is trampling all over that MSR without
regard for other uses (perf) and doesn't provide the flexibility
needed for perf either.

Its users are ptrace-block-step and ptrace-bts, since ptrace-bts
was never used and ptrace-block-step can be implemented using a
much simpler approach.

So axe all 3000 lines of it. That includes the *locked_memory*()
APIs in mm/mlock.c as well.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Markus Metzger <markus.t.metzger@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <20100325135413.938004390@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 97d5a220 04-Mar-2010 Frederic Weisbecker <fweisbec@gmail.com>

perf: Drop the obsolete profile naming for trace events

Drop the obsolete "profile" naming used by perf for trace events.
Perf can now do more than simple events counting, so generalize
the API naming.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Jason Baron <jbaron@redhat.com>


# 07b139c8 20-Dec-2009 Li Zefan <lizf@cn.fujitsu.com>

perf events: Remove CONFIG_EVENT_PROFILE

Quoted from Ingo:

| This reminds me - i think we should eliminate CONFIG_EVENT_PROFILE -
| it's an unnecessary Kconfig complication. If both PERF_EVENTS and
| EVENT_TRACING is enabled we should expose generic tracepoints.
|
| Nor is it limited to event 'profiling', so it has become a misnomer as
| well.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <4B2F1557.2050705@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 77b44d1b 03-Nov-2009 Masami Hiramatsu <mhiramat@redhat.com>

tracing/kprobes: Rename Kprobe-tracer to kprobe-event

Rename Kprobes-based event tracer to kprobes-based tracing event
(kprobe-event), since it is not a tracer but an extensible
tracing event interface.

This also changes CONFIG_KPROBE_TRACER to CONFIG_KPROBE_EVENT
and sets it y by default.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
LKML-Reference: <20091104001247.3454.14131.stgit@harusame>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 61613521 17-Sep-2009 Arjan van de Ven <arjan@linux.intel.com>

tracing, perf: Convert the power tracer into an event tracer

This patch converts the existing power tracer into an event tracer,
so that power events (C states and frequency changes) can be
tracked via "perf".

This also removes the perl script that was used to demo the tracer;
its functionality is being replaced entirely with timechart.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20090912130542.6d314860@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 413d37d1 13-Aug-2009 Masami Hiramatsu <mhiramat@redhat.com>

tracing: Add kprobe-based event tracer

Add kprobes-based event tracer on ftrace.

This tracer is similar to the events tracer which is based on Tracepoint
infrastructure. Instead of Tracepoint, this tracer is based on kprobes
(kprobe and kretprobe). It probes anywhere where kprobes can probe(this
means, all functions body except for __kprobes functions).

Similar to the events tracer, this tracer doesn't need to be activated
via current_tracer, instead of that, just set probe points via
/sys/kernel/debug/tracing/kprobe_events. And you can set filters on each
probe events via /sys/kernel/debug/tracing/events/kprobes/<EVENT>/filter.

This tracer supports following probe arguments for each probe.

%REG : Fetch register REG
sN : Fetch Nth entry of stack (N >= 0)
sa : Fetch stack address.
@ADDR : Fetch memory at ADDR (ADDR should be in kernel)
@SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
aN : Fetch function argument. (N >= 0)
rv : Fetch return value.
ra : Fetch return address.
+|-offs(FETCHARG) : fetch memory at FETCHARG +|- offs address.

See Documentation/trace/kprobetrace.txt in the next patch for details.

Changes from v13:
- Support 'sa' for stack address.
- Use call->data instead of container_of() macro.

[fweisbec@gmail.com: Fixed conflict against latest tracing/core]

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Przemysław Pawełczyk <przemyslaw@pawelczyk.it>
Cc: Roland McGrath <roland@redhat.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
LKML-Reference: <20090813203510.31965.29123.stgit@localhost.localdomain>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>


# 31ffe249 26-Aug-2009 David S. Miller <davem@davemloft.net>

net: Temporarily backout SKB sources tracer.

Steven Rostedt has suggested that Neil work with the tracing
folks, trying to use TRACE_EVENT as the mechanism for
implementation. And if that doesn't workout we can investigate
other solutions such as that one which was tried here.

This reverts the following 2 commits:

5a165657bef7c47e5ff4cd138f7758ef6278e87b
("net: skb ftracer - Add config option to enable new ftracer (v3)")

9ec04da7489d2c9ae01ea6e9b5fa313ccf3d35fb
("net: skb ftracer - Add actual ftrace code to kernel (v3)")

Signed-off-by: David S. Miller <davem@davemloft.net>


# 9ec04da7 12-Aug-2009 Neil Horman <nhorman@tuxdriver.com>

net: skb ftracer - Add actual ftrace code to kernel (v3)

skb allocation / consumption correlator

Add ftracer module to kernel to print out a list that correlates a process id,
an skb it read, and the numa nodes on wich the process was running when it was
read along with the numa node the skbuff was allocated on.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>

Makefile | 1
trace.h | 19 ++++++
trace_skb_sources.c | 154 ++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 174 insertions(+)
Signed-off-by: David S. Miller <davem@davemloft.net>


# 55782138 08-Jun-2009 Li Zefan <lizf@cn.fujitsu.com>

tracing/events: convert block trace points to TRACE_EVENT()

TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
these new capabilities to this tracepoint:

- zero-copy and per-cpu splice() tracing
- binary tracing without printf overhead
- structured logging records exposed under /debug/tracing/events
- trace events embedded in function tracer output and other plugins
- user-defined, per tracepoint filter expressions
...

Cons:

- no dev_t info for the output of plug, unplug_timer and unplug_io events.
no dev_t info for getrq and sleeprq events if bio == NULL.
no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.

This is mainly because we can't get the deivce from a request queue.
But this may change in the future.

- A packet command is converted to a string in TP_assign, not TP_print.
While blktrace do the convertion just before output.

Since pc requests should be rather rare, this is not a big issue.

- In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
has a unique format, which means we have some unused data in a trace entry.

The overhead is minimized by using __dynamic_array() instead of __array().

I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:

dd dd + ioctl blktrace dd + TRACE_EVENT (splice)
1 7.36s, 42.7 MB/s 7.50s, 42.0 MB/s 7.41s, 42.5 MB/s
2 7.43s, 42.3 MB/s 7.48s, 42.1 MB/s 7.43s, 42.4 MB/s
3 7.38s, 42.6 MB/s 7.45s, 42.2 MB/s 7.41s, 42.5 MB/s

So the overhead of tracing is very small, and no regression when using
those trace events vs blktrace.

And the binary output of TRACE_EVENT is much smaller than blktrace:

# ls -l -h
-rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
-rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
-rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out

Following are some comparisons between TRACE_EVENT and blktrace:

plug:
kjournald-480 [000] 303.084981: block_plug: [kjournald]
kjournald-480 [000] 303.084981: 8,0 P N [kjournald]

unplug_io:
kblockd/0-118 [000] 300.052973: block_unplug_io: [kblockd/0] 1
kblockd/0-118 [000] 300.052974: 8,0 U N [kblockd/0] 1

remap:
kjournald-480 [000] 303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
kjournald-480 [000] 303.085043: 8,0 A W 102736992 + 8 <- (8,8) 33384

bio_backmerge:
kjournald-480 [000] 303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
kjournald-480 [000] 303.085086: 8,0 M W 102737032 + 8 [kjournald]

getrq:
kjournald-480 [000] 303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
kjournald-480 [000] 303.084975: 8,0 G W 102736984 + 8 [kjournald]

bash-2066 [001] 1072.953770: 8,0 G N [bash]
bash-2066 [001] 1072.953773: block_getrq: 0,0 N 0 + 0 [bash]

rq_complete:
konsole-2065 [001] 300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
konsole-2065 [001] 300.053191: 8,0 C W 103669040 + 16 [0]

ksoftirqd/1-7 [001] 1072.953811: 8,0 C N (5a 00 08 00 00 00 00 00 24 00) [0]
ksoftirqd/1-7 [001] 1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]

rq_insert:
kjournald-480 [000] 303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
kjournald-480 [000] 303.084986: 8,0 I W 102736984 + 8 [kjournald]

Changelog from v2 -> v3:

- use the newly introduced __dynamic_array().

Changelog from v1 -> v2:

- use __string() instead of __array() to minimize the memory required
to store hex dump of rq->cmd().

- support large pc requests.

- add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.

- some cleanups.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


# 0722db01 01-Jun-2009 K.Prasad <prasad@linux.vnet.ibm.com>

hw-breakpoints: ftrace plugin for kernel symbol tracing using HW Breakpoint interfaces

This patch adds an ftrace plugin to detect and profile memory access over kernel
variables. It uses HW Breakpoint interfaces to 'watch memory addresses.

Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>


# 5092dbc9 05-May-2009 Steven Rostedt <srostedt@redhat.com>

ring-buffer: add benchmark and tester

This patch adds code that can benchmark the ring buffer as well as
test it. This code can be compiled into the kernel (not recommended)
or as a module.

A separate ring buffer is used to not interfer with other users, like
ftrace. It creates a producer and a consumer (option to disable creation
of the consumer) and will run for 10 seconds, then sleep for 10 seconds
and then repeat.

While running, the producer will write 10 byte loads into the ring
buffer with just putting in the current CPU number. The reader will
continually try to read the buffer. The reader will alternate from reading
the buffer via event by event, or by full pages.

The output is a pr_info, thus it will fill up the syslogs.

Starting ring buffer hammer
End ring buffer hammer
Time: 9000349 (usecs)
Overruns: 12578640
Read: 5358440 (by events)
Entries: 0
Total: 17937080
Missed: 0
Hit: 17937080
Entries per millisec: 1993
501 ns per entry
Sleeping for 10 secs
Starting ring buffer hammer
End ring buffer hammer
Time: 9936350 (usecs)
Overruns: 0
Read: 28146644 (by pages)
Entries: 74
Total: 28146718
Missed: 0
Hit: 28146718
Entries per millisec: 2832
353 ns per entry
Sleeping for 10 secs

Time: is the time the test ran
Overruns: the number of events that were overwritten and not read
Read: the number of events read (either by pages or events)
Entries: the number of entries left in the buffer
(the by pages will only read full pages)
Total: Entries + Read + Overruns
Missed: the number of entries that failed to write
Hit: the number of entries that were written

The above example shows that it takes ~353 nanosecs per entry when
there is a reader, reading by pages (and no overruns)

The event by event reader slowed the producer down to 501 nanosecs.

[ Impact: see how changes to the ring buffer affect stability and performance ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


# f42c85e7 12-Apr-2009 Steven Rostedt <srostedt@redhat.com>

tracing/events: move the ftrace event tracing code to core

This patch moves the ftrace creation into include/trace/ftrace.h and
simplifies the work of developers in adding new tracepoints.
Just the act of creating the trace points in include/trace and including
define_trace.h will create the events in the debugfs/tracing/events
directory.

This patch removes the need of include/trace/trace_events.h

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


# 56449f43 14-Apr-2009 Ingo Molnar <mingo@elte.hu>

tracing: make the trace clocks available generally

Jeremy Fitzhardinge reported this build failure:

LD .tmp_vmlinux1
arch/x86/kernel/built-in.o: In function `ds_take_timestamp':
git/linux/arch/x86/kernel/ds.c:1380: undefined reference to `trace_clock_global'
git/linux/arch/x86/kernel/ds.c:1380: undefined reference to `trace_clock_global'

Which is due to !CONFIG_TRACING && CONFIG_X86_DS=y.

Expose the trace clock code to CONFIG_X86_DS as well.

[ Unfortunately librarizing doesnt work well - ancient architectures
with no raw_local_irq_save() primitive break the build. ]

Reported-by: Jeremy Fitzhardinge <jeremy@goop.org>
LKML-Reference: <49E4413F.7070700@goop.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 5f77a88b 08-Apr-2009 Tom Zanussi <tzanussi@gmail.com>

tracing/infrastructure: separate event tracer from event support

Add a new config option, CONFIG_EVENT_TRACING that gets selected
when CONFIG_TRACING is selected and adds everything needed by the stuff
in trace_export - basically all the event tracing support needed by e.g.
bprint, minus the actual events, which are only included if
CONFIG_EVENT_TRACER is selected.

So CONFIG_EVENT_TRACER can be used to turn on or off the generated events
(what I think of as the 'event tracer'), while CONFIG_EVENT_TRACING turns
on or off the base event tracing support used by both the event tracer and
the other things such as bprint that can't be configured out.

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: fweisbec@gmail.com
LKML-Reference: <1239178441.10295.34.camel@tropicana>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 7ce7e424 22-Mar-2009 Tom Zanussi <tzanussi@gmail.com>

tracing: add per-event filtering

This patch adds per-event filtering to the event tracing subsystem.

It adds a 'filter' debugfs file to each event directory. This file can
be written to to set filters; reading from it will display the current
set of filters set for that event.

Basically, any field listed in the 'format' file for an event can be
filtered on (including strings, but not yet other array types) using
either matching ('==') or non-matching ('!=') 'predicates'. A
'predicate' can be either a single expression:

# echo pid != 0 > filter

# cat filter
pid != 0

or a compound expression of up to 8 sub-expressions combined using '&&'
or '||':

# echo comm == Xorg > filter
# echo "&& sig != 29" > filter

# cat filter
comm == Xorg
&& sig != 29

Only events having field values matching an expression will be available
in the trace output; non-matching events are discarded.

Note that a compound expression is built up by echoing each
sub-expression separately - it's not the most efficient way to do
things, but it keeps the parser simple and assumes that compound
expressions will be relatively uncommon. In any case, a subsequent
patch introducing a way to set filters for entire subsystems should
mitigate any need to do this for lots of events.

Setting a filter without an '&&' or '||' clears the previous filter
completely and sets the filter to the new expression:

# cat filter
comm == Xorg
&& sig != 29

# echo comm != Xorg

# cat filter
comm != Xorg

To clear a filter, echo 0 to the filter file:

# echo 0 > filter
# cat filter
none

The limit of 8 predicates for a compound expression is arbitrary - for
efficiency, it's implemented as an array of pointers to predicates, and
8 seemed more than enough for any filter...

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1237710665.7703.48.camel@charm-linux>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# ac199db0 19-Mar-2009 Peter Zijlstra <a.p.zijlstra@chello.nl>

ftrace: event profile hooks

Impact: new tracing infrastructure feature

Provide infrastructure to generate software perf counter events
from tracepoints.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20090319194233.557364871@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# ee08c6ec 06-Mar-2009 Frederic Weisbecker <fweisbec@gmail.com>

tracing/ftrace: syscall tracing infrastructure, basics

Provide basic callbacks to do syscall tracing.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <1236401580-5758-2-git-send-email-fweisbec@gmail.com>
[ simplified it to a trace_printk() for now. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 769b0441 06-Mar-2009 Frederic Weisbecker <fweisbec@gmail.com>

tracing/core: drop the old trace_printk() implementation in favour of trace_bprintk()

Impact: faster and lighter tracing

Now that we have trace_bprintk() which is faster and consume lesser
memory than trace_printk() and has the same purpose, we can now drop
the old implementation in favour of the binary one from trace_bprintk(),
which means we move all the implementation of trace_bprintk() to
trace_printk(), so the Api doesn't change except that we must now use
trace_seq_bprintk() to print the TRACE_PRINT entries.

Some changes result of this:

- Previously, trace_bprintk depended of a single tracer and couldn't
work without. This tracer has been dropped and the whole implementation
of trace_printk() (like the module formats management) is now integrated
in the tracing core (comes with CONFIG_TRACING), though we keep the file
trace_printk (previously trace_bprintk.c) where we can find the module
management. Thus we don't overflow trace.c

- changes some parts to use trace_seq_bprintk() to print TRACE_PRINT entries.

- change a bit trace_printk/trace_vprintk macros to support non-builtin formats
constants, and fix 'const' qualifiers warnings. But this is all transparent for
developers.

- etc...

V2:

- Rebase against last changes
- Fix mispell on the changelog

V3:

- Rebase against last changes (moving trace_printk() to kernel.h)

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1236356510-8381-5-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 1427cdf0 06-Mar-2009 Lai Jiangshan <laijs@cn.fujitsu.com>

tracing: infrastructure for supporting binary record

Impact: save on memory for tracing

Current tracers are typically using a struct(like struct ftrace_entry,
struct ctx_switch_entry, struct special_entr etc...)to record a binary
event. These structs can only record a their own kind of events.
A new kind of tracer need a new struct and a lot of code too handle it.

So we need a generic binary record for events. This infrastructure
is for this purpose.

[fweisbec@gmail.com: rebase against latest -tip, make it safe while sched
tracing as reported by Steven Rostedt]

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1236356510-8381-3-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 770cb243 05-Mar-2009 Steven Rostedt <srostedt@redhat.com>

tracing: add format files for ftrace default entries

Impact: allow user apps to read binary format of basic ftrace entries

Currently, only defined raw events export their formats so a binary
reader can parse them. There's no reason that the default ftrace entries
can't export their formats.

This patch adds a subsystem called "ftrace" in the events directory
that includes the ftrace entries for basic ftrace recorded items.

These only have three files in the events directory:

type : printf
available_types : printf
format : format for the event entry

For example:

# cat /debug/tracing/events/ftrace/wakeup/format
name: wakeup
ID: 3
format:
field:unsigned char type; offset:0; size:1;
field:unsigned char flags; offset:1; size:1;
field:unsigned char preempt_count; offset:2; size:1;
field:int pid; offset:4; size:4;
field:int tgid; offset:8; size:4;

field:unsigned int prev_pid; offset:12; size:4;
field:unsigned char prev_prio; offset:16; size:1;
field:unsigned char prev_state; offset:17; size:1;
field:unsigned int next_pid; offset:20; size:4;
field:unsigned char next_prio; offset:24; size:1;
field:unsigned char next_state; offset:25; size:1;
field:unsigned int next_cpu; offset:28; size:4;

print fmt: "%u:%u:%u ==+ %u:%u:%u [%03u]"

Signed-off-by: Steven Rostedt <srostedt@redhat.com>


# 14131f2f 26-Feb-2009 Ingo Molnar <mingo@elte.hu>

tracing: implement trace_clock_*() APIs

Impact: implement new tracing timestamp APIs

Add three trace clock variants, with differing scalability/precision
tradeoffs:

- local: CPU-local trace clock
- medium: scalable global clock with some jitter
- global: globally monotonic, serialized clock

Make the ring-buffer use the local trace clock internally.

Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# f3fe8e4a 24-Feb-2009 Steven Rostedt <srostedt@redhat.com>

tracing: add schedule events to event trace

This patch changes the trace/sched.h to use the DECLARE_TRACE_FMT
such that they are automatically registered with the event tracer.

And it also adds the tracing sched headers to kernel/trace/events.c

Signed-off-by: Steven Rostedt <srostedt@redhat.com>


# b77e38aa 24-Feb-2009 Steven Rostedt <srostedt@redhat.com>

tracing: add event trace infrastructure

This patch creates the event tracing infrastructure of ftrace.
It will create the files:

/debug/tracing/available_events
/debug/tracing/set_event

The available_events will list the trace points that have been
registered with the event tracer.

set_events will allow the user to enable or disable an event hook.

example:

# echo sched_wakeup > /debug/tracing/set_event

Will enable the sched_wakeup event (if it is registered).

# echo "!sched_wakeup" >> /debug/tracing/set_event

Will disable the sched_wakeup event (and only that event).

# echo > /debug/tracing/set_event

Will disable all events (notice the '>')

# cat /debug/tracing/available_events > /debug/tracing/set_event

Will enable all registered event hooks.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>


# 2db270a8 07-Feb-2009 Frederic Weisbecker <fweisbec@gmail.com>

tracing/blktrace: move the tracing file to kernel/trace

Impact: cleanup

Move blktrace.c to kernel/trace, also move its config entry.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# e1d8aa9f 12-Jan-2009 Frederic Weisbecker <fweisbec@gmail.com>

tracing: add a new workqueue tracer

Impact: new tracer

The workqueue tracer provides some statistical informations
about each cpu workqueue thread such as the number of the
works inserted and executed since their creation. It can help
to evaluate the amount of work each of them have to perform.
For example it can help a developer to decide whether he should
choose a per cpu workqueue instead of a singlethreaded one.

It only traces statistical informations for now but it will probably later
provide event tracing too.

Such a tracer could help too, and be improved, to help rt priority sorted
workqueue development.

To have a snapshot of the workqueues state at any time, just do

cat /debugfs/tracing/trace_stat/workqueues

Ie:

1 125 125 reiserfs/1
1 0 0 scsi_tgtd/1
1 0 0 aio/1
1 0 0 ata/1
1 114 114 kblockd/1
1 0 0 kintegrityd/1
1 2147 2147 events/1

0 0 0 kpsmoused
0 105 105 reiserfs/0
0 0 0 scsi_tgtd/0
0 0 0 aio/0
0 0 0 ata_aux
0 0 0 ata/0
0 0 0 cqueue
0 0 0 kacpi_notify
0 0 0 kacpid
0 149 149 kblockd/0
0 0 0 kintegrityd/0
0 1000 1000 khelper
0 2270 2270 events/0

Changes in V2:

_ Drop the static array based on NR_CPU and dynamically allocate the stat array
with num_possible_cpus() and other cpu mask facilities....
_ Trace workqueue insertion at a bit lower level (insert_work instead of queue_work) to handle
even the workqueue barriers.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 36994e58 29-Dec-2008 Frederic Weisbecker <fweisbec@gmail.com>

tracing/kmemtrace: normalize the raw tracer event to the unified tracing API

Impact: new tracer plugin

This patch adapts kmemtrace raw events tracing to the unified tracing API.

To enable and use this tracer, just do the following:

echo kmemtrace > /debugfs/tracing/current_tracer
cat /debugfs/tracing/trace

You will have the following output:

# tracer: kmemtrace
#
#
# ALLOC TYPE REQ GIVEN FLAGS POINTER NODE CALLER
# FREE | | | | | | | |
# |

type_id 1 call_site 18446744071565527833 ptr 18446612134395152256
type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1
type_id 1 call_site 18446744071565585534 ptr 18446612134405955584
type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1
type_id 0 call_site 18446744071565636711 ptr 18446612134345164672 bytes_req 240 bytes_alloc 240 gfp_flags 208 node -1
type_id 1 call_site 18446744071565585534 ptr 18446612134405955584
type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1
type_id 0 call_site 18446744071565636711 ptr 18446612134345164912 bytes_req 240 bytes_alloc 240 gfp_flags 208 node -1
type_id 1 call_site 18446744071565585534 ptr 18446612134405955584
type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1
type_id 0 call_site 18446744071565636711 ptr 18446612134345165152 bytes_req 240 bytes_alloc 240 gfp_flags 208 node -1
type_id 0 call_site 18446744071566144042 ptr 18446612134346191680 bytes_req 1304 bytes_alloc 1312 gfp_flags 208 node -1
type_id 1 call_site 18446744071565585534 ptr 18446612134405955584
type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1
type_id 1 call_site 18446744071565585534 ptr 18446612134405955584

That was to stay backward compatible with the format output produced in
inux/tracepoint.h.

This is the default ouput, but note that I tried something else.

If you change an option:

echo kmem_minimalistic > /debugfs/trace_options

and then cat /debugfs/trace, you will have the following output:

# tracer: kmemtrace
#
#
# ALLOC TYPE REQ GIVEN FLAGS POINTER NODE CALLER
# FREE | | | | | | | |
# |

- C 0xffff88007c088780 file_free_rcu
+ K 4096 4096 000000d0 0xffff88007cad6000 -1 getname
- C 0xffff88007cad6000 putname
+ K 4096 4096 000000d0 0xffff88007cad6000 -1 getname
+ K 240 240 000000d0 0xffff8800790dc780 -1 d_alloc
- C 0xffff88007cad6000 putname
+ K 4096 4096 000000d0 0xffff88007cad6000 -1 getname
+ K 240 240 000000d0 0xffff8800790dc870 -1 d_alloc
- C 0xffff88007cad6000 putname
+ K 4096 4096 000000d0 0xffff88007cad6000 -1 getname
+ K 240 240 000000d0 0xffff8800790dc960 -1 d_alloc
+ K 1304 1312 000000d0 0xffff8800791d7340 -1 reiserfs_alloc_inode
- C 0xffff88007cad6000 putname
+ K 4096 4096 000000d0 0xffff88007cad6000 -1 getname
- C 0xffff88007cad6000 putname
+ K 992 1000 000000d0 0xffff880079045b58 -1 alloc_inode
+ K 768 1024 000080d0 0xffff88007c096400 -1 alloc_pipe_info
+ K 240 240 000000d0 0xffff8800790dca50 -1 d_alloc
+ K 272 320 000080d0 0xffff88007c088780 -1 get_empty_filp
+ K 272 320 000080d0 0xffff88007c088000 -1 get_empty_filp

Yeah I shall confess kmem_minimalistic should be: kmem_alternative.

Whatever, I find it more readable but this a personal opinion of course.
We can drop it if you want.

On the ALLOC/FREE column, + means an allocation and - a free.

On the type column, you have K = kmalloc, C = cache, P = page

I would like the flags to be GFP_* strings but that would not be easy to not
break the column with strings....

About the node...it seems to always be -1. I don't know why but that shouldn't
be difficult to find.

I moved linux/tracepoint.h to trace/tracepoint.h as well. I think that would
be more easy to find the tracer headers if they are all in their common
directory.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# dbd0b4b3 28-Dec-2008 Frederic Weisbecker <fweisbec@gmail.com>

tracing/ftrace: provide the base infrastructure for histogram tracing

Impact: extend the tracing API

The goal of this patch is to normalize and make more easy the
implementation of statistical (histogram) tracing.

It implements a trace_stat file into the /debugfs/tracing directory where
one can print a one-shot output of statistics/histogram entries.

A tracer has to provide two basic iterator callbacks:

stat_start() => the first entry
stat_next(prev, idx) => the next one.

Note that it is adapted for arrays or hash tables or lists.... since it
provides a pointer to the previous entry and the current index of the
iterator.

These two callbacks are called to get a snapshot of the statistics at each
opening of the trace_stat file because. The values are so updated between
two "cat trace_stat". And the tracer is free to lock its datas during the
iteration to keep consistent values.

Since it is almost always interesting to sort statisticals values to
address the problems by priority, this infrastructure provides a "sorting"
of the stat entries too if desired. A tracer has just to provide a
stat_cmp callback to compare two entries and the stat tracing
infrastructure will build a sorted list of the given entries.

A last callback, called stat_headers, can be implemented by a tracer to
output headers on its trace.

If one of these callbacks is changed on runtime, it just have to signal it
to the stat tracing API by calling the init_tracer_stat() helper.

Changes in V2:

- Fix a memory leak if the user opens multiple times the trace_stat file
without closing it. Now we always free our list before rebuilding it.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# f0868d1e 23-Dec-2008 Steven Rostedt <srostedt@redhat.com>

ftrace: set up trace event hash infrastructure

Impact: simplify/generalize/refactor trace.c

The trace.c file is becoming more difficult to maintain due to the
growing number of events. There is several formats that an event may
be printed. This patch sets up the infrastructure of an event hash to
allow for events to register how they should be printed.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# a93751ca 11-Dec-2008 Markus Metzger <markut.t.metzger@intel.com>

x86, bts, ftrace: adapt the hw-branch-tracer to the ds.c interface

Impact: restructure code, cleanup

Remove BTS bits from the hw-branch-tracer (renamed from bts-tracer) and
use the ds interface.

Signed-off-by: Markus Metzger <markut.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# f3f47a67 23-Nov-2008 Arjan van de Ven <arjan@infradead.org>

tracing: add "power-tracer": C/P state tracer to help power optimization

Impact: new "power-tracer" ftrace plugin

This patch adds a C/P-state ftrace plugin that will generate
detailed statistics about the C/P-states that are being used,
so that we can look at detailed decisions that the C/P-state
code is making, rather than the too high level "average"
that we have today.

An example way of using this is:

mount -t debugfs none /sys/kernel/debug
echo cstate > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_enabled
sleep 1
echo 0 > /sys/kernel/debug/tracing/tracing_enabled
cat /sys/kernel/debug/tracing/trace | perl scripts/trace/cstate.pl > out.svg

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# fb52607a 25-Nov-2008 Frederic Weisbecker <fweisbec@gmail.com>

tracing/function-return-tracer: change the name into function-graph-tracer

Impact: cleanup

This patch changes the name of the "return function tracer" into
function-graph-tracer which is a more suitable name for a tracing
which makes one able to retrieve the ordered call stack during
the code flow.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 1e9b51c2 25-Nov-2008 Markus Metzger <markus.t.metzger@intel.com>

x86, bts, ftrace: a BTS ftrace plug-in prototype

Impact: add new ftrace plugin

A prototype for a BTS ftrace plug-in.

The tracer collects branch trace in a cyclic buffer for each cpu.

The tracer is not configurable and the trace for each snapshot is
appended when doing cat /debug/tracing/trace.

This is a proof of concept that will be extended with future patches
to become a (hopefully) useful tool.

Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 94b80ffd 12-Nov-2008 Steven Rostedt <srostedt@redhat.com>

ftrace: rename trace_unlikely.c file

Impact: File name change of trace_unlikely.c

The "unlikely" name for the tracer is quite ugly. We renamed all the
parts of it to "branch" and now it is time to rename the file too.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 2ed84eeb 12-Nov-2008 Steven Rostedt <srostedt@redhat.com>

trace: rename unlikely profiler to branch profiler

Impact: name change of unlikely tracer and profiler

Ingo Molnar suggested changing the config from UNLIKELY_PROFILE
to BRANCH_PROFILING. I never did like the "unlikely" name so I
went one step farther, and renamed all the unlikely configurations
to a "BRANCH" variant.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 52f232cb 11-Nov-2008 Steven Rostedt <rostedt@goodmis.org>

tracing: likely/unlikely branch annotation tracer

Impact: new likely/unlikely branch tracer

This patch adds a way to record the instances of the likely() and unlikely()
branch condition annotations.

When "unlikely" is set in /debugfs/tracing/iter_ctrl the unlikely conditions
will be added to any of the ftrace tracers. The change takes effect when
a new tracer is passed into the current_tracer file.

For example:

bash-3471 [003] 357.014755: [INCORRECT] sched_info_dequeued:sched_stats.h:177
bash-3471 [003] 357.014756: [correct] update_curr:sched_fair.c:489
bash-3471 [003] 357.014758: [correct] calc_delta_fair:sched_fair.c:411
bash-3471 [003] 357.014759: [correct] account_group_exec_runtime:sched_stats.h:356
bash-3471 [003] 357.014761: [correct] update_curr:sched_fair.c:489
bash-3471 [003] 357.014763: [INCORRECT] calc_delta_fair:sched_fair.c:411
bash-3471 [003] 357.014765: [correct] calc_delta_mine:sched.c:1279

Which shows the normal tracer heading, as well as whether the condition was
correct "[correct]" or was mistaken "[INCORRECT]", followed by the function,
file name and line number.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 1f0d69a9 11-Nov-2008 Steven Rostedt <rostedt@goodmis.org>

tracing: profile likely and unlikely annotations

Impact: new unlikely/likely profiler

Andrew Morton recently suggested having an in-kernel way to profile
likely and unlikely macros. This patch achieves that goal.

When configured, every(*) likely and unlikely macro gets a counter attached
to it. When the condition is hit, the hit and misses of that condition
are recorded. These numbers can later be retrieved by:

/debugfs/tracing/profile_likely - All likely markers
/debugfs/tracing/profile_unlikely - All unlikely markers.

# cat /debug/tracing/profile_unlikely | head
correct incorrect % Function File Line
------- --------- - -------- ---- ----
2167 0 0 do_arch_prctl process_64.c 832
0 0 0 do_arch_prctl process_64.c 804
2670 0 0 IS_ERR err.h 34
71230 5693 7 __switch_to process_64.c 673
76919 0 0 __switch_to process_64.c 639
43184 33743 43 __switch_to process_64.c 624
12740 64181 83 __switch_to process_64.c 594
12740 64174 83 __switch_to process_64.c 590

# cat /debug/tracing/profile_unlikely | \
awk '{ if ($3 > 25) print $0; }' |head -20
44963 35259 43 __switch_to process_64.c 624
12762 67454 84 __switch_to process_64.c 594
12762 67447 84 __switch_to process_64.c 590
1478 595 28 syscall_get_error syscall.h 51
0 2821 100 syscall_trace_leave ptrace.c 1567
0 1 100 native_smp_prepare_cpus smpboot.c 1237
86338 265881 75 calc_delta_fair sched_fair.c 408
210410 108540 34 calc_delta_mine sched.c 1267
0 54550 100 sched_info_queued sched_stats.h 222
51899 66435 56 pick_next_task_fair sched_fair.c 1422
6 10 62 yield_task_fair sched_fair.c 982
7325 2692 26 rt_policy sched.c 144
0 1270 100 pre_schedule_rt sched_rt.c 1261
1268 48073 97 pick_next_task_rt sched_rt.c 884
0 45181 100 sched_info_dequeued sched_stats.h 177
0 15 100 sched_move_task sched.c 8700
0 15 100 sched_move_task sched.c 8690
53167 33217 38 schedule sched.c 4457
0 80208 100 sched_info_switch sched_stats.h 270
30585 49631 61 context_switch sched.c 2619

# cat /debug/tracing/profile_likely | awk '{ if ($3 > 25) print $0; }'
39900 36577 47 pick_next_task sched.c 4397
20824 15233 42 switch_mm mmu_context_64.h 18
0 7 100 __cancel_work_timer workqueue.c 560
617 66484 99 clocksource_adjust timekeeping.c 456
0 346340 100 audit_syscall_exit auditsc.c 1570
38 347350 99 audit_get_context auditsc.c 732
0 345244 100 audit_syscall_entry auditsc.c 1541
38 1017 96 audit_free auditsc.c 1446
0 1090 100 audit_alloc auditsc.c 862
2618 1090 29 audit_alloc auditsc.c 858
0 6 100 move_masked_irq migration.c 9
1 198 99 probe_sched_wakeup trace_sched_switch.c 58
2 2 50 probe_wakeup trace_sched_wakeup.c 227
0 2 100 probe_wakeup_sched_switch trace_sched_wakeup.c 144
4514 2090 31 __grab_cache_page filemap.c 2149
12882 228786 94 mapping_unevictable pagemap.h 50
4 11 73 __flush_cpu_slab slub.c 1466
627757 330451 34 slab_free slub.c 1731
2959 61245 95 dentry_lru_del_init dcache.c 153
946 1217 56 load_elf_binary binfmt_elf.c 904
102 82 44 disk_put_part genhd.h 206
1 1 50 dst_gc_task dst.c 82
0 19 100 tcp_mss_split_point tcp_output.c 1126

As you can see by the above, there's a bit of work to do in rethinking
the use of some unlikelys and likelys. Note: the unlikely case had 71 hits
that were more than 25%.

Note: After submitting my first version of this patch, Andrew Morton
showed me a version written by Daniel Walker, where I picked up
the following ideas from:

1) Using __builtin_constant_p to avoid profiling fixed values.
2) Using __FILE__ instead of instruction pointers.
3) Using the preprocessor to stop all profiling of likely
annotations from vsyscall_64.c.

Thanks to Andrew Morton, Arjan van de Ven, Theodore Tso and Ingo Molnar
for their feed back on this patch.

(*) Not ever unlikely is recorded, those that are used by vsyscalls
(a few of them) had to have profiling disabled.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 15e6cb36 10-Nov-2008 Frederic Weisbecker <fweisbec@gmail.com>

tracing: add a tracer to catch execution time of kernel functions

Impact: add new tracing plugin which can trace full (entry+exit) function calls

This tracer uses the low level function return ftrace plugin to
measure the execution time of the kernel functions.

The first field is the caller of the function, the second is the
measured function, and the last one is the execution time in
nanoseconds.

- v3:

- HAVE_FUNCTION_RET_TRACER have been added. Each arch that support ftrace return
should enable it.
- ftrace_return_stub becomes ftrace_stub.
- CONFIG_FUNCTION_RET_TRACER depends now on CONFIG_FUNCTION_TRACER
- Return traces printing can be used for other tracers on trace.c
- Adapt to the new tracing API (no more ctrl_update callback)
- Correct the check of "disabled" during insertion.
- Minor changes...

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 606576ce 06-Oct-2008 Steven Rostedt <rostedt@goodmis.org>

ftrace: rename FTRACE to FUNCTION_TRACER

Due to confusion between the ftrace infrastructure and the gcc profiling
tracer "ftrace", this patch renames the config options from FTRACE to
FUNCTION_TRACER. The other two names that are offspring from FTRACE
DYNAMIC_FTRACE and FTRACE_MCOUNT_RECORD will stay the same.

This patch was generated mostly by script, and partially by hand.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 7a8e76a3 29-Sep-2008 Steven Rostedt <rostedt@goodmis.org>

tracing: unified trace buffer

This is a unified tracing buffer that implements a ring buffer that
hopefully everyone will eventually be able to use.

The events recorded into the buffer have the following structure:

struct ring_buffer_event {
u32 type:2, len:3, time_delta:27;
u32 array[];
};

The minimum size of an event is 8 bytes. All events are 4 byte
aligned inside the buffer.

There are 4 types (all internal use for the ring buffer, only
the data type is exported to the interface users).

RINGBUF_TYPE_PADDING: this type is used to note extra space at the end
of a buffer page.

RINGBUF_TYPE_TIME_EXTENT: This type is used when the time between events
is greater than the 27 bit delta can hold. We add another
32 bits, and record that in its own event (8 byte size).

RINGBUF_TYPE_TIME_STAMP: (Not implemented yet). This will hold data to
help keep the buffer timestamps in sync.

RINGBUF_TYPE_DATA: The event actually holds user data.

The "len" field is only three bits. Since the data must be
4 byte aligned, this field is shifted left by 2, giving a
max length of 28 bytes. If the data load is greater than 28
bytes, the first array field holds the full length of the
data load and the len field is set to zero.

Example, data size of 7 bytes:

type = RINGBUF_TYPE_DATA
len = 2
time_delta: <time-stamp> - <prev_event-time-stamp>
array[0..1]: <7 bytes of data> <1 byte empty>

This event is saved in 12 bytes of the buffer.

An event with 82 bytes of data:

type = RINGBUF_TYPE_DATA
len = 0
time_delta: <time-stamp> - <prev_event-time-stamp>
array[0]: 84 (Note the alignment)
array[1..14]: <82 bytes of data> <2 bytes empty>

The above event is saved in 92 bytes (if my math is correct).
82 bytes of data, 2 bytes empty, 4 byte header, 4 byte length.

Do not reference the above event struct directly. Use the following
functions to gain access to the event table, since the
ring_buffer_event structure may change in the future.

ring_buffer_event_length(event): get the length of the event.
This is the size of the memory used to record this
event, and not the size of the data pay load.

ring_buffer_time_delta(event): get the time delta of the event
This returns the delta time stamp since the last event.
Note: Even though this is in the header, there should
be no reason to access this directly, accept
for debugging.

ring_buffer_event_data(event): get the data from the event
This is the function to use to get the actual data
from the event. Note, it is only a pointer to the
data inside the buffer. This data must be copied to
another location otherwise you risk it being written
over in the buffer.

ring_buffer_lock: A way to lock the entire buffer.
ring_buffer_unlock: unlock the buffer.

ring_buffer_alloc: create a new ring buffer. Can choose between
overwrite or consumer/producer mode. Overwrite will
overwrite old data, where as consumer producer will
throw away new data if the consumer catches up with the
producer. The consumer/producer is the default.

ring_buffer_free: free the ring buffer.

ring_buffer_resize: resize the buffer. Changes the size of each cpu
buffer. Note, it is up to the caller to provide that
the buffer is not being used while this is happening.
This requirement may go away but do not count on it.

ring_buffer_lock_reserve: locks the ring buffer and allocates an
entry on the buffer to write to.
ring_buffer_unlock_commit: unlocks the ring buffer and commits it to
the buffer.

ring_buffer_write: writes some data into the ring buffer.

ring_buffer_peek: Look at a next item in the cpu buffer.
ring_buffer_consume: get the next item in the cpu buffer and
consume it. That is, this function increments the head
pointer.

ring_buffer_read_start: Start an iterator of a cpu buffer.
For now, this disables the cpu buffer, until you issue
a finish. This is just because we do not want the iterator
to be overwritten. This restriction may change in the future.
But note, this is used for static reading of a buffer which
is usually done "after" a trace. Live readings would want
to use the ring_buffer_consume above, which will not
disable the ring buffer.

ring_buffer_read_finish: Finishes the read iterator and reenables
the ring buffer.

ring_buffer_iter_peek: Look at the next item in the cpu iterator.
ring_buffer_read: Read the iterator and increment it.
ring_buffer_iter_reset: Reset the iterator to point to the beginning
of the cpu buffer.
ring_buffer_iter_empty: Returns true if the iterator is at the end
of the cpu buffer.

ring_buffer_size: returns the size in bytes of each cpu buffer.
Note, the real size is this times the number of CPUs.

ring_buffer_reset_cpu: Sets the cpu buffer to empty
ring_buffer_reset: sets all cpu buffers to empty

ring_buffer_swap_cpu: swaps a cpu buffer from one buffer with a
cpu buffer of another buffer. This is handy when you
want to take a snap shot of a running trace on just one
cpu. Having a backup buffer, to swap with facilitates this.
Ftrace max latencies use this.

ring_buffer_empty: Returns true if the ring buffer is empty.
ring_buffer_empty_cpu: Returns true if the cpu buffer is empty.

ring_buffer_record_disable: disable all cpu buffers (read only)
ring_buffer_record_disable_cpu: disable a single cpu buffer (read only)
ring_buffer_record_enable: enable all cpu buffers.
ring_buffer_record_enabl_cpu: enable a single cpu buffer.

ring_buffer_entries: The number of entries in a ring buffer.
ring_buffer_overruns: The number of entries removed due to writing wrap.

ring_buffer_time_stamp: Get the time stamp used by the ring buffer
ring_buffer_normalize_time_stamp: normalize the ring buffer time stamp
into nanosecs.

I still need to implement the GTOD feature. But we need support from
the cpu frequency infrastructure. But this can be done at a later
time without affecting the ring buffer interface.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 1f5c2abb 23-Sep-2008 Frédéric Weisbecker <fweisbec@gmail.com>

tracing/ftrace: give an entry on the config for boot tracer

Bring the entry to choose the boot tracer on the kernel config.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# fb1b6d8b 19-Sep-2008 Steven Noonan <steven@uplinklabs.net>

ftrace: add nop tracer

A no-op tracer which can serve two purposes:

1. A template for development of a new tracer.
2. A convenient way to see ftrace_printk() calls without
an irrelevant trace making the output messy.

[ mingo@elte.hu: resolved conflicts ]
Signed-off-by: Steven Noonan <steven@uplinklabs.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# e5a81b62 27-Aug-2008 Steven Rostedt <rostedt@goodmis.org>

ftrace: add stack tracer

This is another tracer using the ftrace infrastructure, that examines
at each function call the size of the stack. If the stack use is greater
than the previous max it is recorded.

You can always see (and set) the max stack size seen. By setting it
to zero will start the recording again. The backtrace is also available.

For example:

# cat /debug/tracing/stack_max_size
1856

# cat /debug/tracing/stack_trace
[<c027764d>] stack_trace_call+0x8f/0x101
[<c021b966>] ftrace_call+0x5/0x8
[<c02553cc>] clocksource_get_next+0x12/0x48
[<c02542a5>] update_wall_time+0x538/0x6d1
[<c0245913>] do_timer+0x23/0xb0
[<c0257657>] tick_do_update_jiffies64+0xd9/0xf1
[<c02576b9>] tick_sched_timer+0x4a/0xad
[<c0250fe6>] __run_hrtimer+0x3e/0x75
[<c02518ed>] hrtimer_interrupt+0xf1/0x154
[<c022c870>] smp_apic_timer_interrupt+0x71/0x84
[<c021b7e9>] apic_timer_interrupt+0x2d/0x34
[<c0238597>] finish_task_switch+0x29/0xa0
[<c05abd13>] schedule+0x765/0x7be
[<c05abfca>] schedule_timeout+0x1b/0x90
[<c05ab4d4>] wait_for_common+0xab/0x101
[<c05ab5ac>] wait_for_completion+0x12/0x14
[<c033cfc3>] blk_execute_rq+0x84/0x99
[<c0402470>] scsi_execute+0xc2/0x105
[<c040250a>] scsi_execute_req+0x57/0x7f
[<c043afe0>] sr_test_unit_ready+0x3e/0x97
[<c043bbd6>] sr_media_change+0x43/0x205
[<c046b59f>] media_changed+0x48/0x77
[<c046b5ff>] cdrom_media_changed+0x31/0x37
[<c043b091>] sr_block_media_changed+0x16/0x18
[<c02b9e69>] check_disk_change+0x1b/0x63
[<c046f4c3>] cdrom_open+0x7a1/0x806
[<c043b148>] sr_block_open+0x78/0x8d
[<c02ba4c0>] do_open+0x90/0x257
[<c02ba869>] blkdev_open+0x2d/0x56
[<c0296a1f>] __dentry_open+0x14d/0x23c
[<c0296b32>] nameidata_to_filp+0x24/0x38
[<c02a1c68>] do_filp_open+0x347/0x626
[<c02967ef>] do_sys_open+0x47/0xbc
[<c02968b0>] sys_open+0x23/0x2b
[<c021aadd>] sysenter_do_call+0x12/0x26

I've tested this on both x86_64 and i386.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# f984b51e 12-May-2008 Pekka Paalanen <pq@iki.fi>

ftrace: add mmiotrace plugin

On Sat, 22 Mar 2008 13:07:47 +0100
Ingo Molnar <mingo@elte.hu> wrote:

> > > i'd suggest the following: pull x86.git and sched-devel.git into a
> > > single tree [the two will combine without rejects]. Then try to add a
> > > kernel/tracing/trace_mmiotrace.c ftrace plugin. The trace_sysprof.c
> > > plugin might be a good example.
> >
> > I did this and now I have mmiotrace enabled/disabled via the tracing
> > framework (what do we call this, since ftrace is one of the tracers?).
>
> cool! could you send the patches for that? (even if they are not fully
> functional yet)

Patch attached in the end. Nice to see how much code disappeared. I tried
to mark all the features I had to break with XXX-comments.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# f06c3810 12-May-2008 Ingo Molnar <mingo@elte.hu>

ftrace: add sysprof plugin

very first baby version.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# d05cdb25 12-May-2008 Steven Rostedt <rostedt@goodmis.org>

ftrace: fix dynamic ftrace selftest

With the adding of the configuration changes in the Makefile to prevent
tracing of functions in the ftrace code, all tracing of all the ftrace
code has been removed. Unfortunately, one of the selftests, relied on
a function to be traced. With the new change, the function was no longer
traced and the test failed.

This patch separates out the test function into its own file so that
we can add the "-pg" flag to the compilation of that function and the
adding of the mcount call to that function.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# b53dde9d 12-May-2008 Ingo Molnar <mingo@elte.hu>

ftrace: disable -pg for the tracer itself

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 6cd8a4bb 12-May-2008 Steven Rostedt <srostedt@redhat.com>

ftrace: trace preempt off critical timings

Add preempt off timings. A lot of kernel core code is taken from the RT patch
latency trace that was written by Ingo Molnar.

This adds "preemptoff" and "preemptirqsoff" to /debugfs/tracing/available_tracers

Now instead of just tracing irqs off, preemption off can be selected
to be recorded.

When this is selected, it shares the same files as irqs off timings.
One can either trace preemption off, irqs off, or one or the other off.

By echoing "preemptoff" into /debugfs/tracing/current_tracer, recording
of preempt off only is performed. "irqsoff" will only record the time
irqs are disabled, but "preemptirqsoff" will take the total time irqs
or preemption are disabled. Runtime switching of these options is now
supported by simpling echoing in the appropriate trace name into
/debugfs/tracing/current_tracer.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 81d68a96 12-May-2008 Steven Rostedt <srostedt@redhat.com>

ftrace: trace irq disabled critical timings

This patch adds latency tracing for critical timings
(how long interrupts are disabled for).

"irqsoff" is added to /debugfs/tracing/available_tracers

Note:
tracing_max_latency
also holds the max latency for irqsoff (in usecs).
(default to large number so one must start latency tracing)

tracing_thresh
threshold (in usecs) to always print out if irqs off
is detected to be longer than stated here.
If irq_thresh is non-zero, then max_irq_latency
is ignored.

Here's an example of a trace with ftrace_enabled = 0

=======
preemption latency trace v1.1.5 on 2.6.24-rc7
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--------------------------------------------------------------------
latency: 100 us, #3/3, CPU#1 | (M:rt VP:0, KP:0, SP:0 HP:0 #P:2)
-----------------
| task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0)
-----------------
=> started at: _spin_lock_irqsave+0x2a/0xb7
=> ended at: _spin_unlock_irqrestore+0x32/0x5f

_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | caller
\ / ||||| \ | /
swapper-0 1d.s3 0us+: _spin_lock_irqsave+0x2a/0xb7 (e1000_update_stats+0x47/0x64c [e1000])
swapper-0 1d.s3 100us : _spin_unlock_irqrestore+0x32/0x5f (e1000_update_stats+0x641/0x64c [e1000])
swapper-0 1d.s3 100us : trace_hardirqs_on_caller+0x75/0x89 (_spin_unlock_irqrestore+0x32/0x5f)

vim:ft=help
=======

And this is a trace with ftrace_enabled == 1

=======
preemption latency trace v1.1.5 on 2.6.24-rc7
--------------------------------------------------------------------
latency: 102 us, #12/12, CPU#1 | (M:rt VP:0, KP:0, SP:0 HP:0 #P:2)
-----------------
| task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0)
-----------------
=> started at: _spin_lock_irqsave+0x2a/0xb7
=> ended at: _spin_unlock_irqrestore+0x32/0x5f

_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | caller
\ / ||||| \ | /
swapper-0 1dNs3 0us+: _spin_lock_irqsave+0x2a/0xb7 (e1000_update_stats+0x47/0x64c [e1000])
swapper-0 1dNs3 46us : e1000_read_phy_reg+0x16/0x225 [e1000] (e1000_update_stats+0x5e2/0x64c [e1000])
swapper-0 1dNs3 46us : e1000_swfw_sync_acquire+0x10/0x99 [e1000] (e1000_read_phy_reg+0x49/0x225 [e1000])
swapper-0 1dNs3 46us : e1000_get_hw_eeprom_semaphore+0x12/0xa6 [e1000] (e1000_swfw_sync_acquire+0x36/0x99 [e1000])
swapper-0 1dNs3 47us : __const_udelay+0x9/0x47 (e1000_read_phy_reg+0x116/0x225 [e1000])
swapper-0 1dNs3 47us+: __delay+0x9/0x50 (__const_udelay+0x45/0x47)
swapper-0 1dNs3 97us : preempt_schedule+0xc/0x84 (__delay+0x4e/0x50)
swapper-0 1dNs3 98us : e1000_swfw_sync_release+0xc/0x55 [e1000] (e1000_read_phy_reg+0x211/0x225 [e1000])
swapper-0 1dNs3 99us+: e1000_put_hw_eeprom_semaphore+0x9/0x35 [e1000] (e1000_swfw_sync_release+0x50/0x55 [e1000])
swapper-0 1dNs3 101us : _spin_unlock_irqrestore+0xe/0x5f (e1000_update_stats+0x641/0x64c [e1000])
swapper-0 1dNs3 102us : _spin_unlock_irqrestore+0x32/0x5f (e1000_update_stats+0x641/0x64c [e1000])
swapper-0 1dNs3 102us : trace_hardirqs_on_caller+0x75/0x89 (_spin_unlock_irqrestore+0x32/0x5f)

vim:ft=help
=======

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 352ad25a 12-May-2008 Steven Rostedt <srostedt@redhat.com>

ftrace: tracer for scheduler wakeup latency

This patch adds the tracer that tracks the wakeup latency of the
highest priority waking task.

"wakeup" is added to /debugfs/tracing/available_tracers

Also added to /debugfs/tracing

tracing_max_latency
holds the current max latency for the wakeup

wakeup_thresh
if set to other than zero, a log will be recorded
for every wakeup that takes longer than the number
entered in here (usecs for all counters)
(deletes previous trace)

Examples:

(with ftrace_enabled = 0)

============
preemption latency trace v1.1.5 on 2.6.24-rc8
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--------------------------------------------------------------------
latency: 26 us, #2/2, CPU#1 | (M:rt VP:0, KP:0, SP:0 HP:0 #P:2)
-----------------
| task: migration/0-3 (uid:0 nice:-5 policy:1 rt_prio:99)
-----------------

_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | caller
\ / ||||| \ | /
quilt-8551 0d..3 0us+: wake_up_process+0x15/0x17 <ffffffff80233e80> (sched_exec+0xc9/0x100 <ffffffff80235343>)
quilt-8551 0d..4 26us : sched_switch_callback+0x73/0x81 <ffffffff80338d2f> (schedule+0x483/0x6d5 <ffffffff8048b3ee>)

vim:ft=help
============

(with ftrace_enabled = 1)

============
preemption latency trace v1.1.5 on 2.6.24-rc8
--------------------------------------------------------------------
latency: 36 us, #45/45, CPU#0 | (M:rt VP:0, KP:0, SP:0 HP:0 #P:2)
-----------------
| task: migration/1-5 (uid:0 nice:-5 policy:1 rt_prio:99)
-----------------

_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | caller
\ / ||||| \ | /
bash-10653 1d..3 0us : wake_up_process+0x15/0x17 <ffffffff80233e80> (sched_exec+0xc9/0x100 <ffffffff80235343>)
bash-10653 1d..3 1us : try_to_wake_up+0x271/0x2e7 <ffffffff80233dcf> (sub_preempt_count+0xc/0x7a <ffffffff8023309e>)
bash-10653 1d..2 2us : try_to_wake_up+0x296/0x2e7 <ffffffff80233df4> (update_rq_clock+0x9/0x20 <ffffffff802303f3>)
bash-10653 1d..2 2us : update_rq_clock+0x1e/0x20 <ffffffff80230408> (__update_rq_clock+0xc/0x90 <ffffffff80230366>)
bash-10653 1d..2 3us : __update_rq_clock+0x1b/0x90 <ffffffff80230375> (sched_clock+0x9/0x29 <ffffffff80214529>)
bash-10653 1d..2 4us : try_to_wake_up+0x2a6/0x2e7 <ffffffff80233e04> (activate_task+0xc/0x3f <ffffffff8022ffca>)
bash-10653 1d..2 4us : activate_task+0x2d/0x3f <ffffffff8022ffeb> (enqueue_task+0xe/0x66 <ffffffff8022ff66>)
bash-10653 1d..2 5us : enqueue_task+0x5b/0x66 <ffffffff8022ffb3> (enqueue_task_rt+0x9/0x3c <ffffffff80233351>)
bash-10653 1d..2 6us : try_to_wake_up+0x2ba/0x2e7 <ffffffff80233e18> (check_preempt_wakeup+0x12/0x99 <ffffffff80234f84>)
[...]
bash-10653 1d..5 33us : tracing_record_cmdline+0xcf/0xd4 <ffffffff80338aad> (_spin_unlock+0x9/0x33 <ffffffff8048d3ec>)
bash-10653 1d..5 34us : _spin_unlock+0x19/0x33 <ffffffff8048d3fc> (sub_preempt_count+0xc/0x7a <ffffffff8023309e>)
bash-10653 1d..4 35us : wakeup_sched_switch+0x65/0x2ff <ffffffff80339f66> (_spin_lock_irqsave+0xc/0xa9 <ffffffff8048d08b>)
bash-10653 1d..4 35us : _spin_lock_irqsave+0x19/0xa9 <ffffffff8048d098> (add_preempt_count+0xe/0x77 <ffffffff8023311a>)
bash-10653 1d..4 36us : sched_switch_callback+0x73/0x81 <ffffffff80338d2f> (schedule+0x483/0x6d5 <ffffffff8048b3ee>)

vim:ft=help
============

The [...] was added here to not waste your email box space.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 35e8e302 12-May-2008 Steven Rostedt <srostedt@redhat.com>

ftrace: add tracing of context switches

This patch adds context switch tracing, of the format of:

_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | pid:prio:state
\ / ||||| \ | /
swapper-0 1d..3 137us+: 0:140:R --> 2912:120
sshd-2912 1d..3 216us+: 2912:120:S --> 0:140
swapper-0 1d..3 261us+: 0:140:R --> 2912:120
bash-2920 0d..3 267us+: 2920:120:S --> 0:140
sshd-2912 1d..3 330us!: 2912:120:S --> 0:140
swapper-0 1d..3 2389us+: 0:140:R --> 2847:120
yum-upda-2847 1d..3 2411us!: 2847:120:S --> 0:140
swapper-0 0d..3 11089us+: 0:140:R --> 3139:120
gdm-bina-3139 0d..3 11113us!: 3139:120:S --> 0:140
swapper-0 1d..3 102328us+: 0:140:R --> 2847:120
yum-upda-2847 1d..3 102348us!: 2847:120:S --> 0:140

"sched_switch" is added to /debugfs/tracing/available_tracers

[ Eugene Teo <eugeneteo@kernel.sg: remove unused tracing_sched_switch_enabled ]

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 1b29b018 12-May-2008 Steven Rostedt <srostedt@redhat.com>

ftrace: function tracer

This is a simple trace that uses the ftrace infrastructure. It is
designed to be fast and small, and easy to use. It is useful to
record things that happen over a very short period of time, and
not to analyze the system in general.

Updates:

available_tracers
"function" is added to this file.

current_tracer
To enable the function tracer:

echo function > /debugfs/tracing/current_tracer

To disable the tracer:

echo disable > /debugfs/tracing/current_tracer

The output of the function_trace file is as follows

"echo noverbose > /debugfs/tracing/iter_ctrl"

preemption latency trace v1.1.5 on 2.6.24-rc7-tst
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--------------------------------------------------------------------
latency: 0 us, #419428/4361791, CPU#1 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4)
-----------------
| task: -0 (uid:0 nice:0 policy:0 rt_prio:0)
-----------------

_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | caller
\ / ||||| \ | /
swapper-0 0d.h. 1595128us+: set_normalized_timespec+0x8/0x2d <c043841d> (ktime_get_ts+0x4a/0x4e <c04499d4>)
swapper-0 0d.h. 1595131us+: _spin_lock+0x8/0x18 <c0630690> (hrtimer_interrupt+0x6e/0x1b0 <c0449c56>)

Or with verbose turned on:

"echo verbose > /debugfs/tracing/iter_ctrl"

preemption latency trace v1.1.5 on 2.6.24-rc7-tst
--------------------------------------------------------------------
latency: 0 us, #419428/4361791, CPU#1 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4)
-----------------
| task: -0 (uid:0 nice:0 policy:0 rt_prio:0)
-----------------

swapper 0 0 9 00000000 00000000 [f3675f41] 1595.128ms (+0.003ms): set_normalized_timespec+0x8/0x2d <c043841d> (ktime_get_ts+0x4a/0x4e <c04499d4>)
swapper 0 0 9 00000000 00000001 [f3675f45] 1595.131ms (+0.003ms): _spin_lock+0x8/0x18 <c0630690> (hrtimer_interrupt+0x6e/0x1b0 <c0449c56>)
swapper 0 0 9 00000000 00000002 [f3675f48] 1595.135ms (+0.003ms): _spin_lock+0x8/0x18 <c0630690> (hrtimer_interrupt+0x6e/0x1b0 <c0449c56>)

The "trace" file is not affected by the verbose mode, but is by the symonly.

echo "nosymonly" > /debugfs/tracing/iter_ctrl

tracer:
[ 81.479967] CPU 0: bash:3154 register_ftrace_function+0x5f/0x66 <ffffffff80337a4d> <-- _spin_unlock_irqrestore+0xe/0x5a <ffffffff8048cc8f>
[ 81.479967] CPU 0: bash:3154 _spin_unlock_irqrestore+0x3e/0x5a <ffffffff8048ccbf> <-- sub_preempt_count+0xc/0x7a <ffffffff80233d7b>
[ 81.479968] CPU 0: bash:3154 sub_preempt_count+0x30/0x7a <ffffffff80233d9f> <-- in_lock_functions+0x9/0x24 <ffffffff8025a75d>
[ 81.479968] CPU 0: bash:3154 vfs_write+0x11d/0x155 <ffffffff8029a043> <-- dnotify_parent+0x12/0x78 <ffffffff802d54fb>
[ 81.479968] CPU 0: bash:3154 dnotify_parent+0x2d/0x78 <ffffffff802d5516> <-- _spin_lock+0xe/0x70 <ffffffff8048c910>
[ 81.479969] CPU 0: bash:3154 _spin_lock+0x1b/0x70 <ffffffff8048c91d> <-- add_preempt_count+0xe/0x77 <ffffffff80233df7>
[ 81.479969] CPU 0: bash:3154 add_preempt_count+0x3e/0x77 <ffffffff80233e27> <-- in_lock_functions+0x9/0x24 <ffffffff8025a75d>

echo "symonly" > /debugfs/tracing/iter_ctrl

tracer:
[ 81.479913] CPU 0: bash:3154 register_ftrace_function+0x5f/0x66 <-- _spin_unlock_irqrestore+0xe/0x5a
[ 81.479913] CPU 0: bash:3154 _spin_unlock_irqrestore+0x3e/0x5a <-- sub_preempt_count+0xc/0x7a
[ 81.479913] CPU 0: bash:3154 sub_preempt_count+0x30/0x7a <-- in_lock_functions+0x9/0x24
[ 81.479914] CPU 0: bash:3154 vfs_write+0x11d/0x155 <-- dnotify_parent+0x12/0x78
[ 81.479914] CPU 0: bash:3154 dnotify_parent+0x2d/0x78 <-- _spin_lock+0xe/0x70
[ 81.479914] CPU 0: bash:3154 _spin_lock+0x1b/0x70 <-- add_preempt_count+0xe/0x77
[ 81.479914] CPU 0: bash:3154 add_preempt_count+0x3e/0x77 <-- in_lock_functions+0x9/0x24

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# bc0c38d1 12-May-2008 Steven Rostedt <srostedt@redhat.com>

ftrace: latency tracer infrastructure

This patch adds the latency tracer infrastructure. This patch
does not add anything that will select and turn it on, but will
be used by later patches.

If it were to be compiled, it would add the following files
to the debugfs:

The root tracing directory:

/debugfs/tracing/

This patch also adds the following files:

available_tracers
list of available tracers. Currently no tracers are
available. Looking into this file only shows
"none" which is used to unregister all tracers.

current_tracer
The trace that is currently active. Empty on start up.
To switch to a tracer simply echo one of the tracers that
are listed in available_tracers:

example: (used with later patches)

echo function > /debugfs/tracing/current_tracer

To disable the tracer:

echo disable > /debugfs/tracing/current_tracer

tracing_enabled
echoing "1" into this file starts the ftrace function tracing
(if sysctl kernel.ftrace_enabled=1)
echoing "0" turns it off.

latency_trace
This file is readonly and holds the result of the trace.

trace
This file outputs a easier to read version of the trace.

iter_ctrl
Controls the way the output of traces look.
So far there's two controls:
echoing in "symonly" will only show the kallsyms variables
without the addresses (if kallsyms was configured)
echoing in "verbose" will change the output to show
a lot more data, but not very easy to understand by
humans.
echoing in "nosymonly" turns off symonly.
echoing in "noverbose" turns off verbose.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 16444a8a 12-May-2008 Arnaldo Carvalho de Melo <acme@ghostprotocols.net>

ftrace: add basic support for gcc profiler instrumentation

If CONFIG_FTRACE is selected and /proc/sys/kernel/ftrace_enabled is
set to a non-zero value the ftrace routine will be called everytime
we enter a kernel function that is not marked with the "notrace"
attribute.

The ftrace routine will then call a registered function if a function
happens to be registered.

[ This code has been highly hacked by Steven Rostedt and Ingo Molnar,
so don't blame Arnaldo for all of this ;-) ]

Update:
It is now possible to register more than one ftrace function.
If only one ftrace function is registered, that will be the
function that ftrace calls directly. If more than one function
is registered, then ftrace will call a function that will loop
through the functions to call.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>