Searched hist:4 (Results 26 - 50 of 43670) sorted by relevance

/linux-master/scripts/dtc/include-prefixes/arm/qcom/
qcom-msm8926-samsung-matisselte.dts d305361f Thu Feb 15 11:02:01 MST 2024 Stefan Hansson <newbyte@postmarketos.org> ARM: dts: qcom: Add support for Samsung Galaxy Tab 4 10.1 LTE (SM-T535)

Add a device tree for the Samsung Galaxy Tab 4 10.1 (SM-T535) LTE tablet
based on the MSM8926 platform.

The common dtsi is also modified to describe the widest constraints,
which required modifications to the matisse-wifi dts.

Signed-off-by: Stefan Hansson <newbyte@postmarketos.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20240215180322.99089-4-newbyte@postmarketos.org
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
/linux-master/tools/testing/selftests/bpf/progs/
for_each_multi_maps.c fecb1597 Thu Apr 04 20:55:36 MDT 2024 Philo Lu <lulie@linux.alibaba.com> selftests/bpf: add test for bpf_for_each_map_elem() with different maps

A test is added for bpf_for_each_map_elem() with either an arraymap or a
hashmap.
$ tools/testing/selftests/bpf/test_progs -t for_each
#93/1 for_each/hash_map:OK
#93/2 for_each/array_map:OK
#93/3 for_each/write_map_key:OK
#93/4 for_each/multi_maps:OK
#93 for_each:OK
Summary: 1/4 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240405025536.18113-4-lulie@linux.alibaba.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
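
For orientation, a minimal sketch (in C, as a BPF program) of how bpf_for_each_map_elem() walks a map; this is not the selftest itself, and the map name, callback, and section name are illustrative assumptions.

/* Sketch of bpf_for_each_map_elem() usage; names are illustrative, not
 * taken from for_each_multi_maps.c. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 3);
    __type(key, __u32);
    __type(value, __u64);
} arraymap SEC(".maps");

/* Called once per element; returning 0 continues the walk, 1 stops it. */
static __u64 sum_elem(struct bpf_map *map, __u32 *key, __u64 *val, __u64 *sum)
{
    *sum += *val;
    return 0;
}

SEC("tc")
int sum_array(struct __sk_buff *skb)
{
    __u64 sum = 0;

    /* Visit every element of arraymap, accumulating values into sum. */
    bpf_for_each_map_elem(&arraymap, sum_elem, &sum, 0);
    return 0;
}

char _license[] SEC("license") = "GPL";
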
verifier_arena_large.c a90c5845 Thu Mar 14 20:18:34 MDT 2024 Alexei Starovoitov <ast@kernel.org> selftests/bpf: Add arena test case for 4Gbyte corner case

Check that a 4Gbyte arena can be allocated and that overflow/underflow
accesses in the first and the last page behave as expected.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20240315021834.62988-5-alexei.starovoitov@gmail.com
/linux-master/drivers/infiniband/core/
trace.c 3e5901cb Wed Dec 18 13:18:15 MST 2019 Chuck Lever <chuck.lever@oracle.com> RDMA/core: Trace points for diagnosing completion queue issues

Sample trace events:

kworker/u29:0-300 [007] 120.042217: cq_alloc: cq.id=4 nr_cqe=161 comp_vector=2 poll_ctx=WORKQUEUE
<idle>-0 [002] 120.056292: cq_schedule: cq.id=4
kworker/2:1H-482 [002] 120.056402: cq_process: cq.id=4 wake-up took 109 [us] from interrupt
kworker/2:1H-482 [002] 120.056407: cq_poll: cq.id=4 requested 16, returned 1
<idle>-0 [002] 120.067503: cq_schedule: cq.id=4
kworker/2:1H-482 [002] 120.067537: cq_process: cq.id=4 wake-up took 34 [us] from interrupt
kworker/2:1H-482 [002] 120.067541: cq_poll: cq.id=4 requested 16, returned 1
<idle>-0 [002] 120.067657: cq_schedule: cq.id=4
kworker/2:1H-482 [002] 120.067672: cq_process: cq.id=4 wake-up took 15 [us] from interrupt
kworker/2:1H-482 [002] 120.067674: cq_poll: cq.id=4 requested 16, returned 1

...

systemd-1 [002] 122.392653: cq_schedule: cq.id=4
kworker/2:1H-482 [002] 122.392688: cq_process: cq.id=4 wake-up took 35 [us] from interrupt
kworker/2:1H-482 [002] 122.392693: cq_poll: cq.id=4 requested 16, returned 16
kworker/2:1H-482 [002] 122.392836: cq_poll: cq.id=4 requested 16, returned 16
kworker/2:1H-482 [002] 122.392970: cq_poll: cq.id=4 requested 16, returned 16
kworker/2:1H-482 [002] 122.393083: cq_poll: cq.id=4 requested 16, returned 16
kworker/2:1H-482 [002] 122.393195: cq_poll: cq.id=4 requested 16, returned 3

Several features to note in this output:
- The WCE count and context type are reported at allocation time
- The CPU and kworker for each CQ are evident
- The CQ's restracker ID is tagged on each trace event
- CQ poll scheduling latency is measured
- Details about how often single completions occur versus multiple
completions are evident
- The cost of the ULP's completion handler is recorded

Link: https://lore.kernel.org/r/20191218201815.30584.3481.stgit@manet.1015granger.net
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
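
As an aside (not part of the patch), the sample output above can be summarized mechanically; the short userspace C sketch below reads such trace text from stdin and averages the "wake-up took N [us]" latency and the completions returned per cq_poll. The field wording is assumed from the sample lines only.

/* Illustrative post-processing of trace lines in the format shown above. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[512];
    long wake_us = 0, wake_events = 0;
    long completions = 0, polls = 0;

    while (fgets(line, sizeof(line), stdin)) {
        char *p;
        long us, req, ret;

        if ((p = strstr(line, "wake-up took ")) &&
            sscanf(p, "wake-up took %ld", &us) == 1) {
            wake_us += us;
            wake_events++;
        } else if ((p = strstr(line, "requested ")) &&
                   sscanf(p, "requested %ld, returned %ld", &req, &ret) == 2) {
            completions += ret;
            polls++;
        }
    }
    if (wake_events)
        printf("avg wake-up latency: %ld us over %ld events\n",
               wake_us / wake_events, wake_events);
    if (polls)
        printf("avg completions per cq_poll: %ld over %ld polls\n",
               completions / polls, polls);
    return 0;
}
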
/linux-master/Documentation/userspace-api/media/v4l/
pixfmt-packed-yuv.rst 99c95496 Tue Mar 21 23:13:07 MDT 2023 Ming Qian <ming.qian@nxp.com> media: Add YUV48_12 video format

YUV48_12 is a YUV format with 12 bits per component, like YUV24 expanded to
16 bits. The data sits in the 12 high bits, with zeros in the 4 low bits,
arranged in little-endian order.

[hverkuil: replaced a . by ,]

Signed-off-by: Ming Qian <ming.qian@nxp.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
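
A tiny C sketch of the layout described above, under the stated assumption that each 12-bit component occupies the high 12 bits of a 16-bit little-endian word, with the low 4 bits zero:

/* Pack/unpack one 12-bit component: value in the 12 high bits of a 16-bit
 * little-endian word, zeros in the 4 low bits. */
#include <stdint.h>
#include <stdio.h>

static uint16_t pack12(uint16_t v12)   /* v12 in [0, 4095] */
{
    return (uint16_t)(v12 << 4);
}

static uint16_t unpack12(uint16_t word)
{
    return word >> 4;
}

int main(void)
{
    uint16_t y = pack12(0x0ABC);
    /* Little-endian byte order in the buffer: low byte first. */
    uint8_t buf[2] = { (uint8_t)(y & 0xff), (uint8_t)(y >> 8) };

    printf("stored bytes: %02x %02x, recovered: 0x%03x\n",
           buf[0], buf[1], unpack12((uint16_t)(buf[0] | (buf[1] << 8))));
    return 0;
}
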
00f6842e Mon Mar 07 09:32:00 MST 2022 Laurent Pinchart <laurent.pinchart@ideasonboard.com> media: v4l: Add packed YUV 4:4:4 YUVA and YUVX pixel formats

The new YUVA and YUVX are permutations of the existing AYUV and XYUV
formats. They are used by the NXP i.MX8 ISI hardware.

Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Reviewed-by: Nicolas Dufresne <nicolas.dufresne@collabora.com>
Reviewed-by: Jacopo Mondi <jacopo@jmondi.org>
Reviewed-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
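
To illustrate only the permutation relationship, a small C sketch that reorders one packed 4:4:4 sample from an alpha-first layout to a Y-first layout; the concrete byte orders used here (A,Y,U,V and Y,U,V,A) are assumptions for the example, not taken from the patch:

/* Byte permutation between two packed 4:4:4 layouts; assumed orders:
 * input {A, Y, U, V}, output {Y, U, V, A}. */
#include <stdint.h>
#include <stdio.h>

static void ayuv_to_yuva(const uint8_t in[4], uint8_t out[4])
{
    out[0] = in[1]; /* Y */
    out[1] = in[2]; /* U */
    out[2] = in[3]; /* V */
    out[3] = in[0]; /* A */
}

int main(void)
{
    uint8_t ayuv[4] = { 0xff, 0x10, 0x80, 0x80 };
    uint8_t yuva[4];

    ayuv_to_yuva(ayuv, yuva);
    printf("%02x %02x %02x %02x\n", yuva[0], yuva[1], yuva[2], yuva[3]);
    return 0;
}
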
0376a51f Wed Mar 10 17:28:47 MST 2021 Mirela Rabulea <mirela.rabulea@nxp.com> media: v4l: Add packed YUV444 24bpp pixel format

The added format is V4L2_PIX_FMT_YUV24, a packed YUV 4:4:4 format with
8 bits for each component, 24 bits per sample.

This format is used by the i.MX 8QuadMax and i.MX 8DualXPlus/8QuadXPlus
JPEG encoder/decoder.

Signed-off-by: Mirela Rabulea <mirela.rabulea@nxp.com>
Reviewed-by: Paul Kocialkowski <paul.kocialkowski@bootlin.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4bfc1688 Sun Dec 06 16:03:13 MST 2020 Laurent Pinchart <laurent.pinchart@ideasonboard.com> media: doc: pixfmt-packed-yuv: Clarify naming scheme for 4:4:4 formats

Document the naming scheme for the existing packed YUV 4:4:4 formats, as
previously done for the RGB formats.

Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
/linux-master/drivers/mmc/host/
cqhci-core.c 92b18252 Tue Oct 26 01:08:12 MDT 2021 Wenbin Mei <wenbin.mei@mediatek.com> mmc: cqhci: clear HALT state after CQE enable

When mmc0 enters the suspend state, we have to halt CQE in order to send
legacy commands (flush cache) and then disable CQE; on resume we re-enable
CQE but do not clear the HALT state.
In this case the MediaTek mmc host controller keeps the HALT state across
the CQE disable/enable flow, so the next CQE transfer after resume times out
because CQE is still in the HALT state. The log is shown below:
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: timeout for tag 2
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: ============ CQHCI REGISTER DUMP ===========
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Caps: 0x100020b6 | Version: 0x00000510
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Config: 0x00001103 | Control: 0x00000001
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int stat: 0x00000000 | Int enab: 0x00000006
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int sig: 0x00000006 | Int Coal: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: TDL base: 0xfd05f000 | TDL up32: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Doorbell: 0x8000203c | TCN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Dev queue: 0x00000000 | Dev Pend: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Task clr: 0x00000000 | SSC1: 0x00001000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: SSC2: 0x00000001 | DCMD rsp: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: RED mask: 0xfdf9a080 | TERRI: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Resp idx: 0x00000000 | Resp arg: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQP: 0x00000000 | CRNQDUN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQIS: 0x00000000 | CRNQIE: 0x00000000

This change checks the HALT state after CQE enable; if CQE is in the HALT
state, we clear it.

Signed-off-by: Wenbin Mei <wenbin.mei@mediatek.com>
Cc: stable@vger.kernel.org
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Fixes: a4080225f51d ("mmc: cqhci: support for command queue enabled host")
Link: https://lore.kernel.org/r/20211026070812.9359-1-wenbin.mei@mediatek.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
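
A minimal C sketch of the check-and-clear logic described above, modeled with a mock control register rather than the real CQHCI registers; the bit name and accessors are stand-ins, not the driver's own definitions:

/* Model of "check HALT after CQE enable, clear it if set". */
#include <stdint.h>
#include <stdio.h>

#define MOCK_HALT (1u << 0)           /* stand-in for the HALT control bit */

static uint32_t mock_ctl;             /* stand-in for the CQE control register */

static uint32_t ctl_read(void)        { return mock_ctl; }
static void     ctl_write(uint32_t v) { mock_ctl = v; }

static void cqe_enable(void)
{
    /* ...normal enable programming would happen here... */

    /* The fix described above: if the controller kept HALT set across the
     * disable/enable flow, clear it so the next transfer does not time out. */
    if (ctl_read() & MOCK_HALT)
        ctl_write(ctl_read() & ~MOCK_HALT);
}

int main(void)
{
    mock_ctl = MOCK_HALT;             /* pretend HALT survived suspend/resume */
    cqe_enable();
    printf("HALT after enable: %u\n", (unsigned)(ctl_read() & MOCK_HALT));
    return 0;
}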
diff 92b18252 Tue Oct 26 01:08:12 MDT 2021 Wenbin Mei <wenbin.mei@mediatek.com> mmc: cqhci: clear HALT state after CQE enable

While mmc0 enter suspend state, we need halt CQE to send legacy cmd(flush
cache) and disable cqe, for resume back, we enable CQE and not clear HALT
state.
In this case MediaTek mmc host controller will keep the value for HALT
state after CQE disable/enable flow, so the next CQE transfer after resume
will be timeout due to CQE is in HALT state, the log as below:
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: timeout for tag 2
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: ============ CQHCI REGISTER DUMP ===========
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Caps: 0x100020b6 | Version: 0x00000510
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Config: 0x00001103 | Control: 0x00000001
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int stat: 0x00000000 | Int enab: 0x00000006
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int sig: 0x00000006 | Int Coal: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: TDL base: 0xfd05f000 | TDL up32: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Doorbell: 0x8000203c | TCN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Dev queue: 0x00000000 | Dev Pend: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Task clr: 0x00000000 | SSC1: 0x00001000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: SSC2: 0x00000001 | DCMD rsp: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: RED mask: 0xfdf9a080 | TERRI: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Resp idx: 0x00000000 | Resp arg: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQP: 0x00000000 | CRNQDUN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQIS: 0x00000000 | CRNQIE: 0x00000000

This change check HALT state after CQE enable, if CQE is in HALT state, we
will clear it.

Signed-off-by: Wenbin Mei <wenbin.mei@mediatek.com>
Cc: stable@vger.kernel.org
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Fixes: a4080225f51d ("mmc: cqhci: support for command queue enabled host")
Link: https://lore.kernel.org/r/20211026070812.9359-1-wenbin.mei@mediatek.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
diff 92b18252 Tue Oct 26 01:08:12 MDT 2021 Wenbin Mei <wenbin.mei@mediatek.com> mmc: cqhci: clear HALT state after CQE enable

While mmc0 enter suspend state, we need halt CQE to send legacy cmd(flush
cache) and disable cqe, for resume back, we enable CQE and not clear HALT
state.
In this case MediaTek mmc host controller will keep the value for HALT
state after CQE disable/enable flow, so the next CQE transfer after resume
will be timeout due to CQE is in HALT state, the log as below:
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: timeout for tag 2
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: ============ CQHCI REGISTER DUMP ===========
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Caps: 0x100020b6 | Version: 0x00000510
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Config: 0x00001103 | Control: 0x00000001
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int stat: 0x00000000 | Int enab: 0x00000006
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int sig: 0x00000006 | Int Coal: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: TDL base: 0xfd05f000 | TDL up32: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Doorbell: 0x8000203c | TCN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Dev queue: 0x00000000 | Dev Pend: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Task clr: 0x00000000 | SSC1: 0x00001000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: SSC2: 0x00000001 | DCMD rsp: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: RED mask: 0xfdf9a080 | TERRI: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Resp idx: 0x00000000 | Resp arg: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQP: 0x00000000 | CRNQDUN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQIS: 0x00000000 | CRNQIE: 0x00000000

This change check HALT state after CQE enable, if CQE is in HALT state, we
will clear it.

Signed-off-by: Wenbin Mei <wenbin.mei@mediatek.com>
Cc: stable@vger.kernel.org
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Fixes: a4080225f51d ("mmc: cqhci: support for command queue enabled host")
Link: https://lore.kernel.org/r/20211026070812.9359-1-wenbin.mei@mediatek.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
diff 92b18252 Tue Oct 26 01:08:12 MDT 2021 Wenbin Mei <wenbin.mei@mediatek.com> mmc: cqhci: clear HALT state after CQE enable

While mmc0 enter suspend state, we need halt CQE to send legacy cmd(flush
cache) and disable cqe, for resume back, we enable CQE and not clear HALT
state.
In this case MediaTek mmc host controller will keep the value for HALT
state after CQE disable/enable flow, so the next CQE transfer after resume
will be timeout due to CQE is in HALT state, the log as below:
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: timeout for tag 2
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: ============ CQHCI REGISTER DUMP ===========
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Caps: 0x100020b6 | Version: 0x00000510
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Config: 0x00001103 | Control: 0x00000001
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int stat: 0x00000000 | Int enab: 0x00000006
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int sig: 0x00000006 | Int Coal: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: TDL base: 0xfd05f000 | TDL up32: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Doorbell: 0x8000203c | TCN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Dev queue: 0x00000000 | Dev Pend: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Task clr: 0x00000000 | SSC1: 0x00001000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: SSC2: 0x00000001 | DCMD rsp: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: RED mask: 0xfdf9a080 | TERRI: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Resp idx: 0x00000000 | Resp arg: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQP: 0x00000000 | CRNQDUN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQIS: 0x00000000 | CRNQIE: 0x00000000

This change check HALT state after CQE enable, if CQE is in HALT state, we
will clear it.

Signed-off-by: Wenbin Mei <wenbin.mei@mediatek.com>
Cc: stable@vger.kernel.org
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Fixes: a4080225f51d ("mmc: cqhci: support for command queue enabled host")
Link: https://lore.kernel.org/r/20211026070812.9359-1-wenbin.mei@mediatek.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
diff 92b18252 Tue Oct 26 01:08:12 MDT 2021 Wenbin Mei <wenbin.mei@mediatek.com> mmc: cqhci: clear HALT state after CQE enable

While mmc0 enter suspend state, we need halt CQE to send legacy cmd(flush
cache) and disable cqe, for resume back, we enable CQE and not clear HALT
state.
In this case MediaTek mmc host controller will keep the value for HALT
state after CQE disable/enable flow, so the next CQE transfer after resume
will be timeout due to CQE is in HALT state, the log as below:
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: timeout for tag 2
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: ============ CQHCI REGISTER DUMP ===========
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Caps: 0x100020b6 | Version: 0x00000510
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Config: 0x00001103 | Control: 0x00000001
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int stat: 0x00000000 | Int enab: 0x00000006
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int sig: 0x00000006 | Int Coal: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: TDL base: 0xfd05f000 | TDL up32: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Doorbell: 0x8000203c | TCN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Dev queue: 0x00000000 | Dev Pend: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Task clr: 0x00000000 | SSC1: 0x00001000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: SSC2: 0x00000001 | DCMD rsp: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: RED mask: 0xfdf9a080 | TERRI: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Resp idx: 0x00000000 | Resp arg: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQP: 0x00000000 | CRNQDUN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQIS: 0x00000000 | CRNQIE: 0x00000000

This change check HALT state after CQE enable, if CQE is in HALT state, we
will clear it.

Signed-off-by: Wenbin Mei <wenbin.mei@mediatek.com>
Cc: stable@vger.kernel.org
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Fixes: a4080225f51d ("mmc: cqhci: support for command queue enabled host")
Link: https://lore.kernel.org/r/20211026070812.9359-1-wenbin.mei@mediatek.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
diff 92b18252 Tue Oct 26 01:08:12 MDT 2021 Wenbin Mei <wenbin.mei@mediatek.com> mmc: cqhci: clear HALT state after CQE enable

While mmc0 enter suspend state, we need halt CQE to send legacy cmd(flush
cache) and disable cqe, for resume back, we enable CQE and not clear HALT
state.
In this case MediaTek mmc host controller will keep the value for HALT
state after CQE disable/enable flow, so the next CQE transfer after resume
will be timeout due to CQE is in HALT state, the log as below:
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: timeout for tag 2
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: ============ CQHCI REGISTER DUMP ===========
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Caps: 0x100020b6 | Version: 0x00000510
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Config: 0x00001103 | Control: 0x00000001
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int stat: 0x00000000 | Int enab: 0x00000006
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int sig: 0x00000006 | Int Coal: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: TDL base: 0xfd05f000 | TDL up32: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Doorbell: 0x8000203c | TCN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Dev queue: 0x00000000 | Dev Pend: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Task clr: 0x00000000 | SSC1: 0x00001000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: SSC2: 0x00000001 | DCMD rsp: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: RED mask: 0xfdf9a080 | TERRI: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Resp idx: 0x00000000 | Resp arg: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQP: 0x00000000 | CRNQDUN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQIS: 0x00000000 | CRNQIE: 0x00000000

This change check HALT state after CQE enable, if CQE is in HALT state, we
will clear it.

Signed-off-by: Wenbin Mei <wenbin.mei@mediatek.com>
Cc: stable@vger.kernel.org
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Fixes: a4080225f51d ("mmc: cqhci: support for command queue enabled host")
Link: https://lore.kernel.org/r/20211026070812.9359-1-wenbin.mei@mediatek.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
diff 92b18252 Tue Oct 26 01:08:12 MDT 2021 Wenbin Mei <wenbin.mei@mediatek.com> mmc: cqhci: clear HALT state after CQE enable

While mmc0 enter suspend state, we need halt CQE to send legacy cmd(flush
cache) and disable cqe, for resume back, we enable CQE and not clear HALT
state.
In this case MediaTek mmc host controller will keep the value for HALT
state after CQE disable/enable flow, so the next CQE transfer after resume
will be timeout due to CQE is in HALT state, the log as below:
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: timeout for tag 2
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: ============ CQHCI REGISTER DUMP ===========
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Caps: 0x100020b6 | Version: 0x00000510
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Config: 0x00001103 | Control: 0x00000001
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int stat: 0x00000000 | Int enab: 0x00000006
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int sig: 0x00000006 | Int Coal: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: TDL base: 0xfd05f000 | TDL up32: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Doorbell: 0x8000203c | TCN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Dev queue: 0x00000000 | Dev Pend: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Task clr: 0x00000000 | SSC1: 0x00001000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: SSC2: 0x00000001 | DCMD rsp: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: RED mask: 0xfdf9a080 | TERRI: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Resp idx: 0x00000000 | Resp arg: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQP: 0x00000000 | CRNQDUN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQIS: 0x00000000 | CRNQIE: 0x00000000

This change check HALT state after CQE enable, if CQE is in HALT state, we
will clear it.

Signed-off-by: Wenbin Mei <wenbin.mei@mediatek.com>
Cc: stable@vger.kernel.org
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Fixes: a4080225f51d ("mmc: cqhci: support for command queue enabled host")
Link: https://lore.kernel.org/r/20211026070812.9359-1-wenbin.mei@mediatek.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
diff 92b18252 Tue Oct 26 01:08:12 MDT 2021 Wenbin Mei <wenbin.mei@mediatek.com> mmc: cqhci: clear HALT state after CQE enable

While mmc0 enter suspend state, we need halt CQE to send legacy cmd(flush
cache) and disable cqe, for resume back, we enable CQE and not clear HALT
state.
In this case MediaTek mmc host controller will keep the value for HALT
state after CQE disable/enable flow, so the next CQE transfer after resume
will be timeout due to CQE is in HALT state, the log as below:
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: timeout for tag 2
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: ============ CQHCI REGISTER DUMP ===========
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Caps: 0x100020b6 | Version: 0x00000510
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Config: 0x00001103 | Control: 0x00000001
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int stat: 0x00000000 | Int enab: 0x00000006
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int sig: 0x00000006 | Int Coal: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: TDL base: 0xfd05f000 | TDL up32: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Doorbell: 0x8000203c | TCN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Dev queue: 0x00000000 | Dev Pend: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Task clr: 0x00000000 | SSC1: 0x00001000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: SSC2: 0x00000001 | DCMD rsp: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: RED mask: 0xfdf9a080 | TERRI: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Resp idx: 0x00000000 | Resp arg: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQP: 0x00000000 | CRNQDUN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQIS: 0x00000000 | CRNQIE: 0x00000000

This change check HALT state after CQE enable, if CQE is in HALT state, we
will clear it.

Signed-off-by: Wenbin Mei <wenbin.mei@mediatek.com>
Cc: stable@vger.kernel.org
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Fixes: a4080225f51d ("mmc: cqhci: support for command queue enabled host")
Link: https://lore.kernel.org/r/20211026070812.9359-1-wenbin.mei@mediatek.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
diff 92b18252 Tue Oct 26 01:08:12 MDT 2021 Wenbin Mei <wenbin.mei@mediatek.com> mmc: cqhci: clear HALT state after CQE enable

While mmc0 enter suspend state, we need halt CQE to send legacy cmd(flush
cache) and disable cqe, for resume back, we enable CQE and not clear HALT
state.
In this case MediaTek mmc host controller will keep the value for HALT
state after CQE disable/enable flow, so the next CQE transfer after resume
will be timeout due to CQE is in HALT state, the log as below:
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: timeout for tag 2
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: ============ CQHCI REGISTER DUMP ===========
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Caps: 0x100020b6 | Version: 0x00000510
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Config: 0x00001103 | Control: 0x00000001
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int stat: 0x00000000 | Int enab: 0x00000006
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int sig: 0x00000006 | Int Coal: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: TDL base: 0xfd05f000 | TDL up32: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Doorbell: 0x8000203c | TCN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Dev queue: 0x00000000 | Dev Pend: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Task clr: 0x00000000 | SSC1: 0x00001000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: SSC2: 0x00000001 | DCMD rsp: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: RED mask: 0xfdf9a080 | TERRI: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Resp idx: 0x00000000 | Resp arg: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQP: 0x00000000 | CRNQDUN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQIS: 0x00000000 | CRNQIE: 0x00000000

This change check HALT state after CQE enable, if CQE is in HALT state, we
will clear it.

Signed-off-by: Wenbin Mei <wenbin.mei@mediatek.com>
Cc: stable@vger.kernel.org
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Fixes: a4080225f51d ("mmc: cqhci: support for command queue enabled host")
Link: https://lore.kernel.org/r/20211026070812.9359-1-wenbin.mei@mediatek.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
diff 92b18252 Tue Oct 26 01:08:12 MDT 2021 Wenbin Mei <wenbin.mei@mediatek.com> mmc: cqhci: clear HALT state after CQE enable

While mmc0 enter suspend state, we need halt CQE to send legacy cmd(flush
cache) and disable cqe, for resume back, we enable CQE and not clear HALT
state.
In this case MediaTek mmc host controller will keep the value for HALT
state after CQE disable/enable flow, so the next CQE transfer after resume
will be timeout due to CQE is in HALT state, the log as below:
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: timeout for tag 2
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: ============ CQHCI REGISTER DUMP ===========
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Caps: 0x100020b6 | Version: 0x00000510
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Config: 0x00001103 | Control: 0x00000001
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int stat: 0x00000000 | Int enab: 0x00000006
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int sig: 0x00000006 | Int Coal: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: TDL base: 0xfd05f000 | TDL up32: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Doorbell: 0x8000203c | TCN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Dev queue: 0x00000000 | Dev Pend: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Task clr: 0x00000000 | SSC1: 0x00001000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: SSC2: 0x00000001 | DCMD rsp: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: RED mask: 0xfdf9a080 | TERRI: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Resp idx: 0x00000000 | Resp arg: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQP: 0x00000000 | CRNQDUN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQIS: 0x00000000 | CRNQIE: 0x00000000

This change check HALT state after CQE enable, if CQE is in HALT state, we
will clear it.

Signed-off-by: Wenbin Mei <wenbin.mei@mediatek.com>
Cc: stable@vger.kernel.org
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Fixes: a4080225f51d ("mmc: cqhci: support for command queue enabled host")
Link: https://lore.kernel.org/r/20211026070812.9359-1-wenbin.mei@mediatek.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
diff 92b18252 Tue Oct 26 01:08:12 MDT 2021 Wenbin Mei <wenbin.mei@mediatek.com> mmc: cqhci: clear HALT state after CQE enable

While mmc0 enter suspend state, we need halt CQE to send legacy cmd(flush
cache) and disable cqe, for resume back, we enable CQE and not clear HALT
state.
In this case MediaTek mmc host controller will keep the value for HALT
state after CQE disable/enable flow, so the next CQE transfer after resume
will be timeout due to CQE is in HALT state, the log as below:
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: timeout for tag 2
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: ============ CQHCI REGISTER DUMP ===========
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Caps: 0x100020b6 | Version: 0x00000510
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Config: 0x00001103 | Control: 0x00000001
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int stat: 0x00000000 | Int enab: 0x00000006
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int sig: 0x00000006 | Int Coal: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: TDL base: 0xfd05f000 | TDL up32: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Doorbell: 0x8000203c | TCN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Dev queue: 0x00000000 | Dev Pend: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Task clr: 0x00000000 | SSC1: 0x00001000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: SSC2: 0x00000001 | DCMD rsp: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: RED mask: 0xfdf9a080 | TERRI: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Resp idx: 0x00000000 | Resp arg: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQP: 0x00000000 | CRNQDUN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQIS: 0x00000000 | CRNQIE: 0x00000000

This change check HALT state after CQE enable, if CQE is in HALT state, we
will clear it.

Signed-off-by: Wenbin Mei <wenbin.mei@mediatek.com>
Cc: stable@vger.kernel.org
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Fixes: a4080225f51d ("mmc: cqhci: support for command queue enabled host")
Link: https://lore.kernel.org/r/20211026070812.9359-1-wenbin.mei@mediatek.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
diff 92b18252 Tue Oct 26 01:08:12 MDT 2021 Wenbin Mei <wenbin.mei@mediatek.com> mmc: cqhci: clear HALT state after CQE enable

While mmc0 enter suspend state, we need halt CQE to send legacy cmd(flush
cache) and disable cqe, for resume back, we enable CQE and not clear HALT
state.
In this case MediaTek mmc host controller will keep the value for HALT
state after CQE disable/enable flow, so the next CQE transfer after resume
will be timeout due to CQE is in HALT state, the log as below:
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: timeout for tag 2
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: ============ CQHCI REGISTER DUMP ===========
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Caps: 0x100020b6 | Version: 0x00000510
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Config: 0x00001103 | Control: 0x00000001
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int stat: 0x00000000 | Int enab: 0x00000006
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int sig: 0x00000006 | Int Coal: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: TDL base: 0xfd05f000 | TDL up32: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Doorbell: 0x8000203c | TCN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Dev queue: 0x00000000 | Dev Pend: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Task clr: 0x00000000 | SSC1: 0x00001000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: SSC2: 0x00000001 | DCMD rsp: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: RED mask: 0xfdf9a080 | TERRI: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Resp idx: 0x00000000 | Resp arg: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQP: 0x00000000 | CRNQDUN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQIS: 0x00000000 | CRNQIE: 0x00000000

This change check HALT state after CQE enable, if CQE is in HALT state, we
will clear it.

Signed-off-by: Wenbin Mei <wenbin.mei@mediatek.com>
Cc: stable@vger.kernel.org
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Fixes: a4080225f51d ("mmc: cqhci: support for command queue enabled host")
Link: https://lore.kernel.org/r/20211026070812.9359-1-wenbin.mei@mediatek.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
diff 92b18252 Tue Oct 26 01:08:12 MDT 2021 Wenbin Mei <wenbin.mei@mediatek.com> mmc: cqhci: clear HALT state after CQE enable

While mmc0 enter suspend state, we need halt CQE to send legacy cmd(flush
cache) and disable cqe, for resume back, we enable CQE and not clear HALT
state.
In this case MediaTek mmc host controller will keep the value for HALT
state after CQE disable/enable flow, so the next CQE transfer after resume
will be timeout due to CQE is in HALT state, the log as below:
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: timeout for tag 2
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: ============ CQHCI REGISTER DUMP ===========
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Caps: 0x100020b6 | Version: 0x00000510
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Config: 0x00001103 | Control: 0x00000001
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int stat: 0x00000000 | Int enab: 0x00000006
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int sig: 0x00000006 | Int Coal: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: TDL base: 0xfd05f000 | TDL up32: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Doorbell: 0x8000203c | TCN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Dev queue: 0x00000000 | Dev Pend: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Task clr: 0x00000000 | SSC1: 0x00001000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: SSC2: 0x00000001 | DCMD rsp: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: RED mask: 0xfdf9a080 | TERRI: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Resp idx: 0x00000000 | Resp arg: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQP: 0x00000000 | CRNQDUN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQIS: 0x00000000 | CRNQIE: 0x00000000

This change check HALT state after CQE enable, if CQE is in HALT state, we
will clear it.

Signed-off-by: Wenbin Mei <wenbin.mei@mediatek.com>
Cc: stable@vger.kernel.org
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Fixes: a4080225f51d ("mmc: cqhci: support for command queue enabled host")
Link: https://lore.kernel.org/r/20211026070812.9359-1-wenbin.mei@mediatek.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
diff 92b18252 Tue Oct 26 01:08:12 MDT 2021 Wenbin Mei <wenbin.mei@mediatek.com> mmc: cqhci: clear HALT state after CQE enable

While mmc0 enter suspend state, we need halt CQE to send legacy cmd(flush
cache) and disable cqe, for resume back, we enable CQE and not clear HALT
state.
In this case MediaTek mmc host controller will keep the value for HALT
state after CQE disable/enable flow, so the next CQE transfer after resume
will be timeout due to CQE is in HALT state, the log as below:
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: timeout for tag 2
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: ============ CQHCI REGISTER DUMP ===========
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Caps: 0x100020b6 | Version: 0x00000510
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Config: 0x00001103 | Control: 0x00000001
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int stat: 0x00000000 | Int enab: 0x00000006
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int sig: 0x00000006 | Int Coal: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: TDL base: 0xfd05f000 | TDL up32: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Doorbell: 0x8000203c | TCN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Dev queue: 0x00000000 | Dev Pend: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Task clr: 0x00000000 | SSC1: 0x00001000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: SSC2: 0x00000001 | DCMD rsp: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: RED mask: 0xfdf9a080 | TERRI: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Resp idx: 0x00000000 | Resp arg: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQP: 0x00000000 | CRNQDUN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQIS: 0x00000000 | CRNQIE: 0x00000000

This change check HALT state after CQE enable, if CQE is in HALT state, we
will clear it.

Signed-off-by: Wenbin Mei <wenbin.mei@mediatek.com>
Cc: stable@vger.kernel.org
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Fixes: a4080225f51d ("mmc: cqhci: support for command queue enabled host")
Link: https://lore.kernel.org/r/20211026070812.9359-1-wenbin.mei@mediatek.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
diff 92b18252 Tue Oct 26 01:08:12 MDT 2021 Wenbin Mei <wenbin.mei@mediatek.com> mmc: cqhci: clear HALT state after CQE enable

While mmc0 enter suspend state, we need halt CQE to send legacy cmd(flush
cache) and disable cqe, for resume back, we enable CQE and not clear HALT
state.
In this case MediaTek mmc host controller will keep the value for HALT
state after CQE disable/enable flow, so the next CQE transfer after resume
will be timeout due to CQE is in HALT state, the log as below:
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: timeout for tag 2
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: ============ CQHCI REGISTER DUMP ===========
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Caps: 0x100020b6 | Version: 0x00000510
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Config: 0x00001103 | Control: 0x00000001
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int stat: 0x00000000 | Int enab: 0x00000006
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int sig: 0x00000006 | Int Coal: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: TDL base: 0xfd05f000 | TDL up32: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Doorbell: 0x8000203c | TCN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Dev queue: 0x00000000 | Dev Pend: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Task clr: 0x00000000 | SSC1: 0x00001000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: SSC2: 0x00000001 | DCMD rsp: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: RED mask: 0xfdf9a080 | TERRI: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Resp idx: 0x00000000 | Resp arg: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQP: 0x00000000 | CRNQDUN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQIS: 0x00000000 | CRNQIE: 0x00000000

This change check HALT state after CQE enable, if CQE is in HALT state, we
will clear it.

Signed-off-by: Wenbin Mei <wenbin.mei@mediatek.com>
Cc: stable@vger.kernel.org
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Fixes: a4080225f51d ("mmc: cqhci: support for command queue enabled host")
Link: https://lore.kernel.org/r/20211026070812.9359-1-wenbin.mei@mediatek.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
/linux-master/arch/sh/kernel/cpu/sh4a/
H A Dubc.c4352fc1b Tue Jan 05 03:06:45 MST 2010 Paul Mundt <lethal@linux-sh.org> sh: Abstracted SH-4A UBC support on hw-breakpoint core.

This is the next big chunk of hw_breakpoint support. This decouples
the SH-4A support from the core and moves it out into its own stub,
following many of the conventions established with the perf events
layering.

In addition to extending SH-4A support to encapsulate the remainder
of the UBC channels, clock framework support for handling the UBC
interface clock is added as well, allowing for dynamic clock gating.

This also fixes up a regression introduced by the SIGTRAP handling that
broke the ksym_tracer, so the current support now works well with
ksym_tracer, ptrace, and kgdb. The kprobes singlestep code will follow in
turn.

With this in place, the remaining UBC variants (SH-2A and SH-4) can now
be trivially plugged in.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
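As a rough illustration of the dynamic clock gating mentioned above (not the actual sh4a/ubc.c code), the sketch below gates a breakpoint unit's interface clock around a register write using the common clk API; the clock handle and register address are hypothetical, and the clock is assumed to have been obtained and prepared elsewhere.

    #include <linux/clk.h>
    #include <linux/io.h>

    /*
     * Hypothetical example: ungate the UBC interface clock only while the
     * breakpoint channel registers are actually being programmed.
     */
    static int ubc_program_channel(struct clk *ubc_iclk, void __iomem *reg,
                                   u32 val)
    {
            int ret = clk_enable(ubc_iclk);     /* ungate before register access */

            if (ret)
                    return ret;

            __raw_writel(val, reg);             /* program the channel */

            clk_disable(ubc_iclk);              /* gate again when idle */
            return 0;
    }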
/linux-master/fs/bcachefs/
H A Dsb-downgrade_format.h4c5eef0c Tue May 28 14:54:29 MDT 2024 Kent Overstreet <kent.overstreet@linux.dev> bcachefs: split out sb-downgrade_format.h

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
/linux-master/scripts/dtc/include-prefixes/dt-bindings/clock/
H A Dfsl,qoriq-clockgen.h4cb15934 Sun Nov 08 11:51:07 MST 2020 Michael Walle <michael@walle.cc> clk: qoriq: provide constants for the type

To avoid future mistakes in the device tree for the clockgen module, add
constants for the clockgen subtype as well as a macro for the PLL
divider.

Signed-off-by: Michael Walle <michael@walle.cc>
Acked-by: Rob Herring <robh@kernel.org>
Link: https://lore.kernel.org/r/20201108185113.31377-4-michael@walle.cc
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
/linux-master/include/dt-bindings/clock/
H A Dfsl,qoriq-clockgen.h4cb15934 Sun Nov 08 11:51:07 MST 2020 Michael Walle <michael@walle.cc> clk: qoriq: provide constants for the type

To avoid future mistakes in the device tree for the clockgen module, add
constants for the clockgen subtype as well as a macro for the PLL
divider.

Signed-off-by: Michael Walle <michael@walle.cc>
Acked-by: Rob Herring <robh@kernel.org>
Link: https://lore.kernel.org/r/20201108185113.31377-4-michael@walle.cc
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
/linux-master/Documentation/ABI/testing/
H A Dsysfs-bus-platform-devices-occ-hwmon4cf400e1 Tue Oct 19 14:53:06 MDT 2021 Eddie James <eajames@linux.ibm.com> docs: ABI: testing: Document the OCC hwmon FFDC binary interface

Add documentation for the new binary sysfs file that will dump the SBEFIFO
FFDC.

Signed-off-by: Eddie James <eajames@linux.ibm.com>
Link: https://lore.kernel.org/r/20211019205307.36946-4-eajames@linux.ibm.com
Signed-off-by: Joel Stanley <joel@jms.id.au>
/linux-master/Documentation/devicetree/bindings/sound/
H A Damlogic,axg-pdm.yaml4d37c72e Mon Feb 06 08:34:45 MST 2023 Jerome Brunet <jbrunet@baylibre.com> ASoC: dt-bindings: meson: convert axg pdm to schema

Convert the DT binding documentation for the Amlogic axg PDM device to
schema.

Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20230206153449.596326-4-jbrunet@baylibre.com
Signed-off-by: Mark Brown <broonie@kernel.org>
/linux-master/arch/arm/boot/dts/nvidia/
H A Dtegra30-lg-p880.dtsea5e97e9 Wed Feb 14 02:12:01 MST 2024 Svyatoslav Ryhel <clamor95@gmail.com> ARM: tegra: Add device-tree for LG Optimus 4X HD (P880)

Add device-tree for LG Optimus 4X HD P880, which is an NVIDIA
Tegra30-based smartphone, originally running Android.

Signed-off-by: Svyatoslav Ryhel <clamor95@gmail.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
/linux-master/scripts/dtc/include-prefixes/arm/nvidia/
H A Dtegra30-lg-p880.dtsea5e97e9 Wed Feb 14 02:12:01 MST 2024 Svyatoslav Ryhel <clamor95@gmail.com> ARM: tegra: Add device-tree for LG Optimus 4X HD (P880)

Add device-tree for LG Optimus 4X HD P880, which is an NVIDIA
Tegra30-based smartphone, originally running Android.

Signed-off-by: Svyatoslav Ryhel <clamor95@gmail.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
/linux-master/Documentation/devicetree/bindings/net/
H A Dibm,emac.txte21f9e2e Wed Apr 25 20:07:13 MDT 2018 Rob Herring <robh@kernel.org> dt-bindings: powerpc/4xx: move 4xx NDFC and EMAC bindings to subsystem directories

Bindings are supposed to be organized by device class/function. Move a
couple of powerpc 4xx bindings to the correct binding directory.

Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Rob Herring <robh@kernel.org>
/linux-master/tools/testing/selftests/uevent/
H A DMakefile9d3df886 Tue May 22 13:34:21 MDT 2018 Christian Brauner <christianvanbrauner@gmail.com> selftests: uevent filtering

Recent discussions around uevent filtering (cf. net-next commits [1], [2],
and [3] and discussions in [4], [5], and [6]) have shown that the semantics
of uevent filtering were not well understood.
Now that we have settled - at least for the moment - on how uevent filtering
should look, let's add some selftests to ensure we don't regress anything in
the future.
Note, the semantics of uevent filtering are described in detail in my commit
message for [2], so I won't repeat them here.

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=90d52d4fd82007005125d9a8d2d560a1ca059b9d
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=a3498436b3a0f8ec289e6847e1de40b4123e1639
[3]: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=26045a7b14bc7a5455e411d820110f66557d6589
[4]: https://lkml.org/lkml/2018/4/4/739
[5]: https://lkml.org/lkml/2018/4/26/767
[6]: https://lkml.org/lkml/2018/4/26/738

Signed-off-by: Christian Brauner <christian@brauner.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
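For orientation, here is a minimal sketch of the kind of receiver such filtering tests exercise: a userspace listener on the NETLINK_KOBJECT_UEVENT socket. It only illustrates the interface and is not the selftest added by this commit.

    #include <stdio.h>
    #include <sys/socket.h>
    #include <linux/netlink.h>

    int main(void)
    {
            struct sockaddr_nl addr = {
                    .nl_family = AF_NETLINK,
                    .nl_groups = 1,                 /* kernel uevent multicast group */
            };
            char buf[4096];
            int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_KOBJECT_UEVENT);

            if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
                    perror("uevent socket");
                    return 1;
            }

            for (;;) {
                    ssize_t n = recv(fd, buf, sizeof(buf) - 1, 0);

                    if (n <= 0)
                            break;
                    buf[n] = '\0';
                    printf("%s\n", buf);            /* e.g. "add@/devices/..." */
            }
            return 0;
    }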
H A Dconfig9d3df886 Tue May 22 13:34:21 MDT 2018 Christian Brauner <christianvanbrauner@gmail.com> selftests: uevent filtering

Recent discussions around uevent filtering (cf. net-next commits [1], [2],
and [3] and discussions in [4], [5], and [6]) have shown that the semantics
of uevent filtering were not well understood.
Now that we have settled - at least for the moment - on how uevent filtering
should look, let's add some selftests to ensure we don't regress anything in
the future.
Note, the semantics of uevent filtering are described in detail in my commit
message for [2], so I won't repeat them here.

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=90d52d4fd82007005125d9a8d2d560a1ca059b9d
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=a3498436b3a0f8ec289e6847e1de40b4123e1639
[3]: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=26045a7b14bc7a5455e411d820110f66557d6589
[4]: https://lkml.org/lkml/2018/4/4/739
[5]: https://lkml.org/lkml/2018/4/26/767
[6]: https://lkml.org/lkml/2018/4/26/738

Signed-off-by: Christian Brauner <christian@brauner.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
/linux-master/Documentation/devicetree/bindings/arm/marvell/
H A Dmarvell,ac5.yamld6c1b95d Tue Jul 05 13:09:19 MDT 2022 Chris Packham <chris.packham@alliedtelesis.co.nz> dt-bindings: marvell: Document the AC5/AC5X compatibles

Describe the compatible properties for the Marvell Alleycat5/5X switches
with integrated CPUs.

Alleycat5:
* 98DX2538: 24x1G + 2x10G + 2x10G Stack
* 98DX2535: 24x1G + 4x1G Stack
* 98DX2532: 8x1G + 2x10G + 2x1G Stack
* 98DX2531: 8x1G + 4x1G Stack
* 98DX2528: 24x1G + 2x10G + 2x10G Stack
* 98DX2525: 24x1G + 4x1G Stack
* 98DX2522: 8x1G + 2x10G + 2x1G Stack
* 98DX2521: 8x1G + 4x1G Stack
* 98DX2518: 24x1G + 2x10G + 2x10G Stack
* 98DX2515: 24x1G + 4x1G Stack
* 98DX2512: 8x1G + 2x10G + 2x1G Stack
* 98DX2511: 8x1G + 4x1G Stack

Alleycat5X:
* 98DX3500: 24x1G + 6x25G
* 98DX3501: 16x1G + 6x10G
* 98DX3510: 48x1G + 6x25G
* 98DX3520: 24x2.5G + 6x25G
* 98DX3530: 48x2.5G + 6x25G
* 98DX3540: 12x5G/6x10G + 6x25G
* 98DX3550: 24x5G/12x10G + 6x25G

Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Signed-off-by: Vadym Kochan <vadym.kochan@plvision.eu>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Gregory CLEMENT <gregory.clement@bootlin.com>
/linux-master/arch/arm64/boot/dts/rockchip/
rk3399-rock-pi-4b-plus.dts
cfa12c32 Fri Jul 07 00:33:34 MDT 2023 Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> arm64: dts: rockchip: correct wifi interrupt flag in Rock Pi 4B

GPIO_ACTIVE_x flags are not correct in the context of interrupt flags.
These are simple defines so they could be used in DTS but they will not
have the same meaning: GPIO_ACTIVE_HIGH = 0 = IRQ_TYPE_NONE.

Correct the interrupt flags, assuming the author of the code wanted the same
logical behavior behind the name "ACTIVE_xxx", that is:
ACTIVE_HIGH => IRQ_TYPE_LEVEL_HIGH

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Tested-by: Christopher Obbard <chris.obbard@collabora.com>
Link: https://lore.kernel.org/r/20230707063335.13317-2-krzysztof.kozlowski@linaro.org
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
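
A minimal sketch of the kind of change described, using a hypothetical node label and GPIO index rather than the actual Rock Pi 4 nodes:

/* requires #include <dt-bindings/interrupt-controller/irq.h> */
&wifi {	/* hypothetical label for the SDIO wifi node */
	interrupt-parent = <&gpio0>;
	/* was: interrupts = <3 GPIO_ACTIVE_HIGH>; (GPIO_ACTIVE_HIGH == 0 == IRQ_TYPE_NONE) */
	interrupts = <3 IRQ_TYPE_LEVEL_HIGH>;
	interrupt-names = "host-wake";
};
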
f471b1b2 Fri Nov 12 07:23:59 MST 2021 Jagan Teki <jagan@amarulasolutions.com> arm64: dts: rockchip: Fix Bluetooth on ROCK Pi 4 boards

This patch fixes the Bluetooth on ROCK Pi 4 boards.

ROCK Pi 4 boards have the BCM4345C5, which is now supported
in mainline Linux. The brcm,bcm43438-bt compatible still works,
but BT audio issues were observed in recent testing.

So, use the BCM4345C5 compatible and its associated
properties, such as clock-names ("lpo") and max-speed.

Attach vbat and vddio supply rails as well.

Signed-off-by: Jagan Teki <jagan@amarulasolutions.com>
Link: https://lore.kernel.org/r/20211112142359.320798-1-jagan@amarulasolutions.com
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
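
A hedged sketch of such a Bluetooth node; the UART label, clock provider and regulator names are placeholders, not taken from the actual board DTS:

&uart0 {
	bluetooth {
		compatible = "brcm,bcm4345c5";
		clocks = <&rk808 1>;		/* placeholder LPO clock provider */
		clock-names = "lpo";
		max-speed = <1500000>;
		vbat-supply = <&vcc3v3_sys>;	/* placeholder regulators */
		vddio-supply = <&vcc_1v8>;
	};
};
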
4b718ae7 Fri Jun 18 12:12:54 MDT 2021 Alex Bee <knaerzche@gmail.com> arm64: dts: rockchip: Add RK3399 ROCK Pi 4B+ board

ROCK Pi 4B+ board is the successor of ROCK Pi 4B board.

Differences from the original version are:
- has RK3399 OP1 SoC revision
- has eMMC (16 or 32 GB) soldered on board (no changes required,
since it is enabled in rk3399-rock-pi-4.dtsi)
- dev boards have SPI flash soldered, but as per manufacturer response,
this won't be the case for mass production boards

Signed-off-by: Alex Bee <knaerzche@gmail.com>
Link: https://lore.kernel.org/r/20210618181256.27992-4-knaerzche@gmail.com
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
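
As a rough illustration, a board file for such a variant mostly just includes the shared dtsi and sets its own model and compatible; the strings below are illustrative, not copied from the actual file:

/dts-v1/;
#include "rk3399-rock-pi-4.dtsi"

/ {
	model = "Radxa ROCK Pi 4B+";
	compatible = "radxa,rockpi4b-plus", "radxa,rockpi4", "rockchip,rk3399";
};
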
/linux-master/scripts/dtc/include-prefixes/arm64/rockchip/
rk3399-rock-4c-plus.dts
c1b1f340 Wed Mar 27 13:26:38 MDT 2024 Folker Schwesinger <dev@folker-schwesinger.de> arm64: dts: rockchip: Add enable-strobe-pulldown to emmc phy on ROCK 4C+

Commit 8b5c2b45b8f0 disabled the internal pull-down for the strobe line,
causing I/O errors in HS400 mode for various eMMC modules.

Enable the internal strobe pull-down for the ROCK 4C+ board. Also re-enable
HS400 mode, which was replaced with HS200 mode as a workaround for the
stability issues in:
2bd1d2dd808c ("arm64: dts: rockchip: Disable HS400 for eMMC on ROCK 4C+").

Fixes: 8b5c2b45b8f0 ("phy: rockchip: set pulldown for strobe line in dts")
Signed-off-by: Folker Schwesinger <dev@folker-schwesinger.de>
Reviewed-by: Dragan Simic <dsimic@manjaro.org>
Link: https://lore.kernel.org/r/20240327192641.14220-3-dev@folker-schwesinger.de
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
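
A minimal sketch of the resulting fragments, assuming the usual rk3399 node labels (&emmc_phy, &sdhci):

&emmc_phy {
	rockchip,enable-strobe-pulldown;
	status = "okay";
};

&sdhci {
	mmc-hs400-1_8v;
	mmc-hs400-enhanced-strobe;
	non-removable;
	status = "okay";
};
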
2bd1d2dd Wed Jul 05 08:42:55 MDT 2023 Christopher Obbard <chris.obbard@collabora.com> arm64: dts: rockchip: Disable HS400 for eMMC on ROCK 4C+

There is some instability with some eMMC modules on ROCK Pi 4 SBCs running
in HS400 mode. This ends up resulting in some block errors after a while
or after a "heavy" operation utilising the eMMC (e.g. resizing a
filesystem). An example of these errors is as follows:

[ 289.171014] mmc1: running CQE recovery
[ 290.048972] mmc1: running CQE recovery
[ 290.054834] mmc1: running CQE recovery
[ 290.060817] mmc1: running CQE recovery
[ 290.061337] blk_update_request: I/O error, dev mmcblk1, sector 1411072 op 0x1:(WRITE) flags 0x800 phys_seg 36 prio class 0
[ 290.061370] EXT4-fs warning (device mmcblk1p1): ext4_end_bio:348: I/O error 10 writing to inode 29547 starting block 176466)
[ 290.061484] Buffer I/O error on device mmcblk1p1, logical block 172288
[ 290.061531] Buffer I/O error on device mmcblk1p1, logical block 172289
[ 290.061551] Buffer I/O error on device mmcblk1p1, logical block 172290
[ 290.061574] Buffer I/O error on device mmcblk1p1, logical block 172291
[ 290.061592] Buffer I/O error on device mmcblk1p1, logical block 172292
[ 290.061615] Buffer I/O error on device mmcblk1p1, logical block 172293
[ 290.061632] Buffer I/O error on device mmcblk1p1, logical block 172294
[ 290.061654] Buffer I/O error on device mmcblk1p1, logical block 172295
[ 290.061673] Buffer I/O error on device mmcblk1p1, logical block 172296
[ 290.061695] Buffer I/O error on device mmcblk1p1, logical block 172297

Disabling the Command Queue seems to stop the CQE recovery from running,
but doesn't seem to improve the I/O errors. Until this can be investigated
further, disable HS400 mode on the ROCK Pi 4 SBCs to at least stop I/O
errors from occurring.

Fixes: 246450344dad ("arm64: dts: rockchip: rk3399: Radxa ROCK 4C+")
Signed-off-by: Christopher Obbard <chris.obbard@collabora.com>
Link: https://lore.kernel.org/r/20230705144255.115299-3-chris.obbard@collabora.com
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
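
The workaround amounts to dropping the HS400 properties and advertising HS200 on the eMMC controller instead, roughly (assuming the usual &sdhci label):

&sdhci {
	/* HS400 disabled until the CQE recovery / I/O errors are understood */
	mmc-hs200-1_8v;
	non-removable;
	status = "okay";
};
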
d268da06 Wed Jan 18 01:04:54 MST 2023 Jagan Teki <jagan@amarulasolutions.com> arm64: dts: rockchip: Update eMMC, SD aliases for Radxa SBC boards

Radxa SBC boards like the ROCK 3A/4 models support eMMC and SD card
via external connector slots.

Mark the eMMC as mmc0, following the Rockchip boot-order priority,
since both MMC devices are connected externally.

Reported-by: FUKAUMI Naoki <naoki@radxa.com>
Signed-off-by: Jagan Teki <jagan@amarulasolutions.com>
Link: https://lore.kernel.org/r/20230118080454.11643-2-jagan@amarulasolutions.com
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
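
A sketch of the aliases change, assuming the standard rk3399 node labels:

/ {
	aliases {
		mmc0 = &sdhci;	/* eMMC, first in the boot order */
		mmc1 = &sdmmc;	/* SD card */
	};
};
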
7f767e68 Tue Oct 18 01:22:42 MDT 2022 FUKAUMI Naoki <naoki@radxa.com> arm64: dts: rockchip: enable tsadc for ROCK 4C+

Add and enable the Temperature Sensor ADC (TSADC) for the Radxa ROCK 4C+.

Signed-off-by: FUKAUMI Naoki <naoki@radxa.com>
Link: https://lore.kernel.org/r/20221018072242.2348995-1-naoki@radxa.com
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
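
A hedged sketch of enabling the rk3399 TSADC; the shutdown-mode values shown are the commonly used ones, not necessarily those of this exact board:

&tsadc {
	rockchip,hw-tshut-mode = <1>;		/* 1: GPIO */
	rockchip,hw-tshut-polarity = <1>;	/* 1: active high */
	status = "okay";
};
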
/linux-master/tools/perf/arch/csky/annotate/
instructions.c
aa23aa55 Wed Jun 26 00:52:19 MDT 2019 Mao Han <han_mao@c-sky.com> perf annotate: Add csky support

This patch adds basic arch initialization and instruction association
support for the csky CPU architecture.

E.g.:

$ perf annotate --stdio2
Samples: 161 of event 'cpu-clock:pppH', 4000 Hz, Event count (approx.):
40250000, [percent: local period]
test_4() /usr/lib/perf-test/callchain_test
Percent

Disassembly of section .text:

00008420 <test_4>:
test_4():
subi sp, sp, 4
st.w r8, (sp, 0x0)
mov r8, sp
subi sp, sp, 8
subi r3, r8, 4
movi r2, 0
st.w r2, (r3, 0x0)
↓ br 2e
100.00 14: subi r3, r8, 4
ld.w r2, (r3, 0x0)
subi r3, r8, 8
st.w r2, (r3, 0x0)
subi r3, r8, 4
ld.w r3, (r3, 0x0)
addi r2, r3, 1
subi r3, r8, 4
st.w r2, (r3, 0x0)
2e: subi r3, r8, 4
ld.w r2, (r3, 0x0)
lrw r3, 0x98967f // 8598 <main+0x28>
cmplt r3, r2
↑ bf 14
mov r0, r0
mov r0, r0
mov sp, r8
ld.w r8, (sp, 0x0)
addi sp, sp, 4
← rts

Signed-off-by: Mao Han <han_mao@c-sky.com>
Acked-by: Guo Ren <guoren@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-csky@vger.kernel.org
Link: http://lkml.kernel.org/r/d874d7782d9acdad5d98f2f5c4a6fb26fbe41c5d.1561531557.git.han_mao@c-sky.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
/linux-master/tools/perf/util/bpf_skel/
bperf_follower.bpf.c
5a897531 Fri Dec 03 16:14:41 MST 2021 Song Liu <songliubraving@fb.com> perf bpf_skel: Do not use typedef to avoid error on old clang

When building bpf_skel with clang-10, the typedef causes confusion like:

libbpf: map 'prev_readings': unexpected def kind var.

Fix this by removing the typedef.

Fixes: 7fac83aaf2eecc9e ("perf stat: Introduce 'bperf' to share hardware PMCs with BPF")
Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lore.kernel.org/lkml/BEF5C312-4331-4A60-AEC0-AD7617CB2BC4@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
7fac83aa Tue Mar 16 15:18:35 MDT 2021 Song Liu <songliubraving@fb.com> perf stat: Introduce 'bperf' to share hardware PMCs with BPF

The perf tool uses performance monitoring counters (PMCs) to monitor
system performance. The PMCs are limited hardware resources. For
example, Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways: system
level monitoring, (maybe nested) container level monitoring, per process
monitoring, profiling (in sample mode), etc. In some cases, there are
more active perf_events than available hardware PMCs. To allow all
perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics
(cycles, instructions). It is a waste to have multiple tools create
multiple perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such waste by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The
default path is /sys/fs/bpf/perf_attr_map. The user could change the
path with option --bpf-attr-map.

Committer testing:

# dmesg|grep "Performance Events" -A5
[ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 0.225280] ... version: 0
[ 0.225280] ... bit width: 48
[ 0.225281] ... generic registers: 6
[ 0.225281] ... value mask: 0000ffffffffffff
[ 0.225281] ... max period: 00007fffffffffff
#
# for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
[1] 2436231
[2] 2436232
[3] 2436233
[4] 2436234
[5] 2436235
[6] 2436236
# perf stat -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

310,326,987 cycles (41.87%)
236,143,290 instructions # 0.76 insn per cycle (41.87%)

0.100800885 seconds time elapsed

#

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

# for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
[1] 2436514
[2] 2436515
[3] 2436516
[4] 2436517
[5] 2436518
[6] 2436519
[7] 2436520
[8] 2436521
[9] 2436522
[10] 2436523
[11] 2436524
[12] 2436525
[13] 2436526
[14] 2436527
[15] 2436528
[16] 2436529
[17] 2436530
[18] 2436531
[19] 2436532
[20] 2436533
[21] 2436534
[22] 2436535
[23] 2436536
[24] 2436537
[25] 2436538
[26] 2436539
[27] 2436540
[28] 2436541
[29] 2436542
[30] 2436543
[31] 2436544
[32] 2436545
#
# ls -la /sys/fs/bpf/perf_attr_map
-rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
# bpftool map | grep bperf | wc -l
64
#

# bpftool map | tail
1265: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1266: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1267: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 996
pids perf(2436545)
1268: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1269: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1270: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 997
pids perf(2436541)
1285: array name pid_iter.rodata flags 0x480
key 4B value 4B max_entries 1 memlock 4096B
btf_id 1017 frozen
pids bpftool(2437504)
1286: array flags 0x0
key 4B value 32B max_entries 1 memlock 4096B
#
# bpftool map dump id 1268 | tail
value (CPU 21):
8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00
80 fd 2a d1 4d 00 00 00
value (CPU 22):
7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00
a4 8a 2e ee 4d 00 00 00
value (CPU 23):
a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00
b2 34 94 f6 4d 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00
20 c6 fc 83 4e 00 00 00
value (CPU 22):
9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00
3e 0c df 89 4e 00 00 00
value (CPU 23):
18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00
5b 69 ed 83 4e 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00
92 67 4c ba 4e 00 00 00
value (CPU 22):
dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00
d9 32 7a c5 4e 00 00 00
value (CPU 23):
bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00
7c 73 87 bf 4e 00 00 00
Found 1 element
#

# perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

119,410,122 cycles
152,105,479 instructions # 1.27 insn per cycle

0.101395093 seconds time elapsed

#

See? We had the counters enabled all the time.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The
default path is /sys/fs/bpf/perf_attr_map. The user could change the
path with option --bpf-attr-map.

Committer testing:

# dmesg|grep "Performance Events" -A5
[ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 0.225280] ... version: 0
[ 0.225280] ... bit width: 48
[ 0.225281] ... generic registers: 6
[ 0.225281] ... value mask: 0000ffffffffffff
[ 0.225281] ... max period: 00007fffffffffff
#
# for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
[1] 2436231
[2] 2436232
[3] 2436233
[4] 2436234
[5] 2436235
[6] 2436236
# perf stat -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

310,326,987 cycles (41.87%)
236,143,290 instructions # 0.76 insn per cycle (41.87%)

0.100800885 seconds time elapsed

#

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

# for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
[1] 2436514
[2] 2436515
[3] 2436516
[4] 2436517
[5] 2436518
[6] 2436519
[7] 2436520
[8] 2436521
[9] 2436522
[10] 2436523
[11] 2436524
[12] 2436525
[13] 2436526
[14] 2436527
[15] 2436528
[16] 2436529
[17] 2436530
[18] 2436531
[19] 2436532
[20] 2436533
[21] 2436534
[22] 2436535
[23] 2436536
[24] 2436537
[25] 2436538
[26] 2436539
[27] 2436540
[28] 2436541
[29] 2436542
[30] 2436543
[31] 2436544
[32] 2436545
#
# ls -la /sys/fs/bpf/perf_attr_map
-rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
# bpftool map | grep bperf | wc -l
64
#

# bpftool map | tail
1265: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1266: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1267: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 996
pids perf(2436545)
1268: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1269: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1270: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 997
pids perf(2436541)
1285: array name pid_iter.rodata flags 0x480
key 4B value 4B max_entries 1 memlock 4096B
btf_id 1017 frozen
pids bpftool(2437504)
1286: array flags 0x0
key 4B value 32B max_entries 1 memlock 4096B
#
# bpftool map dump id 1268 | tail
value (CPU 21):
8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00
80 fd 2a d1 4d 00 00 00
value (CPU 22):
7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00
a4 8a 2e ee 4d 00 00 00
value (CPU 23):
a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00
b2 34 94 f6 4d 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00
20 c6 fc 83 4e 00 00 00
value (CPU 22):
9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00
3e 0c df 89 4e 00 00 00
value (CPU 23):
18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00
5b 69 ed 83 4e 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00
92 67 4c ba 4e 00 00 00
value (CPU 22):
dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00
d9 32 7a c5 4e 00 00 00
value (CPU 23):
bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00
7c 73 87 bf 4e 00 00 00
Found 1 element
#

# perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

119,410,122 cycles
152,105,479 instructions # 1.27 insn per cycle

0.101395093 seconds time elapsed

#

See? We had the counters enabled all the time.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
7fac83aa Tue Mar 16 15:18:35 MDT 2021 Song Liu <songliubraving@fb.com> perf stat: Introduce 'bperf' to share hardware PMCs with BPF

The perf tool uses performance monitoring counters (PMCs) to monitor
system performance. The PMCs are limited hardware resources. For
example, Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways: system
level monitoring, (maybe nested) container level monitoring, per process
monitoring, profiling (in sample mode), etc. In some cases, there are
more active perf_events than available hardware PMCs. To allow all
perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics
(cycles, instructions). It is a waste to have multiple tools create
multiple perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such wastes by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session to read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The
default path is /sys/fs/bpf/perf_attr_map. The user could change the
path with option --bpf-attr-map.

Committer testing:

# dmesg|grep "Performance Events" -A5
[ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 0.225280] ... version: 0
[ 0.225280] ... bit width: 48
[ 0.225281] ... generic registers: 6
[ 0.225281] ... value mask: 0000ffffffffffff
[ 0.225281] ... max period: 00007fffffffffff
#
# for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
[1] 2436231
[2] 2436232
[3] 2436233
[4] 2436234
[5] 2436235
[6] 2436236
# perf stat -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

310,326,987 cycles (41.87%)
236,143,290 instructions # 0.76 insn per cycle (41.87%)

0.100800885 seconds time elapsed

#

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

# for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
[1] 2436514
[2] 2436515
[3] 2436516
[4] 2436517
[5] 2436518
[6] 2436519
[7] 2436520
[8] 2436521
[9] 2436522
[10] 2436523
[11] 2436524
[12] 2436525
[13] 2436526
[14] 2436527
[15] 2436528
[16] 2436529
[17] 2436530
[18] 2436531
[19] 2436532
[20] 2436533
[21] 2436534
[22] 2436535
[23] 2436536
[24] 2436537
[25] 2436538
[26] 2436539
[27] 2436540
[28] 2436541
[29] 2436542
[30] 2436543
[31] 2436544
[32] 2436545
#
# ls -la /sys/fs/bpf/perf_attr_map
-rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
# bpftool map | grep bperf | wc -l
64
#

# bpftool map | tail
1265: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1266: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1267: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 996
pids perf(2436545)
1268: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1269: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1270: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 997
pids perf(2436541)
1285: array name pid_iter.rodata flags 0x480
key 4B value 4B max_entries 1 memlock 4096B
btf_id 1017 frozen
pids bpftool(2437504)
1286: array flags 0x0
key 4B value 32B max_entries 1 memlock 4096B
#
# bpftool map dump id 1268 | tail
value (CPU 21):
8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00
80 fd 2a d1 4d 00 00 00
value (CPU 22):
7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00
a4 8a 2e ee 4d 00 00 00
value (CPU 23):
a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00
b2 34 94 f6 4d 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00
20 c6 fc 83 4e 00 00 00
value (CPU 22):
9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00
3e 0c df 89 4e 00 00 00
value (CPU 23):
18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00
5b 69 ed 83 4e 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00
92 67 4c ba 4e 00 00 00
value (CPU 22):
dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00
d9 32 7a c5 4e 00 00 00
value (CPU 23):
bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00
7c 73 87 bf 4e 00 00 00
Found 1 element
#

# perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

119,410,122 cycles
152,105,479 instructions # 1.27 insn per cycle

0.101395093 seconds time elapsed

#

See? We had the counters enabled all the time.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
7fac83aa Tue Mar 16 15:18:35 MDT 2021 Song Liu <songliubraving@fb.com> perf stat: Introduce 'bperf' to share hardware PMCs with BPF

The perf tool uses performance monitoring counters (PMCs) to monitor
system performance. The PMCs are limited hardware resources. For
example, Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways: system
level monitoring, (maybe nested) container level monitoring, per process
monitoring, profiling (in sample mode), etc. In some cases, there are
more active perf_events than available hardware PMCs. To allow all
perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics
(cycles, instructions). It is a waste to have multiple tools create
multiple perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such wastes by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session to read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The
default path is /sys/fs/bpf/perf_attr_map. The user could change the
path with option --bpf-attr-map.

Committer testing:

# dmesg|grep "Performance Events" -A5
[ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 0.225280] ... version: 0
[ 0.225280] ... bit width: 48
[ 0.225281] ... generic registers: 6
[ 0.225281] ... value mask: 0000ffffffffffff
[ 0.225281] ... max period: 00007fffffffffff
#
# for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
[1] 2436231
[2] 2436232
[3] 2436233
[4] 2436234
[5] 2436235
[6] 2436236
# perf stat -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

310,326,987 cycles (41.87%)
236,143,290 instructions # 0.76 insn per cycle (41.87%)

0.100800885 seconds time elapsed

#

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

# for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
[1] 2436514
[2] 2436515
[3] 2436516
[4] 2436517
[5] 2436518
[6] 2436519
[7] 2436520
[8] 2436521
[9] 2436522
[10] 2436523
[11] 2436524
[12] 2436525
[13] 2436526
[14] 2436527
[15] 2436528
[16] 2436529
[17] 2436530
[18] 2436531
[19] 2436532
[20] 2436533
[21] 2436534
[22] 2436535
[23] 2436536
[24] 2436537
[25] 2436538
[26] 2436539
[27] 2436540
[28] 2436541
[29] 2436542
[30] 2436543
[31] 2436544
[32] 2436545
#
# ls -la /sys/fs/bpf/perf_attr_map
-rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
# bpftool map | grep bperf | wc -l
64
#

# bpftool map | tail
1265: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1266: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1267: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 996
pids perf(2436545)
1268: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1269: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1270: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 997
pids perf(2436541)
1285: array name pid_iter.rodata flags 0x480
key 4B value 4B max_entries 1 memlock 4096B
btf_id 1017 frozen
pids bpftool(2437504)
1286: array flags 0x0
key 4B value 32B max_entries 1 memlock 4096B
#
# bpftool map dump id 1268 | tail
value (CPU 21):
8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00
80 fd 2a d1 4d 00 00 00
value (CPU 22):
7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00
a4 8a 2e ee 4d 00 00 00
value (CPU 23):
a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00
b2 34 94 f6 4d 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00
20 c6 fc 83 4e 00 00 00
value (CPU 22):
9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00
3e 0c df 89 4e 00 00 00
value (CPU 23):
18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00
5b 69 ed 83 4e 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00
92 67 4c ba 4e 00 00 00
value (CPU 22):
dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00
d9 32 7a c5 4e 00 00 00
value (CPU 23):
bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00
7c 73 87 bf 4e 00 00 00
Found 1 element
#

# perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

119,410,122 cycles
152,105,479 instructions # 1.27 insn per cycle

0.101395093 seconds time elapsed

#

See? We had the counters enabled all the time.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
7fac83aa Tue Mar 16 15:18:35 MDT 2021 Song Liu <songliubraving@fb.com> perf stat: Introduce 'bperf' to share hardware PMCs with BPF

The perf tool uses performance monitoring counters (PMCs) to monitor
system performance. The PMCs are limited hardware resources. For
example, Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways: system
level monitoring, (maybe nested) container level monitoring, per process
monitoring, profiling (in sample mode), etc. In some cases, there are
more active perf_events than available hardware PMCs. To allow all
perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics
(cycles, instructions). It is a waste to have multiple tools create
multiple perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such wastes by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session to read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The
default path is /sys/fs/bpf/perf_attr_map. The user could change the
path with option --bpf-attr-map.

Committer testing:

# dmesg|grep "Performance Events" -A5
[ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 0.225280] ... version: 0
[ 0.225280] ... bit width: 48
[ 0.225281] ... generic registers: 6
[ 0.225281] ... value mask: 0000ffffffffffff
[ 0.225281] ... max period: 00007fffffffffff
#
# for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
[1] 2436231
[2] 2436232
[3] 2436233
[4] 2436234
[5] 2436235
[6] 2436236
# perf stat -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

310,326,987 cycles (41.87%)
236,143,290 instructions # 0.76 insn per cycle (41.87%)

0.100800885 seconds time elapsed

#

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

# for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
[1] 2436514
[2] 2436515
[3] 2436516
[4] 2436517
[5] 2436518
[6] 2436519
[7] 2436520
[8] 2436521
[9] 2436522
[10] 2436523
[11] 2436524
[12] 2436525
[13] 2436526
[14] 2436527
[15] 2436528
[16] 2436529
[17] 2436530
[18] 2436531
[19] 2436532
[20] 2436533
[21] 2436534
[22] 2436535
[23] 2436536
[24] 2436537
[25] 2436538
[26] 2436539
[27] 2436540
[28] 2436541
[29] 2436542
[30] 2436543
[31] 2436544
[32] 2436545
#
# ls -la /sys/fs/bpf/perf_attr_map
-rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
# bpftool map | grep bperf | wc -l
64
#

# bpftool map | tail
1265: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1266: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1267: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 996
pids perf(2436545)
1268: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1269: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1270: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 997
pids perf(2436541)
1285: array name pid_iter.rodata flags 0x480
key 4B value 4B max_entries 1 memlock 4096B
btf_id 1017 frozen
pids bpftool(2437504)
1286: array flags 0x0
key 4B value 32B max_entries 1 memlock 4096B
#
# bpftool map dump id 1268 | tail
value (CPU 21):
8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00
80 fd 2a d1 4d 00 00 00
value (CPU 22):
7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00
a4 8a 2e ee 4d 00 00 00
value (CPU 23):
a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00
b2 34 94 f6 4d 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00
20 c6 fc 83 4e 00 00 00
value (CPU 22):
9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00
3e 0c df 89 4e 00 00 00
value (CPU 23):
18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00
5b 69 ed 83 4e 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00
92 67 4c ba 4e 00 00 00
value (CPU 22):
dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00
d9 32 7a c5 4e 00 00 00
value (CPU 23):
bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00
7c 73 87 bf 4e 00 00 00
Found 1 element
#

# perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

119,410,122 cycles
152,105,479 instructions # 1.27 insn per cycle

0.101395093 seconds time elapsed

#

See? We had the counters enabled all the time.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
7fac83aa Tue Mar 16 15:18:35 MDT 2021 Song Liu <songliubraving@fb.com> perf stat: Introduce 'bperf' to share hardware PMCs with BPF

The perf tool uses performance monitoring counters (PMCs) to monitor
system performance. The PMCs are limited hardware resources. For
example, Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways: system
level monitoring, (maybe nested) container level monitoring, per process
monitoring, profiling (in sample mode), etc. In some cases, there are
more active perf_events than available hardware PMCs. To allow all
perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics
(cycles, instructions). It is a waste to have multiple tools create
multiple perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such wastes by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session to read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The
default path is /sys/fs/bpf/perf_attr_map. The user could change the
path with option --bpf-attr-map.

Committer testing:

# dmesg|grep "Performance Events" -A5
[ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 0.225280] ... version: 0
[ 0.225280] ... bit width: 48
[ 0.225281] ... generic registers: 6
[ 0.225281] ... value mask: 0000ffffffffffff
[ 0.225281] ... max period: 00007fffffffffff
#
# for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
[1] 2436231
[2] 2436232
[3] 2436233
[4] 2436234
[5] 2436235
[6] 2436236
# perf stat -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

310,326,987 cycles (41.87%)
236,143,290 instructions # 0.76 insn per cycle (41.87%)

0.100800885 seconds time elapsed

#

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

# for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
[1] 2436514
[2] 2436515
[3] 2436516
[4] 2436517
[5] 2436518
[6] 2436519
[7] 2436520
[8] 2436521
[9] 2436522
[10] 2436523
[11] 2436524
[12] 2436525
[13] 2436526
[14] 2436527
[15] 2436528
[16] 2436529
[17] 2436530
[18] 2436531
[19] 2436532
[20] 2436533
[21] 2436534
[22] 2436535
[23] 2436536
[24] 2436537
[25] 2436538
[26] 2436539
[27] 2436540
[28] 2436541
[29] 2436542
[30] 2436543
[31] 2436544
[32] 2436545
#
# ls -la /sys/fs/bpf/perf_attr_map
-rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
# bpftool map | grep bperf | wc -l
64
#

# bpftool map | tail
1265: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1266: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1267: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 996
pids perf(2436545)
1268: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1269: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1270: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 997
pids perf(2436541)
1285: array name pid_iter.rodata flags 0x480
key 4B value 4B max_entries 1 memlock 4096B
btf_id 1017 frozen
pids bpftool(2437504)
1286: array flags 0x0
key 4B value 32B max_entries 1 memlock 4096B
#
# bpftool map dump id 1268 | tail
value (CPU 21):
8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00
80 fd 2a d1 4d 00 00 00
value (CPU 22):
7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00
a4 8a 2e ee 4d 00 00 00
value (CPU 23):
a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00
b2 34 94 f6 4d 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00
20 c6 fc 83 4e 00 00 00
value (CPU 22):
9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00
3e 0c df 89 4e 00 00 00
value (CPU 23):
18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00
5b 69 ed 83 4e 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00
92 67 4c ba 4e 00 00 00
value (CPU 22):
dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00
d9 32 7a c5 4e 00 00 00
value (CPU 23):
bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00
7c 73 87 bf 4e 00 00 00
Found 1 element
#

# perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

119,410,122 cycles
152,105,479 instructions # 1.27 insn per cycle

0.101395093 seconds time elapsed

#

See? We had the counters enabled all the time.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
7fac83aa Tue Mar 16 15:18:35 MDT 2021 Song Liu <songliubraving@fb.com> perf stat: Introduce 'bperf' to share hardware PMCs with BPF

The perf tool uses performance monitoring counters (PMCs) to monitor
system performance. The PMCs are limited hardware resources. For
example, Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways: system
level monitoring, (maybe nested) container level monitoring, per process
monitoring, profiling (in sample mode), etc. In some cases, there are
more active perf_events than available hardware PMCs. To allow all
perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics
(cycles, instructions). It is a waste to have multiple tools create
multiple perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such wastes by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session to read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The
default path is /sys/fs/bpf/perf_attr_map. The user could change the
path with option --bpf-attr-map.

Committer testing:

# dmesg|grep "Performance Events" -A5
[ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 0.225280] ... version: 0
[ 0.225280] ... bit width: 48
[ 0.225281] ... generic registers: 6
[ 0.225281] ... value mask: 0000ffffffffffff
[ 0.225281] ... max period: 00007fffffffffff
#
# for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
[1] 2436231
[2] 2436232
[3] 2436233
[4] 2436234
[5] 2436235
[6] 2436236
# perf stat -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

310,326,987 cycles (41.87%)
236,143,290 instructions # 0.76 insn per cycle (41.87%)

0.100800885 seconds time elapsed

#

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

# for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
[1] 2436514
[2] 2436515
[3] 2436516
[4] 2436517
[5] 2436518
[6] 2436519
[7] 2436520
[8] 2436521
[9] 2436522
[10] 2436523
[11] 2436524
[12] 2436525
[13] 2436526
[14] 2436527
[15] 2436528
[16] 2436529
[17] 2436530
[18] 2436531
[19] 2436532
[20] 2436533
[21] 2436534
[22] 2436535
[23] 2436536
[24] 2436537
[25] 2436538
[26] 2436539
[27] 2436540
[28] 2436541
[29] 2436542
[30] 2436543
[31] 2436544
[32] 2436545
#
# ls -la /sys/fs/bpf/perf_attr_map
-rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
# bpftool map | grep bperf | wc -l
64
#

# bpftool map | tail
1265: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1266: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1267: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 996
pids perf(2436545)
1268: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1269: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1270: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 997
pids perf(2436541)
1285: array name pid_iter.rodata flags 0x480
key 4B value 4B max_entries 1 memlock 4096B
btf_id 1017 frozen
pids bpftool(2437504)
1286: array flags 0x0
key 4B value 32B max_entries 1 memlock 4096B
#
# bpftool map dump id 1268 | tail
value (CPU 21):
8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00
80 fd 2a d1 4d 00 00 00
value (CPU 22):
7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00
a4 8a 2e ee 4d 00 00 00
value (CPU 23):
a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00
b2 34 94 f6 4d 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00
20 c6 fc 83 4e 00 00 00
value (CPU 22):
9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00
3e 0c df 89 4e 00 00 00
value (CPU 23):
18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00
5b 69 ed 83 4e 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00
92 67 4c ba 4e 00 00 00
value (CPU 22):
dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00
d9 32 7a c5 4e 00 00 00
value (CPU 23):
bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00
7c 73 87 bf 4e 00 00 00
Found 1 element
#

# perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

119,410,122 cycles
152,105,479 instructions # 1.27 insn per cycle

0.101395093 seconds time elapsed

#

See? We had the counters enabled all the time.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
7fac83aa Tue Mar 16 15:18:35 MDT 2021 Song Liu <songliubraving@fb.com> perf stat: Introduce 'bperf' to share hardware PMCs with BPF

The perf tool uses performance monitoring counters (PMCs) to monitor
system performance. The PMCs are limited hardware resources. For
example, Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways: system
level monitoring, (maybe nested) container level monitoring, per process
monitoring, profiling (in sample mode), etc. In some cases, there are
more active perf_events than available hardware PMCs. To allow all
perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics
(cycles, instructions). It is a waste to have multiple tools create
multiple perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such wastes by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session to read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The
default path is /sys/fs/bpf/perf_attr_map. The user could change the
path with option --bpf-attr-map.

Committer testing:

# dmesg|grep "Performance Events" -A5
[ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 0.225280] ... version: 0
[ 0.225280] ... bit width: 48
[ 0.225281] ... generic registers: 6
[ 0.225281] ... value mask: 0000ffffffffffff
[ 0.225281] ... max period: 00007fffffffffff
#
# for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
[1] 2436231
[2] 2436232
[3] 2436233
[4] 2436234
[5] 2436235
[6] 2436236
# perf stat -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

310,326,987 cycles (41.87%)
236,143,290 instructions # 0.76 insn per cycle (41.87%)

0.100800885 seconds time elapsed

#

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

# for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
[1] 2436514
[2] 2436515
[3] 2436516
[4] 2436517
[5] 2436518
[6] 2436519
[7] 2436520
[8] 2436521
[9] 2436522
[10] 2436523
[11] 2436524
[12] 2436525
[13] 2436526
[14] 2436527
[15] 2436528
[16] 2436529
[17] 2436530
[18] 2436531
[19] 2436532
[20] 2436533
[21] 2436534
[22] 2436535
[23] 2436536
[24] 2436537
[25] 2436538
[26] 2436539
[27] 2436540
[28] 2436541
[29] 2436542
[30] 2436543
[31] 2436544
[32] 2436545
#
# ls -la /sys/fs/bpf/perf_attr_map
-rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
# bpftool map | grep bperf | wc -l
64
#

# bpftool map | tail
1265: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1266: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1267: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 996
pids perf(2436545)
1268: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1269: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1270: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 997
pids perf(2436541)
1285: array name pid_iter.rodata flags 0x480
key 4B value 4B max_entries 1 memlock 4096B
btf_id 1017 frozen
pids bpftool(2437504)
1286: array flags 0x0
key 4B value 32B max_entries 1 memlock 4096B
#
# bpftool map dump id 1268 | tail
value (CPU 21):
8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00
80 fd 2a d1 4d 00 00 00
value (CPU 22):
7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00
a4 8a 2e ee 4d 00 00 00
value (CPU 23):
a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00
b2 34 94 f6 4d 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00
20 c6 fc 83 4e 00 00 00
value (CPU 22):
9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00
3e 0c df 89 4e 00 00 00
value (CPU 23):
18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00
5b 69 ed 83 4e 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00
92 67 4c ba 4e 00 00 00
value (CPU 22):
dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00
d9 32 7a c5 4e 00 00 00
value (CPU 23):
bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00
7c 73 87 bf 4e 00 00 00
Found 1 element
#

# perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

119,410,122 cycles
152,105,479 instructions # 1.27 insn per cycle

0.101395093 seconds time elapsed

#

See? We had the counters enabled all the time.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
7fac83aa Tue Mar 16 15:18:35 MDT 2021 Song Liu <songliubraving@fb.com> perf stat: Introduce 'bperf' to share hardware PMCs with BPF

The perf tool uses performance monitoring counters (PMCs) to monitor
system performance. The PMCs are limited hardware resources. For
example, Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways: system
level monitoring, (maybe nested) container level monitoring, per process
monitoring, profiling (in sample mode), etc. In some cases, there are
more active perf_events than available hardware PMCs. To allow all
perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics
(cycles, instructions). It is a waste to have multiple tools create
multiple perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such wastes by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session to read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The
default path is /sys/fs/bpf/perf_attr_map. The user could change the
path with option --bpf-attr-map.

Committer testing:

# dmesg|grep "Performance Events" -A5
[ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 0.225280] ... version: 0
[ 0.225280] ... bit width: 48
[ 0.225281] ... generic registers: 6
[ 0.225281] ... value mask: 0000ffffffffffff
[ 0.225281] ... max period: 00007fffffffffff
#
# for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
[1] 2436231
[2] 2436232
[3] 2436233
[4] 2436234
[5] 2436235
[6] 2436236
# perf stat -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

310,326,987 cycles (41.87%)
236,143,290 instructions # 0.76 insn per cycle (41.87%)

0.100800885 seconds time elapsed

#

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

# for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
[1] 2436514
[2] 2436515
[3] 2436516
[4] 2436517
[5] 2436518
[6] 2436519
[7] 2436520
[8] 2436521
[9] 2436522
[10] 2436523
[11] 2436524
[12] 2436525
[13] 2436526
[14] 2436527
[15] 2436528
[16] 2436529
[17] 2436530
[18] 2436531
[19] 2436532
[20] 2436533
[21] 2436534
[22] 2436535
[23] 2436536
[24] 2436537
[25] 2436538
[26] 2436539
[27] 2436540
[28] 2436541
[29] 2436542
[30] 2436543
[31] 2436544
[32] 2436545
#
# ls -la /sys/fs/bpf/perf_attr_map
-rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
# bpftool map | grep bperf | wc -l
64
#

# bpftool map | tail
1265: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1266: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1267: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 996
pids perf(2436545)
1268: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1269: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1270: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 997
pids perf(2436541)
1285: array name pid_iter.rodata flags 0x480
key 4B value 4B max_entries 1 memlock 4096B
btf_id 1017 frozen
pids bpftool(2437504)
1286: array flags 0x0
key 4B value 32B max_entries 1 memlock 4096B
#
# bpftool map dump id 1268 | tail
value (CPU 21):
8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00
80 fd 2a d1 4d 00 00 00
value (CPU 22):
7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00
a4 8a 2e ee 4d 00 00 00
value (CPU 23):
a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00
b2 34 94 f6 4d 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00
20 c6 fc 83 4e 00 00 00
value (CPU 22):
9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00
3e 0c df 89 4e 00 00 00
value (CPU 23):
18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00
5b 69 ed 83 4e 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00
92 67 4c ba 4e 00 00 00
value (CPU 22):
dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00
d9 32 7a c5 4e 00 00 00
value (CPU 23):
bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00
7c 73 87 bf 4e 00 00 00
Found 1 element
#

# perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

119,410,122 cycles
152,105,479 instructions # 1.27 insn per cycle

0.101395093 seconds time elapsed

#

See? We had the counters enabled all the time.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
7fac83aa Tue Mar 16 15:18:35 MDT 2021 Song Liu <songliubraving@fb.com> perf stat: Introduce 'bperf' to share hardware PMCs with BPF

The perf tool uses performance monitoring counters (PMCs) to monitor
system performance. The PMCs are limited hardware resources. For
example, Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways: system
level monitoring, (maybe nested) container level monitoring, per process
monitoring, profiling (in sample mode), etc. In some cases, there are
more active perf_events than available hardware PMCs. To allow all
perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics
(cycles, instructions). It is a waste to have multiple tools create
multiple perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such wastes by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session to read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The
default path is /sys/fs/bpf/perf_attr_map. The user could change the
path with option --bpf-attr-map.

Committer testing:

# dmesg|grep "Performance Events" -A5
[ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 0.225280] ... version: 0
[ 0.225280] ... bit width: 48
[ 0.225281] ... generic registers: 6
[ 0.225281] ... value mask: 0000ffffffffffff
[ 0.225281] ... max period: 00007fffffffffff
#
# for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
[1] 2436231
[2] 2436232
[3] 2436233
[4] 2436234
[5] 2436235
[6] 2436236
# perf stat -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

310,326,987 cycles (41.87%)
236,143,290 instructions # 0.76 insn per cycle (41.87%)

0.100800885 seconds time elapsed

#

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

# for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
[1] 2436514
[2] 2436515
[3] 2436516
[4] 2436517
[5] 2436518
[6] 2436519
[7] 2436520
[8] 2436521
[9] 2436522
[10] 2436523
[11] 2436524
[12] 2436525
[13] 2436526
[14] 2436527
[15] 2436528
[16] 2436529
[17] 2436530
[18] 2436531
[19] 2436532
[20] 2436533
[21] 2436534
[22] 2436535
[23] 2436536
[24] 2436537
[25] 2436538
[26] 2436539
[27] 2436540
[28] 2436541
[29] 2436542
[30] 2436543
[31] 2436544
[32] 2436545
#
# ls -la /sys/fs/bpf/perf_attr_map
-rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
# bpftool map | grep bperf | wc -l
64
#

# bpftool map | tail
1265: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1266: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1267: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 996
pids perf(2436545)
1268: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1269: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1270: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 997
pids perf(2436541)
1285: array name pid_iter.rodata flags 0x480
key 4B value 4B max_entries 1 memlock 4096B
btf_id 1017 frozen
pids bpftool(2437504)
1286: array flags 0x0
key 4B value 32B max_entries 1 memlock 4096B
#
# bpftool map dump id 1268 | tail
value (CPU 21):
8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00
80 fd 2a d1 4d 00 00 00
value (CPU 22):
7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00
a4 8a 2e ee 4d 00 00 00
value (CPU 23):
a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00
b2 34 94 f6 4d 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00
20 c6 fc 83 4e 00 00 00
value (CPU 22):
9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00
3e 0c df 89 4e 00 00 00
value (CPU 23):
18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00
5b 69 ed 83 4e 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00
92 67 4c ba 4e 00 00 00
value (CPU 22):
dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00
d9 32 7a c5 4e 00 00 00
value (CPU 23):
bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00
7c 73 87 bf 4e 00 00 00
Found 1 element
#

# perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

119,410,122 cycles
152,105,479 instructions # 1.27 insn per cycle

0.101395093 seconds time elapsed

#

See? We had the counters enabled all the time.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
7fac83aa Tue Mar 16 15:18:35 MDT 2021 Song Liu <songliubraving@fb.com> perf stat: Introduce 'bperf' to share hardware PMCs with BPF

The perf tool uses performance monitoring counters (PMCs) to monitor
system performance. The PMCs are limited hardware resources. For
example, Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways: system
level monitoring, (maybe nested) container level monitoring, per process
monitoring, profiling (in sample mode), etc. In some cases, there are
more active perf_events than available hardware PMCs. To allow all
perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics
(cycles, instructions). It is a waste to have multiple tools create
multiple perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such wastes by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session to read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The
default path is /sys/fs/bpf/perf_attr_map. The user could change the
path with option --bpf-attr-map.

Committer testing:

# dmesg|grep "Performance Events" -A5
[ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 0.225280] ... version: 0
[ 0.225280] ... bit width: 48
[ 0.225281] ... generic registers: 6
[ 0.225281] ... value mask: 0000ffffffffffff
[ 0.225281] ... max period: 00007fffffffffff
#
# for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
[1] 2436231
[2] 2436232
[3] 2436233
[4] 2436234
[5] 2436235
[6] 2436236
# perf stat -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

310,326,987 cycles (41.87%)
236,143,290 instructions # 0.76 insn per cycle (41.87%)

0.100800885 seconds time elapsed

#

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

# for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
[1] 2436514
[2] 2436515
[3] 2436516
[4] 2436517
[5] 2436518
[6] 2436519
[7] 2436520
[8] 2436521
[9] 2436522
[10] 2436523
[11] 2436524
[12] 2436525
[13] 2436526
[14] 2436527
[15] 2436528
[16] 2436529
[17] 2436530
[18] 2436531
[19] 2436532
[20] 2436533
[21] 2436534
[22] 2436535
[23] 2436536
[24] 2436537
[25] 2436538
[26] 2436539
[27] 2436540
[28] 2436541
[29] 2436542
[30] 2436543
[31] 2436544
[32] 2436545
#
# ls -la /sys/fs/bpf/perf_attr_map
-rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
# bpftool map | grep bperf | wc -l
64
#

# bpftool map | tail
1265: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1266: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1267: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 996
pids perf(2436545)
1268: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1269: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1270: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 997
pids perf(2436541)
1285: array name pid_iter.rodata flags 0x480
key 4B value 4B max_entries 1 memlock 4096B
btf_id 1017 frozen
pids bpftool(2437504)
1286: array flags 0x0
key 4B value 32B max_entries 1 memlock 4096B
#
# bpftool map dump id 1268 | tail
value (CPU 21):
8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00
80 fd 2a d1 4d 00 00 00
value (CPU 22):
7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00
a4 8a 2e ee 4d 00 00 00
value (CPU 23):
a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00
b2 34 94 f6 4d 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00
20 c6 fc 83 4e 00 00 00
value (CPU 22):
9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00
3e 0c df 89 4e 00 00 00
value (CPU 23):
18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00
5b 69 ed 83 4e 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00
92 67 4c ba 4e 00 00 00
value (CPU 22):
dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00
d9 32 7a c5 4e 00 00 00
value (CPU 23):
bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00
7c 73 87 bf 4e 00 00 00
Found 1 element
#

# perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

119,410,122 cycles
152,105,479 instructions # 1.27 insn per cycle

0.101395093 seconds time elapsed

#

See? We had the counters enabled all the time.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
7fac83aa Tue Mar 16 15:18:35 MDT 2021 Song Liu <songliubraving@fb.com> perf stat: Introduce 'bperf' to share hardware PMCs with BPF

The perf tool uses performance monitoring counters (PMCs) to monitor
system performance. The PMCs are limited hardware resources. For
example, Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways: system
level monitoring, (maybe nested) container level monitoring, per process
monitoring, profiling (in sample mode), etc. In some cases, there are
more active perf_events than available hardware PMCs. To allow all
perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics
(cycles, instructions). It is a waste to have multiple tools create
multiple perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such wastes by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session to read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The
default path is /sys/fs/bpf/perf_attr_map. The user could change the
path with option --bpf-attr-map.

Committer testing:

# dmesg|grep "Performance Events" -A5
[ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 0.225280] ... version: 0
[ 0.225280] ... bit width: 48
[ 0.225281] ... generic registers: 6
[ 0.225281] ... value mask: 0000ffffffffffff
[ 0.225281] ... max period: 00007fffffffffff
#
# for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
[1] 2436231
[2] 2436232
[3] 2436233
[4] 2436234
[5] 2436235
[6] 2436236
# perf stat -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

310,326,987 cycles (41.87%)
236,143,290 instructions # 0.76 insn per cycle (41.87%)

0.100800885 seconds time elapsed

#

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

# for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
[1] 2436514
[2] 2436515
[3] 2436516
[4] 2436517
[5] 2436518
[6] 2436519
[7] 2436520
[8] 2436521
[9] 2436522
[10] 2436523
[11] 2436524
[12] 2436525
[13] 2436526
[14] 2436527
[15] 2436528
[16] 2436529
[17] 2436530
[18] 2436531
[19] 2436532
[20] 2436533
[21] 2436534
[22] 2436535
[23] 2436536
[24] 2436537
[25] 2436538
[26] 2436539
[27] 2436540
[28] 2436541
[29] 2436542
[30] 2436543
[31] 2436544
[32] 2436545
#
# ls -la /sys/fs/bpf/perf_attr_map
-rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
# bpftool map | grep bperf | wc -l
64
#

# bpftool map | tail
1265: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1266: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1267: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 996
pids perf(2436545)
1268: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1269: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1270: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 997
pids perf(2436541)
1285: array name pid_iter.rodata flags 0x480
key 4B value 4B max_entries 1 memlock 4096B
btf_id 1017 frozen
pids bpftool(2437504)
1286: array flags 0x0
key 4B value 32B max_entries 1 memlock 4096B
#
# bpftool map dump id 1268 | tail
value (CPU 21):
8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00
80 fd 2a d1 4d 00 00 00
value (CPU 22):
7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00
a4 8a 2e ee 4d 00 00 00
value (CPU 23):
a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00
b2 34 94 f6 4d 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00
20 c6 fc 83 4e 00 00 00
value (CPU 22):
9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00
3e 0c df 89 4e 00 00 00
value (CPU 23):
18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00
5b 69 ed 83 4e 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00
92 67 4c ba 4e 00 00 00
value (CPU 22):
dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00
d9 32 7a c5 4e 00 00 00
value (CPU 23):
bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00
7c 73 87 bf 4e 00 00 00
Found 1 element
#

# perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

119,410,122 cycles
152,105,479 instructions # 1.27 insn per cycle

0.101395093 seconds time elapsed

#

See? We had the counters enabled all the time.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
7fac83aa Tue Mar 16 15:18:35 MDT 2021 Song Liu <songliubraving@fb.com> perf stat: Introduce 'bperf' to share hardware PMCs with BPF

The perf tool uses performance monitoring counters (PMCs) to monitor
system performance. The PMCs are limited hardware resources. For
example, Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways: system
level monitoring, (maybe nested) container level monitoring, per process
monitoring, profiling (in sample mode), etc. In some cases, there are
more active perf_events than available hardware PMCs. To allow all
perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics
(cycles, instructions). It is a waste to have multiple tools create
multiple perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such wastes by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session to read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The
default path is /sys/fs/bpf/perf_attr_map. The user could change the
path with option --bpf-attr-map.

Committer testing:

# dmesg|grep "Performance Events" -A5
[ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 0.225280] ... version: 0
[ 0.225280] ... bit width: 48
[ 0.225281] ... generic registers: 6
[ 0.225281] ... value mask: 0000ffffffffffff
[ 0.225281] ... max period: 00007fffffffffff
#
# for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
[1] 2436231
[2] 2436232
[3] 2436233
[4] 2436234
[5] 2436235
[6] 2436236
# perf stat -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

310,326,987 cycles (41.87%)
236,143,290 instructions # 0.76 insn per cycle (41.87%)

0.100800885 seconds time elapsed

#

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

# for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
[1] 2436514
[2] 2436515
[3] 2436516
[4] 2436517
[5] 2436518
[6] 2436519
[7] 2436520
[8] 2436521
[9] 2436522
[10] 2436523
[11] 2436524
[12] 2436525
[13] 2436526
[14] 2436527
[15] 2436528
[16] 2436529
[17] 2436530
[18] 2436531
[19] 2436532
[20] 2436533
[21] 2436534
[22] 2436535
[23] 2436536
[24] 2436537
[25] 2436538
[26] 2436539
[27] 2436540
[28] 2436541
[29] 2436542
[30] 2436543
[31] 2436544
[32] 2436545
#
# ls -la /sys/fs/bpf/perf_attr_map
-rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
# bpftool map | grep bperf | wc -l
64
#

# bpftool map | tail
1265: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1266: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1267: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 996
pids perf(2436545)
1268: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1269: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1270: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 997
pids perf(2436541)
1285: array name pid_iter.rodata flags 0x480
key 4B value 4B max_entries 1 memlock 4096B
btf_id 1017 frozen
pids bpftool(2437504)
1286: array flags 0x0
key 4B value 32B max_entries 1 memlock 4096B
#
# bpftool map dump id 1268 | tail
value (CPU 21):
8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00
80 fd 2a d1 4d 00 00 00
value (CPU 22):
7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00
a4 8a 2e ee 4d 00 00 00
value (CPU 23):
a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00
b2 34 94 f6 4d 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00
20 c6 fc 83 4e 00 00 00
value (CPU 22):
9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00
3e 0c df 89 4e 00 00 00
value (CPU 23):
18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00
5b 69 ed 83 4e 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00
92 67 4c ba 4e 00 00 00
value (CPU 22):
dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00
d9 32 7a c5 4e 00 00 00
value (CPU 23):
bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00
7c 73 87 bf 4e 00 00 00
Found 1 element
#

# perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

119,410,122 cycles
152,105,479 instructions # 1.27 insn per cycle

0.101395093 seconds time elapsed

#

See? We had the counters enabled all the time.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
H A D bperf_leader.bpf.c    diff 5a897531 Fri Dec 03 16:14:41 MST 2021 Song Liu <songliubraving@fb.com> perf bpf_skel: Do not use typedef to avoid error on old clang

When building the BPF skeletons (bpf_skel) with clang-10, the typedef causes confusion like:

libbpf: map 'prev_readings': unexpected def kind var.

Fix this by removing the typedef.

Fixes: 7fac83aaf2eecc9e ("perf stat: Introduce 'bperf' to share hardware PMCs with BPF")
Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lore.kernel.org/lkml/BEF5C312-4331-4A60-AEC0-AD7617CB2BC4@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
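
For illustration, a minimal sketch of the kind of change described, assuming (judging from the error message above) that the typedef wrapped the map definition; the names below are illustrative rather than the exact hunk:

/* Sketch only: declare the map through a plain anonymous struct instead of
 * a typedef'd definition type, which older clang/libbpf BTF handling can
 * misclassify ("unexpected def kind var").
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Before (confuses libbpf when built with clang-10):
 *
 *   typedef struct {
 *           __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
 *           __uint(key_size, sizeof(__u32));
 *           __uint(value_size, sizeof(struct bpf_perf_event_value));
 *           __uint(max_entries, 1);
 *   } reading_map;
 *
 *   reading_map prev_readings SEC(".maps");
 */

/* After: no typedef, the map definition is spelled out directly. */
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(struct bpf_perf_event_value));
	__uint(max_entries, 1);
} prev_readings SEC(".maps");
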
7fac83aa Tue Mar 16 15:18:35 MDT 2021 Song Liu <songliubraving@fb.com> perf stat: Introduce 'bperf' to share hardware PMCs with BPF

The perf tool uses performance monitoring counters (PMCs) to monitor
system performance. PMCs are a limited hardware resource; for example,
Intel CPUs have three fixed and four programmable PMCs per CPU.

Modern data center systems use these PMCs in many different ways: system
level monitoring, (maybe nested) container level monitoring, per process
monitoring, profiling (in sample mode), etc. In some cases, there are
more active perf_events than available hardware PMCs. To allow all
perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics
(cycles, instructions). It is a waste to have multiple tools create
multiple perf_events of "cycles" and occupy multiple PMCs.

bperf reduces this waste by allowing multiple perf_events for "cycles" or
"instructions" (at different scopes) to share PMUs. Instead of having each
perf-stat session read its own perf_events, bperf uses BPF programs to read
the perf_events and aggregate the readings into BPF maps. The perf-stat
session(s) then read the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.
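
As orientation only (this is not the actual bperf_leader.bpf.c; the map, section and field names here are illustrative), a leader-style program along these lines reads the hardware counter on every context switch and accumulates the delta into a per-CPU map that the perf-stat sessions later read:

// Illustrative sketch of a "leader" BPF program for counter sharing.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(int));
} events SEC(".maps");	/* perf event fds installed per CPU by user space */

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(struct bpf_perf_event_value));
	__uint(max_entries, 1);
} prev_readings SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(struct bpf_perf_event_value));
	__uint(max_entries, 1);
} accum_readings SEC(".maps");

SEC("raw_tp/sched_switch")
int on_switch(void *ctx)
{
	struct bpf_perf_event_value val, *prev, *accum;
	__u32 zero = 0;

	prev = bpf_map_lookup_elem(&prev_readings, &zero);
	if (!prev)
		return 0;
	accum = bpf_map_lookup_elem(&accum_readings, &zero);
	if (!accum)
		return 0;

	/* Read the shared hardware counter for the current CPU. */
	if (bpf_perf_event_read_value(&events, BPF_F_CURRENT_CPU, &val, sizeof(val)))
		return 0;

	/* Accumulate the delta since the previous switch, remember this reading. */
	accum->counter += val.counter - prev->counter;
	accum->enabled += val.enabled - prev->enabled;
	accum->running += val.running - prev->running;
	*prev = val;

	return 0;
}

char LICENSE[] SEC("license") = "Dual BSD/GPL";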

bperf is off by default; to enable it, pass the --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about the BPF
programs and maps it uses. This map is pinned to bpffs; the default path is
/sys/fs/bpf/perf_attr_map, and it can be changed with the --bpf-attr-map
option.
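
To sketch how concurrent sessions could find each other through that pinned map (the key/value layout below is an assumption for illustration, not necessarily what bperf actually stores), a perf-stat-like session might do something along these lines with libbpf:

/* Hypothetical sketch: struct attr_map_entry is assumed for illustration;
 * bpf_obj_get() and bpf_map_lookup_elem() are real libbpf calls.
 */
#include <linux/perf_event.h>
#include <bpf/bpf.h>

struct attr_map_entry {		/* assumed value layout */
	__u32 leader_prog_id;
	__u32 diff_map_id;
};

/* Look up an existing leader for this event attr in the pinned map
 * (e.g. /sys/fs/bpf/perf_attr_map, or whatever --bpf-attr-map names).
 * Returns the map fd; *found says whether a leader is already registered.
 */
int lookup_shared_leader(const char *path, struct perf_event_attr *attr,
			 struct attr_map_entry *entry, int *found)
{
	int map_fd = bpf_obj_get(path);

	if (map_fd < 0)
		return -1;	/* no shared map yet: first session creates and pins it */

	*found = (bpf_map_lookup_elem(map_fd, attr, entry) == 0);
	return map_fd;
}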

Committer testing:

# dmesg|grep "Performance Events" -A5
[ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 0.225280] ... version: 0
[ 0.225280] ... bit width: 48
[ 0.225281] ... generic registers: 6
[ 0.225281] ... value mask: 0000ffffffffffff
[ 0.225281] ... max period: 00007fffffffffff
#
# for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
[1] 2436231
[2] 2436232
[3] 2436233
[4] 2436234
[5] 2436235
[6] 2436236
# perf stat -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

310,326,987 cycles (41.87%)
236,143,290 instructions # 0.76 insn per cycle (41.87%)

0.100800885 seconds time elapsed

#

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

# for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
[1] 2436514
[2] 2436515
[3] 2436516
[4] 2436517
[5] 2436518
[6] 2436519
[7] 2436520
[8] 2436521
[9] 2436522
[10] 2436523
[11] 2436524
[12] 2436525
[13] 2436526
[14] 2436527
[15] 2436528
[16] 2436529
[17] 2436530
[18] 2436531
[19] 2436532
[20] 2436533
[21] 2436534
[22] 2436535
[23] 2436536
[24] 2436537
[25] 2436538
[26] 2436539
[27] 2436540
[28] 2436541
[29] 2436542
[30] 2436543
[31] 2436544
[32] 2436545
#
# ls -la /sys/fs/bpf/perf_attr_map
-rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
# bpftool map | grep bperf | wc -l
64
#

# bpftool map | tail
1265: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1266: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1267: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 996
pids perf(2436545)
1268: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1269: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1270: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 997
pids perf(2436541)
1285: array name pid_iter.rodata flags 0x480
key 4B value 4B max_entries 1 memlock 4096B
btf_id 1017 frozen
pids bpftool(2437504)
1286: array flags 0x0
key 4B value 32B max_entries 1 memlock 4096B
#
# bpftool map dump id 1268 | tail
value (CPU 21):
8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00
80 fd 2a d1 4d 00 00 00
value (CPU 22):
7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00
a4 8a 2e ee 4d 00 00 00
value (CPU 23):
a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00
b2 34 94 f6 4d 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00
20 c6 fc 83 4e 00 00 00
value (CPU 22):
9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00
3e 0c df 89 4e 00 00 00
value (CPU 23):
18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00
5b 69 ed 83 4e 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00
92 67 4c ba 4e 00 00 00
value (CPU 22):
dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00
d9 32 7a c5 4e 00 00 00
value (CPU 23):
bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00
7c 73 87 bf 4e 00 00 00
Found 1 element
#

# perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

119,410,122 cycles
152,105,479 instructions # 1.27 insn per cycle

0.101395093 seconds time elapsed

#

See? We had the counters enabled all the time.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
7fac83aa Tue Mar 16 15:18:35 MDT 2021 Song Liu <songliubraving@fb.com> perf stat: Introduce 'bperf' to share hardware PMCs with BPF

The perf tool uses performance monitoring counters (PMCs) to monitor
system performance. The PMCs are limited hardware resources. For
example, Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways: system
level monitoring, (maybe nested) container level monitoring, per process
monitoring, profiling (in sample mode), etc. In some cases, there are
more active perf_events than available hardware PMCs. To allow all
perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics
(cycles, instructions). It is a waste to have multiple tools create
multiple perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such wastes by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session to read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The
default path is /sys/fs/bpf/perf_attr_map. The user could change the
path with option --bpf-attr-map.

Committer testing:

# dmesg|grep "Performance Events" -A5
[ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 0.225280] ... version: 0
[ 0.225280] ... bit width: 48
[ 0.225281] ... generic registers: 6
[ 0.225281] ... value mask: 0000ffffffffffff
[ 0.225281] ... max period: 00007fffffffffff
#
# for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
[1] 2436231
[2] 2436232
[3] 2436233
[4] 2436234
[5] 2436235
[6] 2436236
# perf stat -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

310,326,987 cycles (41.87%)
236,143,290 instructions # 0.76 insn per cycle (41.87%)

0.100800885 seconds time elapsed

#

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

# for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
[1] 2436514
[2] 2436515
[3] 2436516
[4] 2436517
[5] 2436518
[6] 2436519
[7] 2436520
[8] 2436521
[9] 2436522
[10] 2436523
[11] 2436524
[12] 2436525
[13] 2436526
[14] 2436527
[15] 2436528
[16] 2436529
[17] 2436530
[18] 2436531
[19] 2436532
[20] 2436533
[21] 2436534
[22] 2436535
[23] 2436536
[24] 2436537
[25] 2436538
[26] 2436539
[27] 2436540
[28] 2436541
[29] 2436542
[30] 2436543
[31] 2436544
[32] 2436545
#
# ls -la /sys/fs/bpf/perf_attr_map
-rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
# bpftool map | grep bperf | wc -l
64
#

# bpftool map | tail
1265: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1266: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1267: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 996
pids perf(2436545)
1268: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1269: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1270: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 997
pids perf(2436541)
1285: array name pid_iter.rodata flags 0x480
key 4B value 4B max_entries 1 memlock 4096B
btf_id 1017 frozen
pids bpftool(2437504)
1286: array flags 0x0
key 4B value 32B max_entries 1 memlock 4096B
#
# bpftool map dump id 1268 | tail
value (CPU 21):
8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00
80 fd 2a d1 4d 00 00 00
value (CPU 22):
7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00
a4 8a 2e ee 4d 00 00 00
value (CPU 23):
a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00
b2 34 94 f6 4d 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00
20 c6 fc 83 4e 00 00 00
value (CPU 22):
9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00
3e 0c df 89 4e 00 00 00
value (CPU 23):
18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00
5b 69 ed 83 4e 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00
92 67 4c ba 4e 00 00 00
value (CPU 22):
dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00
d9 32 7a c5 4e 00 00 00
value (CPU 23):
bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00
7c 73 87 bf 4e 00 00 00
Found 1 element
#

# perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

119,410,122 cycles
152,105,479 instructions # 1.27 insn per cycle

0.101395093 seconds time elapsed

#

See? We had the counters enabled all the time.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
7fac83aa Tue Mar 16 15:18:35 MDT 2021 Song Liu <songliubraving@fb.com> perf stat: Introduce 'bperf' to share hardware PMCs with BPF

The perf tool uses performance monitoring counters (PMCs) to monitor
system performance. The PMCs are limited hardware resources. For
example, Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways: system
level monitoring, (maybe nested) container level monitoring, per process
monitoring, profiling (in sample mode), etc. In some cases, there are
more active perf_events than available hardware PMCs. To allow all
perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics
(cycles, instructions). It is a waste to have multiple tools create
multiple perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such wastes by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session to read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The
default path is /sys/fs/bpf/perf_attr_map. The user could change the
path with option --bpf-attr-map.

Committer testing:

# dmesg|grep "Performance Events" -A5
[ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 0.225280] ... version: 0
[ 0.225280] ... bit width: 48
[ 0.225281] ... generic registers: 6
[ 0.225281] ... value mask: 0000ffffffffffff
[ 0.225281] ... max period: 00007fffffffffff
#
# for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
[1] 2436231
[2] 2436232
[3] 2436233
[4] 2436234
[5] 2436235
[6] 2436236
# perf stat -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

310,326,987 cycles (41.87%)
236,143,290 instructions # 0.76 insn per cycle (41.87%)

0.100800885 seconds time elapsed

#

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

# for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
[1] 2436514
[2] 2436515
[3] 2436516
[4] 2436517
[5] 2436518
[6] 2436519
[7] 2436520
[8] 2436521
[9] 2436522
[10] 2436523
[11] 2436524
[12] 2436525
[13] 2436526
[14] 2436527
[15] 2436528
[16] 2436529
[17] 2436530
[18] 2436531
[19] 2436532
[20] 2436533
[21] 2436534
[22] 2436535
[23] 2436536
[24] 2436537
[25] 2436538
[26] 2436539
[27] 2436540
[28] 2436541
[29] 2436542
[30] 2436543
[31] 2436544
[32] 2436545
#
# ls -la /sys/fs/bpf/perf_attr_map
-rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
# bpftool map | grep bperf | wc -l
64
#

# bpftool map | tail
1265: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1266: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1267: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 996
pids perf(2436545)
1268: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1269: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1270: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 997
pids perf(2436541)
1285: array name pid_iter.rodata flags 0x480
key 4B value 4B max_entries 1 memlock 4096B
btf_id 1017 frozen
pids bpftool(2437504)
1286: array flags 0x0
key 4B value 32B max_entries 1 memlock 4096B
#
# bpftool map dump id 1268 | tail
value (CPU 21):
8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00
80 fd 2a d1 4d 00 00 00
value (CPU 22):
7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00
a4 8a 2e ee 4d 00 00 00
value (CPU 23):
a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00
b2 34 94 f6 4d 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00
20 c6 fc 83 4e 00 00 00
value (CPU 22):
9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00
3e 0c df 89 4e 00 00 00
value (CPU 23):
18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00
5b 69 ed 83 4e 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00
92 67 4c ba 4e 00 00 00
value (CPU 22):
dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00
d9 32 7a c5 4e 00 00 00
value (CPU 23):
bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00
7c 73 87 bf 4e 00 00 00
Found 1 element
#

# perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

119,410,122 cycles
152,105,479 instructions # 1.27 insn per cycle

0.101395093 seconds time elapsed

#

See? We had the counters enabled all the time.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
7fac83aa Tue Mar 16 15:18:35 MDT 2021 Song Liu <songliubraving@fb.com> perf stat: Introduce 'bperf' to share hardware PMCs with BPF

The perf tool uses performance monitoring counters (PMCs) to monitor
system performance. The PMCs are limited hardware resources. For
example, Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways: system
level monitoring, (maybe nested) container level monitoring, per process
monitoring, profiling (in sample mode), etc. In some cases, there are
more active perf_events than available hardware PMCs. To allow all
perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics
(cycles, instructions). It is a waste to have multiple tools create
multiple perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such wastes by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session to read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The
default path is /sys/fs/bpf/perf_attr_map. The user could change the
path with option --bpf-attr-map.

Committer testing:

# dmesg|grep "Performance Events" -A5
[ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 0.225280] ... version: 0
[ 0.225280] ... bit width: 48
[ 0.225281] ... generic registers: 6
[ 0.225281] ... value mask: 0000ffffffffffff
[ 0.225281] ... max period: 00007fffffffffff
#
# for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
[1] 2436231
[2] 2436232
[3] 2436233
[4] 2436234
[5] 2436235
[6] 2436236
# perf stat -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

310,326,987 cycles (41.87%)
236,143,290 instructions # 0.76 insn per cycle (41.87%)

0.100800885 seconds time elapsed

#

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

# for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
[1] 2436514
[2] 2436515
[3] 2436516
[4] 2436517
[5] 2436518
[6] 2436519
[7] 2436520
[8] 2436521
[9] 2436522
[10] 2436523
[11] 2436524
[12] 2436525
[13] 2436526
[14] 2436527
[15] 2436528
[16] 2436529
[17] 2436530
[18] 2436531
[19] 2436532
[20] 2436533
[21] 2436534
[22] 2436535
[23] 2436536
[24] 2436537
[25] 2436538
[26] 2436539
[27] 2436540
[28] 2436541
[29] 2436542
[30] 2436543
[31] 2436544
[32] 2436545
#
# ls -la /sys/fs/bpf/perf_attr_map
-rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
# bpftool map | grep bperf | wc -l
64
#

# bpftool map | tail
1265: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1266: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1267: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 996
pids perf(2436545)
1268: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1269: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1270: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 997
pids perf(2436541)
1285: array name pid_iter.rodata flags 0x480
key 4B value 4B max_entries 1 memlock 4096B
btf_id 1017 frozen
pids bpftool(2437504)
1286: array flags 0x0
key 4B value 32B max_entries 1 memlock 4096B
#
# bpftool map dump id 1268 | tail
value (CPU 21):
8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00
80 fd 2a d1 4d 00 00 00
value (CPU 22):
7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00
a4 8a 2e ee 4d 00 00 00
value (CPU 23):
a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00
b2 34 94 f6 4d 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00
20 c6 fc 83 4e 00 00 00
value (CPU 22):
9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00
3e 0c df 89 4e 00 00 00
value (CPU 23):
18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00
5b 69 ed 83 4e 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00
92 67 4c ba 4e 00 00 00
value (CPU 22):
dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00
d9 32 7a c5 4e 00 00 00
value (CPU 23):
bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00
7c 73 87 bf 4e 00 00 00
Found 1 element
#

# perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

119,410,122 cycles
152,105,479 instructions # 1.27 insn per cycle

0.101395093 seconds time elapsed

#

See? We had the counters enabled all the time.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
7fac83aa Tue Mar 16 15:18:35 MDT 2021 Song Liu <songliubraving@fb.com> perf stat: Introduce 'bperf' to share hardware PMCs with BPF

The perf tool uses performance monitoring counters (PMCs) to monitor
system performance. The PMCs are limited hardware resources. For
example, Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways: system
level monitoring, (maybe nested) container level monitoring, per process
monitoring, profiling (in sample mode), etc. In some cases, there are
more active perf_events than available hardware PMCs. To allow all
perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics
(cycles, instructions). It is a waste to have multiple tools create
multiple perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such wastes by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session to read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The
default path is /sys/fs/bpf/perf_attr_map. The user could change the
path with option --bpf-attr-map.

Committer testing:

# dmesg|grep "Performance Events" -A5
[ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 0.225280] ... version: 0
[ 0.225280] ... bit width: 48
[ 0.225281] ... generic registers: 6
[ 0.225281] ... value mask: 0000ffffffffffff
[ 0.225281] ... max period: 00007fffffffffff
#
# for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
[1] 2436231
[2] 2436232
[3] 2436233
[4] 2436234
[5] 2436235
[6] 2436236
# perf stat -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

310,326,987 cycles (41.87%)
236,143,290 instructions # 0.76 insn per cycle (41.87%)

0.100800885 seconds time elapsed

#

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

# for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
[1] 2436514
[2] 2436515
[3] 2436516
[4] 2436517
[5] 2436518
[6] 2436519
[7] 2436520
[8] 2436521
[9] 2436522
[10] 2436523
[11] 2436524
[12] 2436525
[13] 2436526
[14] 2436527
[15] 2436528
[16] 2436529
[17] 2436530
[18] 2436531
[19] 2436532
[20] 2436533
[21] 2436534
[22] 2436535
[23] 2436536
[24] 2436537
[25] 2436538
[26] 2436539
[27] 2436540
[28] 2436541
[29] 2436542
[30] 2436543
[31] 2436544
[32] 2436545
#
# ls -la /sys/fs/bpf/perf_attr_map
-rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
# bpftool map | grep bperf | wc -l
64
#

# bpftool map | tail
1265: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1266: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1267: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 996
pids perf(2436545)
1268: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1269: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1270: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 997
pids perf(2436541)
1285: array name pid_iter.rodata flags 0x480
key 4B value 4B max_entries 1 memlock 4096B
btf_id 1017 frozen
pids bpftool(2437504)
1286: array flags 0x0
key 4B value 32B max_entries 1 memlock 4096B
#
# bpftool map dump id 1268 | tail
value (CPU 21):
8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00
80 fd 2a d1 4d 00 00 00
value (CPU 22):
7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00
a4 8a 2e ee 4d 00 00 00
value (CPU 23):
a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00
b2 34 94 f6 4d 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00
20 c6 fc 83 4e 00 00 00
value (CPU 22):
9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00
3e 0c df 89 4e 00 00 00
value (CPU 23):
18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00
5b 69 ed 83 4e 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00
92 67 4c ba 4e 00 00 00
value (CPU 22):
dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00
d9 32 7a c5 4e 00 00 00
value (CPU 23):
bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00
7c 73 87 bf 4e 00 00 00
Found 1 element
#

# perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

119,410,122 cycles
152,105,479 instructions # 1.27 insn per cycle

0.101395093 seconds time elapsed

#

See? We had the counters enabled all the time.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
7fac83aa Tue Mar 16 15:18:35 MDT 2021 Song Liu <songliubraving@fb.com> perf stat: Introduce 'bperf' to share hardware PMCs with BPF

The perf tool uses performance monitoring counters (PMCs) to monitor
system performance. The PMCs are limited hardware resources. For
example, Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways: system
level monitoring, (maybe nested) container level monitoring, per process
monitoring, profiling (in sample mode), etc. In some cases, there are
more active perf_events than available hardware PMCs. To allow all
perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics
(cycles, instructions). It is a waste to have multiple tools create
multiple perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such wastes by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session to read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The
default path is /sys/fs/bpf/perf_attr_map. The user could change the
path with option --bpf-attr-map.

Committer testing:

# dmesg|grep "Performance Events" -A5
[ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 0.225280] ... version: 0
[ 0.225280] ... bit width: 48
[ 0.225281] ... generic registers: 6
[ 0.225281] ... value mask: 0000ffffffffffff
[ 0.225281] ... max period: 00007fffffffffff
#
# for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
[1] 2436231
[2] 2436232
[3] 2436233
[4] 2436234
[5] 2436235
[6] 2436236
# perf stat -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

310,326,987 cycles (41.87%)
236,143,290 instructions # 0.76 insn per cycle (41.87%)

0.100800885 seconds time elapsed

#

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

# for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
[1] 2436514
[2] 2436515
[3] 2436516
[4] 2436517
[5] 2436518
[6] 2436519
[7] 2436520
[8] 2436521
[9] 2436522
[10] 2436523
[11] 2436524
[12] 2436525
[13] 2436526
[14] 2436527
[15] 2436528
[16] 2436529
[17] 2436530
[18] 2436531
[19] 2436532
[20] 2436533
[21] 2436534
[22] 2436535
[23] 2436536
[24] 2436537
[25] 2436538
[26] 2436539
[27] 2436540
[28] 2436541
[29] 2436542
[30] 2436543
[31] 2436544
[32] 2436545
#
# ls -la /sys/fs/bpf/perf_attr_map
-rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
# bpftool map | grep bperf | wc -l
64
#

# bpftool map | tail
1265: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1266: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1267: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 996
pids perf(2436545)
1268: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1269: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1270: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 997
pids perf(2436541)
1285: array name pid_iter.rodata flags 0x480
key 4B value 4B max_entries 1 memlock 4096B
btf_id 1017 frozen
pids bpftool(2437504)
1286: array flags 0x0
key 4B value 32B max_entries 1 memlock 4096B
#
# bpftool map dump id 1268 | tail
value (CPU 21):
8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00
80 fd 2a d1 4d 00 00 00
value (CPU 22):
7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00
a4 8a 2e ee 4d 00 00 00
value (CPU 23):
a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00
b2 34 94 f6 4d 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00
20 c6 fc 83 4e 00 00 00
value (CPU 22):
9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00
3e 0c df 89 4e 00 00 00
value (CPU 23):
18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00
5b 69 ed 83 4e 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00
92 67 4c ba 4e 00 00 00
value (CPU 22):
dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00
d9 32 7a c5 4e 00 00 00
value (CPU 23):
bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00
7c 73 87 bf 4e 00 00 00
Found 1 element
#

# perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

119,410,122 cycles
152,105,479 instructions # 1.27 insn per cycle

0.101395093 seconds time elapsed

#

See? We had the counters enabled all the time.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
7fac83aa Tue Mar 16 15:18:35 MDT 2021 Song Liu <songliubraving@fb.com> perf stat: Introduce 'bperf' to share hardware PMCs with BPF

The perf tool uses performance monitoring counters (PMCs) to monitor
system performance. The PMCs are limited hardware resources. For
example, Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways: system
level monitoring, (maybe nested) container level monitoring, per process
monitoring, profiling (in sample mode), etc. In some cases, there are
more active perf_events than available hardware PMCs. To allow all
perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics
(cycles, instructions). It is a waste to have multiple tools create
multiple perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such wastes by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session to read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The
default path is /sys/fs/bpf/perf_attr_map. The user could change the
path with option --bpf-attr-map.

Committer testing:

# dmesg|grep "Performance Events" -A5
[ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 0.225280] ... version: 0
[ 0.225280] ... bit width: 48
[ 0.225281] ... generic registers: 6
[ 0.225281] ... value mask: 0000ffffffffffff
[ 0.225281] ... max period: 00007fffffffffff
#
# for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
[1] 2436231
[2] 2436232
[3] 2436233
[4] 2436234
[5] 2436235
[6] 2436236
# perf stat -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

310,326,987 cycles (41.87%)
236,143,290 instructions # 0.76 insn per cycle (41.87%)

0.100800885 seconds time elapsed

#

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

# for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
[1] 2436514
[2] 2436515
[3] 2436516
[4] 2436517
[5] 2436518
[6] 2436519
[7] 2436520
[8] 2436521
[9] 2436522
[10] 2436523
[11] 2436524
[12] 2436525
[13] 2436526
[14] 2436527
[15] 2436528
[16] 2436529
[17] 2436530
[18] 2436531
[19] 2436532
[20] 2436533
[21] 2436534
[22] 2436535
[23] 2436536
[24] 2436537
[25] 2436538
[26] 2436539
[27] 2436540
[28] 2436541
[29] 2436542
[30] 2436543
[31] 2436544
[32] 2436545
#
# ls -la /sys/fs/bpf/perf_attr_map
-rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
# bpftool map | grep bperf | wc -l
64
#

# bpftool map | tail
1265: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1266: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1267: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 996
pids perf(2436545)
1268: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1269: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1270: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 997
pids perf(2436541)
1285: array name pid_iter.rodata flags 0x480
key 4B value 4B max_entries 1 memlock 4096B
btf_id 1017 frozen
pids bpftool(2437504)
1286: array flags 0x0
key 4B value 32B max_entries 1 memlock 4096B
#
# bpftool map dump id 1268 | tail
value (CPU 21):
8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00
80 fd 2a d1 4d 00 00 00
value (CPU 22):
7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00
a4 8a 2e ee 4d 00 00 00
value (CPU 23):
a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00
b2 34 94 f6 4d 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00
20 c6 fc 83 4e 00 00 00
value (CPU 22):
9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00
3e 0c df 89 4e 00 00 00
value (CPU 23):
18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00
5b 69 ed 83 4e 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00
92 67 4c ba 4e 00 00 00
value (CPU 22):
dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00
d9 32 7a c5 4e 00 00 00
value (CPU 23):
bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00
7c 73 87 bf 4e 00 00 00
Found 1 element
#

# perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

119,410,122 cycles
152,105,479 instructions # 1.27 insn per cycle

0.101395093 seconds time elapsed

#

See? We had the counters enabled all the time.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
7fac83aa Tue Mar 16 15:18:35 MDT 2021 Song Liu <songliubraving@fb.com> perf stat: Introduce 'bperf' to share hardware PMCs with BPF

The perf tool uses performance monitoring counters (PMCs) to monitor
system performance. The PMCs are limited hardware resources. For
example, Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways: system
level monitoring, (maybe nested) container level monitoring, per process
monitoring, profiling (in sample mode), etc. In some cases, there are
more active perf_events than available hardware PMCs. To allow all
perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics
(cycles, instructions). It is a waste to have multiple tools create
multiple perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such wastes by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session to read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The
default path is /sys/fs/bpf/perf_attr_map. The user could change the
path with option --bpf-attr-map.

Committer testing:

# dmesg|grep "Performance Events" -A5
[ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 0.225280] ... version: 0
[ 0.225280] ... bit width: 48
[ 0.225281] ... generic registers: 6
[ 0.225281] ... value mask: 0000ffffffffffff
[ 0.225281] ... max period: 00007fffffffffff
#
# for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
[1] 2436231
[2] 2436232
[3] 2436233
[4] 2436234
[5] 2436235
[6] 2436236
# perf stat -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

310,326,987 cycles (41.87%)
236,143,290 instructions # 0.76 insn per cycle (41.87%)

0.100800885 seconds time elapsed

#

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

# for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
[1] 2436514
[2] 2436515
[3] 2436516
[4] 2436517
[5] 2436518
[6] 2436519
[7] 2436520
[8] 2436521
[9] 2436522
[10] 2436523
[11] 2436524
[12] 2436525
[13] 2436526
[14] 2436527
[15] 2436528
[16] 2436529
[17] 2436530
[18] 2436531
[19] 2436532
[20] 2436533
[21] 2436534
[22] 2436535
[23] 2436536
[24] 2436537
[25] 2436538
[26] 2436539
[27] 2436540
[28] 2436541
[29] 2436542
[30] 2436543
[31] 2436544
[32] 2436545
#
# ls -la /sys/fs/bpf/perf_attr_map
-rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
# bpftool map | grep bperf | wc -l
64
#

# bpftool map | tail
1265: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1266: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1267: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 996
pids perf(2436545)
1268: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1269: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1270: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 997
pids perf(2436541)
1285: array name pid_iter.rodata flags 0x480
key 4B value 4B max_entries 1 memlock 4096B
btf_id 1017 frozen
pids bpftool(2437504)
1286: array flags 0x0
key 4B value 32B max_entries 1 memlock 4096B
#
# bpftool map dump id 1268 | tail
value (CPU 21):
8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00
80 fd 2a d1 4d 00 00 00
value (CPU 22):
7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00
a4 8a 2e ee 4d 00 00 00
value (CPU 23):
a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00
b2 34 94 f6 4d 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00
20 c6 fc 83 4e 00 00 00
value (CPU 22):
9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00
3e 0c df 89 4e 00 00 00
value (CPU 23):
18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00
5b 69 ed 83 4e 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00
92 67 4c ba 4e 00 00 00
value (CPU 22):
dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00
d9 32 7a c5 4e 00 00 00
value (CPU 23):
bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00
7c 73 87 bf 4e 00 00 00
Found 1 element
#

# perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

119,410,122 cycles
152,105,479 instructions # 1.27 insn per cycle

0.101395093 seconds time elapsed

#

See? We had the counters enabled all the time.
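
Incidentally, the 24-byte accum_readings values in the dumps above are
most likely that same counter/enabled/running triple, dumped as raw
bytes in host (little-endian) order. A throwaway decoder for the first
"value (CPU 21)" entry, just to illustrate the assumed layout:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
        /* Bytes copied from the first "value (CPU 21)" dump above. */
        const uint8_t raw[24] = {
                0x8f, 0xf3, 0xbc, 0xca, 0x00, 0x00, 0x00, 0x00,
                0x80, 0xfd, 0x2a, 0xd1, 0x4d, 0x00, 0x00, 0x00,
                0x80, 0xfd, 0x2a, 0xd1, 0x4d, 0x00, 0x00, 0x00,
        };
        uint64_t v[3];

        /* Reinterpret on the same little-endian host the dump came from. */
        memcpy(v, raw, sizeof(v));
        printf("counter=%llu enabled=%llu running=%llu\n",
               (unsigned long long)v[0], (unsigned long long)v[1],
               (unsigned long long)v[2]);
        return 0;
}

Note that the enabled and running fields are identical, i.e. no
multiplexing, which is exactly what the scaled-free output above shows.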

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
7fac83aa Tue Mar 16 15:18:35 MDT 2021 Song Liu <songliubraving@fb.com> perf stat: Introduce 'bperf' to share hardware PMCs with BPF

The perf tool uses performance monitoring counters (PMCs) to monitor
system performance. The PMCs are limited hardware resources. For
example, Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways: system
level monitoring, (maybe nested) container level monitoring, per process
monitoring, profiling (in sample mode), etc. In some cases, there are
more active perf_events than available hardware PMCs. To allow all
perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics
(cycles, instructions). It is a waste to have multiple tools create
multiple perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such wastes by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session to read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The
default path is /sys/fs/bpf/perf_attr_map. The user could change the
path with option --bpf-attr-map.

Committer testing:

# dmesg|grep "Performance Events" -A5
[ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 0.225280] ... version: 0
[ 0.225280] ... bit width: 48
[ 0.225281] ... generic registers: 6
[ 0.225281] ... value mask: 0000ffffffffffff
[ 0.225281] ... max period: 00007fffffffffff
#
# for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
[1] 2436231
[2] 2436232
[3] 2436233
[4] 2436234
[5] 2436235
[6] 2436236
# perf stat -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

310,326,987 cycles (41.87%)
236,143,290 instructions # 0.76 insn per cycle (41.87%)

0.100800885 seconds time elapsed

#

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

# for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
[1] 2436514
[2] 2436515
[3] 2436516
[4] 2436517
[5] 2436518
[6] 2436519
[7] 2436520
[8] 2436521
[9] 2436522
[10] 2436523
[11] 2436524
[12] 2436525
[13] 2436526
[14] 2436527
[15] 2436528
[16] 2436529
[17] 2436530
[18] 2436531
[19] 2436532
[20] 2436533
[21] 2436534
[22] 2436535
[23] 2436536
[24] 2436537
[25] 2436538
[26] 2436539
[27] 2436540
[28] 2436541
[29] 2436542
[30] 2436543
[31] 2436544
[32] 2436545
#
# ls -la /sys/fs/bpf/perf_attr_map
-rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
# bpftool map | grep bperf | wc -l
64
#

# bpftool map | tail
1265: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1266: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1267: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 996
pids perf(2436545)
1268: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1269: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1270: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 997
pids perf(2436541)
1285: array name pid_iter.rodata flags 0x480
key 4B value 4B max_entries 1 memlock 4096B
btf_id 1017 frozen
pids bpftool(2437504)
1286: array flags 0x0
key 4B value 32B max_entries 1 memlock 4096B
#
# bpftool map dump id 1268 | tail
value (CPU 21):
8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00
80 fd 2a d1 4d 00 00 00
value (CPU 22):
7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00
a4 8a 2e ee 4d 00 00 00
value (CPU 23):
a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00
b2 34 94 f6 4d 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00
20 c6 fc 83 4e 00 00 00
value (CPU 22):
9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00
3e 0c df 89 4e 00 00 00
value (CPU 23):
18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00
5b 69 ed 83 4e 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00
92 67 4c ba 4e 00 00 00
value (CPU 22):
dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00
d9 32 7a c5 4e 00 00 00
value (CPU 23):
bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00
7c 73 87 bf 4e 00 00 00
Found 1 element
#

# perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

119,410,122 cycles
152,105,479 instructions # 1.27 insn per cycle

0.101395093 seconds time elapsed

#

See? We had the counters enabled all the time.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
7fac83aa Tue Mar 16 15:18:35 MDT 2021 Song Liu <songliubraving@fb.com> perf stat: Introduce 'bperf' to share hardware PMCs with BPF

The perf tool uses performance monitoring counters (PMCs) to monitor
system performance. The PMCs are limited hardware resources. For
example, Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways: system
level monitoring, (maybe nested) container level monitoring, per process
monitoring, profiling (in sample mode), etc. In some cases, there are
more active perf_events than available hardware PMCs. To allow all
perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics
(cycles, instructions). It is a waste to have multiple tools create
multiple perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such wastes by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session to read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The
default path is /sys/fs/bpf/perf_attr_map. The user could change the
path with option --bpf-attr-map.

Committer testing:

# dmesg|grep "Performance Events" -A5
[ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 0.225280] ... version: 0
[ 0.225280] ... bit width: 48
[ 0.225281] ... generic registers: 6
[ 0.225281] ... value mask: 0000ffffffffffff
[ 0.225281] ... max period: 00007fffffffffff
#
# for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
[1] 2436231
[2] 2436232
[3] 2436233
[4] 2436234
[5] 2436235
[6] 2436236
# perf stat -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

310,326,987 cycles (41.87%)
236,143,290 instructions # 0.76 insn per cycle (41.87%)

0.100800885 seconds time elapsed

#

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

# for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
[1] 2436514
[2] 2436515
[3] 2436516
[4] 2436517
[5] 2436518
[6] 2436519
[7] 2436520
[8] 2436521
[9] 2436522
[10] 2436523
[11] 2436524
[12] 2436525
[13] 2436526
[14] 2436527
[15] 2436528
[16] 2436529
[17] 2436530
[18] 2436531
[19] 2436532
[20] 2436533
[21] 2436534
[22] 2436535
[23] 2436536
[24] 2436537
[25] 2436538
[26] 2436539
[27] 2436540
[28] 2436541
[29] 2436542
[30] 2436543
[31] 2436544
[32] 2436545
#
# ls -la /sys/fs/bpf/perf_attr_map
-rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
# bpftool map | grep bperf | wc -l
64
#

# bpftool map | tail
1265: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1266: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1267: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 996
pids perf(2436545)
1268: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1269: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1270: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 997
pids perf(2436541)
1285: array name pid_iter.rodata flags 0x480
key 4B value 4B max_entries 1 memlock 4096B
btf_id 1017 frozen
pids bpftool(2437504)
1286: array flags 0x0
key 4B value 32B max_entries 1 memlock 4096B
#
# bpftool map dump id 1268 | tail
value (CPU 21):
8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00
80 fd 2a d1 4d 00 00 00
value (CPU 22):
7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00
a4 8a 2e ee 4d 00 00 00
value (CPU 23):
a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00
b2 34 94 f6 4d 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00
20 c6 fc 83 4e 00 00 00
value (CPU 22):
9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00
3e 0c df 89 4e 00 00 00
value (CPU 23):
18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00
5b 69 ed 83 4e 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00
92 67 4c ba 4e 00 00 00
value (CPU 22):
dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00
d9 32 7a c5 4e 00 00 00
value (CPU 23):
bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00
7c 73 87 bf 4e 00 00 00
Found 1 element
#

# perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

119,410,122 cycles
152,105,479 instructions # 1.27 insn per cycle

0.101395093 seconds time elapsed

#

See? We had the counters enabled all the time.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
7fac83aa Tue Mar 16 15:18:35 MDT 2021 Song Liu <songliubraving@fb.com> perf stat: Introduce 'bperf' to share hardware PMCs with BPF

The perf tool uses performance monitoring counters (PMCs) to monitor
system performance. The PMCs are limited hardware resources. For
example, Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways: system
level monitoring, (maybe nested) container level monitoring, per process
monitoring, profiling (in sample mode), etc. In some cases, there are
more active perf_events than available hardware PMCs. To allow all
perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics
(cycles, instructions). It is a waste to have multiple tools create
multiple perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such wastes by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session to read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The
default path is /sys/fs/bpf/perf_attr_map. The user could change the
path with option --bpf-attr-map.

Committer testing:

# dmesg|grep "Performance Events" -A5
[ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 0.225280] ... version: 0
[ 0.225280] ... bit width: 48
[ 0.225281] ... generic registers: 6
[ 0.225281] ... value mask: 0000ffffffffffff
[ 0.225281] ... max period: 00007fffffffffff
#
# for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
[1] 2436231
[2] 2436232
[3] 2436233
[4] 2436234
[5] 2436235
[6] 2436236
# perf stat -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

310,326,987 cycles (41.87%)
236,143,290 instructions # 0.76 insn per cycle (41.87%)

0.100800885 seconds time elapsed

#

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

# for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
[1] 2436514
[2] 2436515
[3] 2436516
[4] 2436517
[5] 2436518
[6] 2436519
[7] 2436520
[8] 2436521
[9] 2436522
[10] 2436523
[11] 2436524
[12] 2436525
[13] 2436526
[14] 2436527
[15] 2436528
[16] 2436529
[17] 2436530
[18] 2436531
[19] 2436532
[20] 2436533
[21] 2436534
[22] 2436535
[23] 2436536
[24] 2436537
[25] 2436538
[26] 2436539
[27] 2436540
[28] 2436541
[29] 2436542
[30] 2436543
[31] 2436544
[32] 2436545
#
# ls -la /sys/fs/bpf/perf_attr_map
-rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
# bpftool map | grep bperf | wc -l
64
#

# bpftool map | tail
1265: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1266: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1267: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 996
pids perf(2436545)
1268: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1269: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1270: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 997
pids perf(2436541)
1285: array name pid_iter.rodata flags 0x480
key 4B value 4B max_entries 1 memlock 4096B
btf_id 1017 frozen
pids bpftool(2437504)
1286: array flags 0x0
key 4B value 32B max_entries 1 memlock 4096B
#
# bpftool map dump id 1268 | tail
value (CPU 21):
8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00
80 fd 2a d1 4d 00 00 00
value (CPU 22):
7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00
a4 8a 2e ee 4d 00 00 00
value (CPU 23):
a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00
b2 34 94 f6 4d 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00
20 c6 fc 83 4e 00 00 00
value (CPU 22):
9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00
3e 0c df 89 4e 00 00 00
value (CPU 23):
18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00
5b 69 ed 83 4e 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00
92 67 4c ba 4e 00 00 00
value (CPU 22):
dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00
d9 32 7a c5 4e 00 00 00
value (CPU 23):
bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00
7c 73 87 bf 4e 00 00 00
Found 1 element
#

# perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

119,410,122 cycles
152,105,479 instructions # 1.27 insn per cycle

0.101395093 seconds time elapsed

#

See? We had the counters enabled all the time.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
7fac83aa Tue Mar 16 15:18:35 MDT 2021 Song Liu <songliubraving@fb.com> perf stat: Introduce 'bperf' to share hardware PMCs with BPF

The perf tool uses performance monitoring counters (PMCs) to monitor
system performance. The PMCs are limited hardware resources. For
example, Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways: system
level monitoring, (maybe nested) container level monitoring, per process
monitoring, profiling (in sample mode), etc. In some cases, there are
more active perf_events than available hardware PMCs. To allow all
perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics
(cycles, instructions). It is a waste to have multiple tools create
multiple perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such wastes by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session to read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The
default path is /sys/fs/bpf/perf_attr_map. The user could change the
path with option --bpf-attr-map.

Committer testing:

# dmesg|grep "Performance Events" -A5
[ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 0.225280] ... version: 0
[ 0.225280] ... bit width: 48
[ 0.225281] ... generic registers: 6
[ 0.225281] ... value mask: 0000ffffffffffff
[ 0.225281] ... max period: 00007fffffffffff
#
# for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
[1] 2436231
[2] 2436232
[3] 2436233
[4] 2436234
[5] 2436235
[6] 2436236
# perf stat -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

310,326,987 cycles (41.87%)
236,143,290 instructions # 0.76 insn per cycle (41.87%)

0.100800885 seconds time elapsed

#

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

# for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
[1] 2436514
[2] 2436515
[3] 2436516
[4] 2436517
[5] 2436518
[6] 2436519
[7] 2436520
[8] 2436521
[9] 2436522
[10] 2436523
[11] 2436524
[12] 2436525
[13] 2436526
[14] 2436527
[15] 2436528
[16] 2436529
[17] 2436530
[18] 2436531
[19] 2436532
[20] 2436533
[21] 2436534
[22] 2436535
[23] 2436536
[24] 2436537
[25] 2436538
[26] 2436539
[27] 2436540
[28] 2436541
[29] 2436542
[30] 2436543
[31] 2436544
[32] 2436545
#
# ls -la /sys/fs/bpf/perf_attr_map
-rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
# bpftool map | grep bperf | wc -l
64
#

# bpftool map | tail
1265: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1266: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1267: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 996
pids perf(2436545)
1268: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1269: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1270: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 997
pids perf(2436541)
1285: array name pid_iter.rodata flags 0x480
key 4B value 4B max_entries 1 memlock 4096B
btf_id 1017 frozen
pids bpftool(2437504)
1286: array flags 0x0
key 4B value 32B max_entries 1 memlock 4096B
#
# bpftool map dump id 1268 | tail
value (CPU 21):
8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00
80 fd 2a d1 4d 00 00 00
value (CPU 22):
7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00
a4 8a 2e ee 4d 00 00 00
value (CPU 23):
a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00
b2 34 94 f6 4d 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00
20 c6 fc 83 4e 00 00 00
value (CPU 22):
9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00
3e 0c df 89 4e 00 00 00
value (CPU 23):
18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00
5b 69 ed 83 4e 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00
92 67 4c ba 4e 00 00 00
value (CPU 22):
dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00
d9 32 7a c5 4e 00 00 00
value (CPU 23):
bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00
7c 73 87 bf 4e 00 00 00
Found 1 element
#

# perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

119,410,122 cycles
152,105,479 instructions # 1.27 insn per cycle

0.101395093 seconds time elapsed

#

See? We had the counters enabled all the time.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
7fac83aa Tue Mar 16 15:18:35 MDT 2021 Song Liu <songliubraving@fb.com> perf stat: Introduce 'bperf' to share hardware PMCs with BPF

The perf tool uses performance monitoring counters (PMCs) to monitor
system performance. The PMCs are limited hardware resources. For
example, Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways: system
level monitoring, (maybe nested) container level monitoring, per process
monitoring, profiling (in sample mode), etc. In some cases, there are
more active perf_events than available hardware PMCs. To allow all
perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics
(cycles, instructions). It is a waste to have multiple tools create
multiple perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such wastes by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session to read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The
default path is /sys/fs/bpf/perf_attr_map. The user could change the
path with option --bpf-attr-map.

Committer testing:

# dmesg|grep "Performance Events" -A5
[ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 0.225280] ... version: 0
[ 0.225280] ... bit width: 48
[ 0.225281] ... generic registers: 6
[ 0.225281] ... value mask: 0000ffffffffffff
[ 0.225281] ... max period: 00007fffffffffff
#
# for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
[1] 2436231
[2] 2436232
[3] 2436233
[4] 2436234
[5] 2436235
[6] 2436236
# perf stat -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

310,326,987 cycles (41.87%)
236,143,290 instructions # 0.76 insn per cycle (41.87%)

0.100800885 seconds time elapsed

#

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

# for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
[1] 2436514
[2] 2436515
[3] 2436516
[4] 2436517
[5] 2436518
[6] 2436519
[7] 2436520
[8] 2436521
[9] 2436522
[10] 2436523
[11] 2436524
[12] 2436525
[13] 2436526
[14] 2436527
[15] 2436528
[16] 2436529
[17] 2436530
[18] 2436531
[19] 2436532
[20] 2436533
[21] 2436534
[22] 2436535
[23] 2436536
[24] 2436537
[25] 2436538
[26] 2436539
[27] 2436540
[28] 2436541
[29] 2436542
[30] 2436543
[31] 2436544
[32] 2436545
#
# ls -la /sys/fs/bpf/perf_attr_map
-rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
# bpftool map | grep bperf | wc -l
64
#

# bpftool map | tail
1265: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1266: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1267: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 996
pids perf(2436545)
1268: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1269: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1270: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 997
pids perf(2436541)
1285: array name pid_iter.rodata flags 0x480
key 4B value 4B max_entries 1 memlock 4096B
btf_id 1017 frozen
pids bpftool(2437504)
1286: array flags 0x0
key 4B value 32B max_entries 1 memlock 4096B
#
# bpftool map dump id 1268 | tail
value (CPU 21):
8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00
80 fd 2a d1 4d 00 00 00
value (CPU 22):
7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00
a4 8a 2e ee 4d 00 00 00
value (CPU 23):
a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00
b2 34 94 f6 4d 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00
20 c6 fc 83 4e 00 00 00
value (CPU 22):
9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00
3e 0c df 89 4e 00 00 00
value (CPU 23):
18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00
5b 69 ed 83 4e 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00
92 67 4c ba 4e 00 00 00
value (CPU 22):
dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00
d9 32 7a c5 4e 00 00 00
value (CPU 23):
bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00
7c 73 87 bf 4e 00 00 00
Found 1 element
#

# perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

119,410,122 cycles
152,105,479 instructions # 1.27 insn per cycle

0.101395093 seconds time elapsed

#

See? We had the counters enabled all the time.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
7fac83aa Tue Mar 16 15:18:35 MDT 2021 Song Liu <songliubraving@fb.com> perf stat: Introduce 'bperf' to share hardware PMCs with BPF

The perf tool uses performance monitoring counters (PMCs) to monitor
system performance. The PMCs are limited hardware resources. For
example, Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways: system
level monitoring, (maybe nested) container level monitoring, per process
monitoring, profiling (in sample mode), etc. In some cases, there are
more active perf_events than available hardware PMCs. To allow all
perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics
(cycles, instructions). It is a waste to have multiple tools create
multiple perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such wastes by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session to read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The
default path is /sys/fs/bpf/perf_attr_map. The user could change the
path with option --bpf-attr-map.

Committer testing:

# dmesg|grep "Performance Events" -A5
[ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 0.225280] ... version: 0
[ 0.225280] ... bit width: 48
[ 0.225281] ... generic registers: 6
[ 0.225281] ... value mask: 0000ffffffffffff
[ 0.225281] ... max period: 00007fffffffffff
#
# for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
[1] 2436231
[2] 2436232
[3] 2436233
[4] 2436234
[5] 2436235
[6] 2436236
# perf stat -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

310,326,987 cycles (41.87%)
236,143,290 instructions # 0.76 insn per cycle (41.87%)

0.100800885 seconds time elapsed

#

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

# for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
[1] 2436514
[2] 2436515
[3] 2436516
[4] 2436517
[5] 2436518
[6] 2436519
[7] 2436520
[8] 2436521
[9] 2436522
[10] 2436523
[11] 2436524
[12] 2436525
[13] 2436526
[14] 2436527
[15] 2436528
[16] 2436529
[17] 2436530
[18] 2436531
[19] 2436532
[20] 2436533
[21] 2436534
[22] 2436535
[23] 2436536
[24] 2436537
[25] 2436538
[26] 2436539
[27] 2436540
[28] 2436541
[29] 2436542
[30] 2436543
[31] 2436544
[32] 2436545
#
# ls -la /sys/fs/bpf/perf_attr_map
-rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
# bpftool map | grep bperf | wc -l
64
#

# bpftool map | tail
1265: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1266: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1267: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 996
pids perf(2436545)
1268: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1269: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1270: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 997
pids perf(2436541)
1285: array name pid_iter.rodata flags 0x480
key 4B value 4B max_entries 1 memlock 4096B
btf_id 1017 frozen
pids bpftool(2437504)
1286: array flags 0x0
key 4B value 32B max_entries 1 memlock 4096B
#
# bpftool map dump id 1268 | tail
value (CPU 21):
8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00
80 fd 2a d1 4d 00 00 00
value (CPU 22):
7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00
a4 8a 2e ee 4d 00 00 00
value (CPU 23):
a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00
b2 34 94 f6 4d 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00
20 c6 fc 83 4e 00 00 00
value (CPU 22):
9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00
3e 0c df 89 4e 00 00 00
value (CPU 23):
18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00
5b 69 ed 83 4e 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00
92 67 4c ba 4e 00 00 00
value (CPU 22):
dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00
d9 32 7a c5 4e 00 00 00
value (CPU 23):
bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00
7c 73 87 bf 4e 00 00 00
Found 1 element
#

# perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

119,410,122 cycles
152,105,479 instructions # 1.27 insn per cycle

0.101395093 seconds time elapsed

#

See? We had the counters enabled all the time.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
7fac83aa Tue Mar 16 15:18:35 MDT 2021 Song Liu <songliubraving@fb.com> perf stat: Introduce 'bperf' to share hardware PMCs with BPF

The perf tool uses performance monitoring counters (PMCs) to monitor
system performance. The PMCs are limited hardware resources. For
example, Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways: system
level monitoring, (maybe nested) container level monitoring, per process
monitoring, profiling (in sample mode), etc. In some cases, there are
more active perf_events than available hardware PMCs. To allow all
perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics
(cycles, instructions). It is a waste to have multiple tools create
multiple perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such wastes by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session to read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The
default path is /sys/fs/bpf/perf_attr_map. The user could change the
path with option --bpf-attr-map.

Committer testing:

# dmesg|grep "Performance Events" -A5
[ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 0.225280] ... version: 0
[ 0.225280] ... bit width: 48
[ 0.225281] ... generic registers: 6
[ 0.225281] ... value mask: 0000ffffffffffff
[ 0.225281] ... max period: 00007fffffffffff
#
# for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
[1] 2436231
[2] 2436232
[3] 2436233
[4] 2436234
[5] 2436235
[6] 2436236
# perf stat -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

310,326,987 cycles (41.87%)
236,143,290 instructions # 0.76 insn per cycle (41.87%)

0.100800885 seconds time elapsed

#

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

# for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
[1] 2436514
[2] 2436515
[3] 2436516
[4] 2436517
[5] 2436518
[6] 2436519
[7] 2436520
[8] 2436521
[9] 2436522
[10] 2436523
[11] 2436524
[12] 2436525
[13] 2436526
[14] 2436527
[15] 2436528
[16] 2436529
[17] 2436530
[18] 2436531
[19] 2436532
[20] 2436533
[21] 2436534
[22] 2436535
[23] 2436536
[24] 2436537
[25] 2436538
[26] 2436539
[27] 2436540
[28] 2436541
[29] 2436542
[30] 2436543
[31] 2436544
[32] 2436545
#
# ls -la /sys/fs/bpf/perf_attr_map
-rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
# bpftool map | grep bperf | wc -l
64
#

# bpftool map | tail
1265: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1266: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1267: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 996
pids perf(2436545)
1268: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1269: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1270: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 997
pids perf(2436541)
1285: array name pid_iter.rodata flags 0x480
key 4B value 4B max_entries 1 memlock 4096B
btf_id 1017 frozen
pids bpftool(2437504)
1286: array flags 0x0
key 4B value 32B max_entries 1 memlock 4096B
#
# bpftool map dump id 1268 | tail
value (CPU 21):
8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00
80 fd 2a d1 4d 00 00 00
value (CPU 22):
7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00
a4 8a 2e ee 4d 00 00 00
value (CPU 23):
a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00
b2 34 94 f6 4d 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00
20 c6 fc 83 4e 00 00 00
value (CPU 22):
9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00
3e 0c df 89 4e 00 00 00
value (CPU 23):
18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00
5b 69 ed 83 4e 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00
92 67 4c ba 4e 00 00 00
value (CPU 22):
dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00
d9 32 7a c5 4e 00 00 00
value (CPU 23):
bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00
7c 73 87 bf 4e 00 00 00
Found 1 element
#

# perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

119,410,122 cycles
152,105,479 instructions # 1.27 insn per cycle

0.101395093 seconds time elapsed

#

See? We had the counters enabled all the time.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
7fac83aa Tue Mar 16 15:18:35 MDT 2021 Song Liu <songliubraving@fb.com> perf stat: Introduce 'bperf' to share hardware PMCs with BPF

The perf tool uses performance monitoring counters (PMCs) to monitor
system performance. The PMCs are limited hardware resources. For
example, Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways: system
level monitoring, (maybe nested) container level monitoring, per process
monitoring, profiling (in sample mode), etc. In some cases, there are
more active perf_events than available hardware PMCs. To allow all
perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics
(cycles, instructions). It is a waste to have multiple tools create
multiple perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such wastes by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session to read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The
default path is /sys/fs/bpf/perf_attr_map. The user could change the
path with option --bpf-attr-map.

Committer testing:

# dmesg|grep "Performance Events" -A5
[ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[ 0.225280] ... version: 0
[ 0.225280] ... bit width: 48
[ 0.225281] ... generic registers: 6
[ 0.225281] ... value mask: 0000ffffffffffff
[ 0.225281] ... max period: 00007fffffffffff
#
# for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
[1] 2436231
[2] 2436232
[3] 2436233
[4] 2436234
[5] 2436235
[6] 2436236
# perf stat -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

310,326,987 cycles (41.87%)
236,143,290 instructions # 0.76 insn per cycle (41.87%)

0.100800885 seconds time elapsed

#

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

# for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
[1] 2436514
[2] 2436515
[3] 2436516
[4] 2436517
[5] 2436518
[6] 2436519
[7] 2436520
[8] 2436521
[9] 2436522
[10] 2436523
[11] 2436524
[12] 2436525
[13] 2436526
[14] 2436527
[15] 2436528
[16] 2436529
[17] 2436530
[18] 2436531
[19] 2436532
[20] 2436533
[21] 2436534
[22] 2436535
[23] 2436536
[24] 2436537
[25] 2436538
[26] 2436539
[27] 2436540
[28] 2436541
[29] 2436542
[30] 2436543
[31] 2436544
[32] 2436545
#
# ls -la /sys/fs/bpf/perf_attr_map
-rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
# bpftool map | grep bperf | wc -l
64
#

# bpftool map | tail
1265: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1266: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1267: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 996
pids perf(2436545)
1268: percpu_array name accum_readings flags 0x0
key 4B value 24B max_entries 1 memlock 4096B
1269: hash name filter flags 0x0
key 4B value 4B max_entries 1 memlock 4096B
1270: array name bperf_fo.bss flags 0x400
key 4B value 8B max_entries 1 memlock 4096B
btf_id 997
pids perf(2436541)
1285: array name pid_iter.rodata flags 0x480
key 4B value 4B max_entries 1 memlock 4096B
btf_id 1017 frozen
pids bpftool(2437504)
1286: array flags 0x0
key 4B value 32B max_entries 1 memlock 4096B
#
# bpftool map dump id 1268 | tail
value (CPU 21):
8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00
80 fd 2a d1 4d 00 00 00
value (CPU 22):
7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00
a4 8a 2e ee 4d 00 00 00
value (CPU 23):
a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00
b2 34 94 f6 4d 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00
20 c6 fc 83 4e 00 00 00
value (CPU 22):
9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00
3e 0c df 89 4e 00 00 00
value (CPU 23):
18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00
5b 69 ed 83 4e 00 00 00
Found 1 element
# bpftool map dump id 1268 | tail
value (CPU 21):
f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00
92 67 4c ba 4e 00 00 00
value (CPU 22):
dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00
d9 32 7a c5 4e 00 00 00
value (CPU 23):
bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00
7c 73 87 bf 4e 00 00 00
Found 1 element
#

# perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

Performance counter stats for 'system wide':

119,410,122 cycles
152,105,479 instructions # 1.27 insn per cycle

0.101395093 seconds time elapsed

#

See? We had the counters enabled all the time.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
/linux-master/drivers/gpu/drm/i915/gem/
H A D i915_gem_domain.c  diff 09f9b441 Wed Nov 30 16:58:01 MST 2022 Andi Shyti <andi.shyti@linux.intel.com> drm/i915: Limit the display memory alignment to 32 bit instead of 64

The coming commit "drm/i915: Introduce guard pages to i915_vma"
from Chris was originally changing display_alignment to u32
from u64. The reason is that the display GGTT is and will be
limited to 4GB.

Put it in a separate patch and use "max(...)" instead of
"max_t(u64, ...)" when assigning the value. We can safely use max
as we know beforehand that the comparison is between two u32
variables.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221130235805.221010-2-andi.shyti@linux.intel.com
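
To see the max() vs max_t() point in isolation: with both operands already u32, plain max() passes the type check in <linux/minmax.h> and there is nothing to widen. A hypothetical helper (not the actual i915 hunk) illustrating the change:

#include <linux/minmax.h>
#include <linux/types.h>

/* Sketch only: both inputs are u32, so plain max() is sufficient. */
static inline u32 pick_display_alignment(u32 cur, u32 req)
{
	return max(cur, req);	/* previously spelled max_t(u64, cur, req) */
}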
diff 068b1bd0 Wed Oct 27 10:18:13 MDT 2021 Matthew Auld <matthew.auld@intel.com> drm/i915: stop setting cache_dirty on discrete

Should not be needed. Even with non-coherent display, we should be using
device local-memory there, and not system memory.

v2: also add a warning in i915_gem_clflush_object

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> #v1
Link: https://patchwork.freedesktop.org/patch/msgid/20211027161813.3094681-4-matthew.auld@intel.com
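
A rough sketch of the warning mentioned in the v2 note; the exact condition, its placement inside i915_gem_clflush_object() and the helper choice are assumptions, not the actual hunk:

/* Discrete parts should never end up with a cache_dirty display object,
 * so make that assumption loudly visible (hypothetical form): */
struct drm_i915_private *i915 = to_i915(obj->base.dev);

drm_WARN_ON(&i915->drm, IS_DGFX(i915) && obj->cache_dirty);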
diff e11b7b6e Tue Jun 29 09:12:03 MDT 2021 Thomas Hellström <thomas.hellstrom@linux.intel.com> drm/i915/display: Migrate objects to LMEM if possible for display

Objects intended to be used as display framebuffers must reside in
LMEM for discrete. If they happen to not do that, migrate them to
LMEM before pinning.

Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210629151203.209465-4-thomas.hellstrom@linux.intel.com
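
The flow described above roughly amounts to the sketch below. Helper names and the simplified error handling are assumptions; the real code lives in the framebuffer pin path and runs under the ww locking dance:

/* Hypothetical "migrate to LMEM before pinning" step for discrete: */
if (HAS_LMEM(i915) && !i915_gem_object_is_lmem(obj)) {
	ret = i915_gem_object_migrate(obj, &ww, INTEL_REGION_LMEM);
	if (ret)
		return ret;
}
ret = i915_gem_object_pin_pages(obj);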
diff 47bf7b7a Mon Apr 20 06:53:55 MDT 2020 Chris Wilson <chris@chris-wilson.co.uk> drm/i915/gem: Remove object_is_locked assertion from unpin_from_display_plane

Since moving the obj->vma.list to a spin_lock, and the vm->bound_list to
its vm->mutex, along with tracking shrinkable status under its own
spinlock, we no longer require the object to be locked by the caller.

This is fortunate as it appears we can be called without the lock along an
error path in flipping:

<4> [139.942851] WARN_ON(debug_locks && !lock_is_held(&(&((obj)->base.resv)->lock.base)->dep_map))
<4> [139.943242] WARNING: CPU: 0 PID: 1203 at drivers/gpu/drm/i915/gem/i915_gem_domain.c:405 i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943263] Modules linked in: snd_hda_intel i915 vgem snd_hda_codec_realtek snd_hda_codec_generic coretemp snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core r8169 lpc_ich snd_pcm realtek prime_numbers [last unloaded: i915]
<4> [139.943347] CPU: 0 PID: 1203 Comm: kms_flip Tainted: G U 5.6.0-gd0fda5c2cf3f1-drmtip_474+ #1
<4> [139.943363] Hardware name: /D510MO, BIOS MOPNV10J.86A.0311.2010.0802.2346 08/02/2010
<4> [139.943589] RIP: 0010:i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943589] Code: 85 28 01 00 00 be ff ff ff ff 48 8d 78 60 e8 d7 9b f0 e2 85 c0 75 b9 48 c7 c6 50 b9 38 c0 48 c7 c7 e9 48 3c c0 e8 20 d4 e9 e2 <0f> 0b eb a2 48 c7 c1 08 bb 38 c0 ba 0a 01 00 00 48 c7 c6 88 a3 35
<4> [139.943589] RSP: 0018:ffffb774c0603b48 EFLAGS: 00010282
<4> [139.943589] RAX: 0000000000000000 RBX: ffff9a142fa36e80 RCX: 0000000000000006
<4> [139.943589] RDX: 000000000000160d RSI: ffff9a142c1a88f8 RDI: ffffffffa434a64d
<4> [139.943589] RBP: ffff9a1410a513c0 R08: ffff9a142c1a88f8 R09: 0000000000000000
<4> [139.943589] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9a1436ee94b8
<4> [139.943589] R13: 0000000000000001 R14: 00000000ffffffff R15: ffff9a1410960000
<4> [139.943589] FS: 00007fc73a744e40(0000) GS:ffff9a143da00000(0000) knlGS:0000000000000000
<4> [139.943589] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [139.943589] CR2: 00007fc73997e098 CR3: 000000002f5fe000 CR4: 00000000000006f0
<4> [139.943589] Call Trace:
<4> [139.943589] intel_pin_and_fence_fb_obj+0x1c9/0x1f0 [i915]
<4> [139.943589] intel_plane_pin_fb+0x3f/0xd0 [i915]
<4> [139.943589] intel_prepare_plane_fb+0x13b/0x5c0 [i915]
<4> [139.943589] drm_atomic_helper_prepare_planes+0x85/0x110
<4> [139.943589] intel_atomic_commit+0xda/0x390 [i915]
<4> [139.943589] drm_atomic_helper_page_flip+0x9c/0xd0
<4> [139.943589] ? drm_event_reserve_init+0x46/0x60
<4> [139.943589] drm_mode_page_flip_ioctl+0x587/0x5d0

This completes the symmetry lost in commit 8b1c78e06e61 ("drm/i915: Avoid
calling i915_gem_object_unbind holding object lock").

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1743
Fixes: 8b1c78e06e61 ("drm/i915: Avoid calling i915_gem_object_unbind holding object lock")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Andi Shyti <andi.shyti@intel.com>
Cc: <stable@vger.kernel.org> # v5.6+
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200420125356.26614-1-chris@chris-wilson.co.uk
(cherry picked from commit a95f3ac21d64d62c746f836598d1467d5837fa28)
(cherry picked from commit 2208b85fa1766ee4821a9435d548578b67090531)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
diff 47bf7b7a Mon Apr 20 06:53:55 MDT 2020 Chris Wilson <chris@chris-wilson.co.uk> drm/i915/gem: Remove object_is_locked assertion from unpin_from_display_plane

Since moving the obj->vma.list to a spin_lock, and the vm->bound_list to
its vm->mutex, along with tracking shrinkable status under its own
spinlock, we no long require the object to be locked by the caller.

This is fortunate as it appears we can be called with the lock along an
error path in flipping:

<4> [139.942851] WARN_ON(debug_locks && !lock_is_held(&(&((obj)->base.resv)->lock.base)->dep_map))
<4> [139.943242] WARNING: CPU: 0 PID: 1203 at drivers/gpu/drm/i915/gem/i915_gem_domain.c:405 i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943263] Modules linked in: snd_hda_intel i915 vgem snd_hda_codec_realtek snd_hda_codec_generic coretemp snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core r8169 lpc_ich snd_pcm realtek prime_numbers [last unloaded: i915]
<4> [139.943347] CPU: 0 PID: 1203 Comm: kms_flip Tainted: G U 5.6.0-gd0fda5c2cf3f1-drmtip_474+ #1
<4> [139.943363] Hardware name: /D510MO, BIOS MOPNV10J.86A.0311.2010.0802.2346 08/02/2010
<4> [139.943589] RIP: 0010:i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943589] Code: 85 28 01 00 00 be ff ff ff ff 48 8d 78 60 e8 d7 9b f0 e2 85 c0 75 b9 48 c7 c6 50 b9 38 c0 48 c7 c7 e9 48 3c c0 e8 20 d4 e9 e2 <0f> 0b eb a2 48 c7 c1 08 bb 38 c0 ba 0a 01 00 00 48 c7 c6 88 a3 35
<4> [139.943589] RSP: 0018:ffffb774c0603b48 EFLAGS: 00010282
<4> [139.943589] RAX: 0000000000000000 RBX: ffff9a142fa36e80 RCX: 0000000000000006
<4> [139.943589] RDX: 000000000000160d RSI: ffff9a142c1a88f8 RDI: ffffffffa434a64d
<4> [139.943589] RBP: ffff9a1410a513c0 R08: ffff9a142c1a88f8 R09: 0000000000000000
<4> [139.943589] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9a1436ee94b8
<4> [139.943589] R13: 0000000000000001 R14: 00000000ffffffff R15: ffff9a1410960000
<4> [139.943589] FS: 00007fc73a744e40(0000) GS:ffff9a143da00000(0000) knlGS:0000000000000000
<4> [139.943589] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [139.943589] CR2: 00007fc73997e098 CR3: 000000002f5fe000 CR4: 00000000000006f0
<4> [139.943589] Call Trace:
<4> [139.943589] intel_pin_and_fence_fb_obj+0x1c9/0x1f0 [i915]
<4> [139.943589] intel_plane_pin_fb+0x3f/0xd0 [i915]
<4> [139.943589] intel_prepare_plane_fb+0x13b/0x5c0 [i915]
<4> [139.943589] drm_atomic_helper_prepare_planes+0x85/0x110
<4> [139.943589] intel_atomic_commit+0xda/0x390 [i915]
<4> [139.943589] drm_atomic_helper_page_flip+0x9c/0xd0
<4> [139.943589] ? drm_event_reserve_init+0x46/0x60
<4> [139.943589] drm_mode_page_flip_ioctl+0x587/0x5d0

This completes the symmetry lost in commit 8b1c78e06e61 ("drm/i915: Avoid
calling i915_gem_object_unbind holding object lock").

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1743
Fixes: 8b1c78e06e61 ("drm/i915: Avoid calling i915_gem_object_unbind holding object lock")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Andi Shyti <andi.shyti@intel.com>
Cc: <stable@vger.kernel.org> # v5.6+
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200420125356.26614-1-chris@chris-wilson.co.uk
(cherry picked from commit a95f3ac21d64d62c746f836598d1467d5837fa28)
(cherry picked from commit 2208b85fa1766ee4821a9435d548578b67090531)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
diff 47bf7b7a Mon Apr 20 06:53:55 MDT 2020 Chris Wilson <chris@chris-wilson.co.uk> drm/i915/gem: Remove object_is_locked assertion from unpin_from_display_plane

Since moving the obj->vma.list to a spin_lock, and the vm->bound_list to
its vm->mutex, along with tracking shrinkable status under its own
spinlock, we no long require the object to be locked by the caller.

This is fortunate as it appears we can be called with the lock along an
error path in flipping:

<4> [139.942851] WARN_ON(debug_locks && !lock_is_held(&(&((obj)->base.resv)->lock.base)->dep_map))
<4> [139.943242] WARNING: CPU: 0 PID: 1203 at drivers/gpu/drm/i915/gem/i915_gem_domain.c:405 i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943263] Modules linked in: snd_hda_intel i915 vgem snd_hda_codec_realtek snd_hda_codec_generic coretemp snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core r8169 lpc_ich snd_pcm realtek prime_numbers [last unloaded: i915]
<4> [139.943347] CPU: 0 PID: 1203 Comm: kms_flip Tainted: G U 5.6.0-gd0fda5c2cf3f1-drmtip_474+ #1
<4> [139.943363] Hardware name: /D510MO, BIOS MOPNV10J.86A.0311.2010.0802.2346 08/02/2010
<4> [139.943589] RIP: 0010:i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943589] Code: 85 28 01 00 00 be ff ff ff ff 48 8d 78 60 e8 d7 9b f0 e2 85 c0 75 b9 48 c7 c6 50 b9 38 c0 48 c7 c7 e9 48 3c c0 e8 20 d4 e9 e2 <0f> 0b eb a2 48 c7 c1 08 bb 38 c0 ba 0a 01 00 00 48 c7 c6 88 a3 35
<4> [139.943589] RSP: 0018:ffffb774c0603b48 EFLAGS: 00010282
<4> [139.943589] RAX: 0000000000000000 RBX: ffff9a142fa36e80 RCX: 0000000000000006
<4> [139.943589] RDX: 000000000000160d RSI: ffff9a142c1a88f8 RDI: ffffffffa434a64d
<4> [139.943589] RBP: ffff9a1410a513c0 R08: ffff9a142c1a88f8 R09: 0000000000000000
<4> [139.943589] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9a1436ee94b8
<4> [139.943589] R13: 0000000000000001 R14: 00000000ffffffff R15: ffff9a1410960000
<4> [139.943589] FS: 00007fc73a744e40(0000) GS:ffff9a143da00000(0000) knlGS:0000000000000000
<4> [139.943589] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [139.943589] CR2: 00007fc73997e098 CR3: 000000002f5fe000 CR4: 00000000000006f0
<4> [139.943589] Call Trace:
<4> [139.943589] intel_pin_and_fence_fb_obj+0x1c9/0x1f0 [i915]
<4> [139.943589] intel_plane_pin_fb+0x3f/0xd0 [i915]
<4> [139.943589] intel_prepare_plane_fb+0x13b/0x5c0 [i915]
<4> [139.943589] drm_atomic_helper_prepare_planes+0x85/0x110
<4> [139.943589] intel_atomic_commit+0xda/0x390 [i915]
<4> [139.943589] drm_atomic_helper_page_flip+0x9c/0xd0
<4> [139.943589] ? drm_event_reserve_init+0x46/0x60
<4> [139.943589] drm_mode_page_flip_ioctl+0x587/0x5d0

This completes the symmetry lost in commit 8b1c78e06e61 ("drm/i915: Avoid
calling i915_gem_object_unbind holding object lock").

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1743
Fixes: 8b1c78e06e61 ("drm/i915: Avoid calling i915_gem_object_unbind holding object lock")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Andi Shyti <andi.shyti@intel.com>
Cc: <stable@vger.kernel.org> # v5.6+
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200420125356.26614-1-chris@chris-wilson.co.uk
(cherry picked from commit a95f3ac21d64d62c746f836598d1467d5837fa28)
(cherry picked from commit 2208b85fa1766ee4821a9435d548578b67090531)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
diff 47bf7b7a Mon Apr 20 06:53:55 MDT 2020 Chris Wilson <chris@chris-wilson.co.uk> drm/i915/gem: Remove object_is_locked assertion from unpin_from_display_plane

Since moving the obj->vma.list to a spin_lock, and the vm->bound_list to
its vm->mutex, along with tracking shrinkable status under its own
spinlock, we no long require the object to be locked by the caller.

This is fortunate as it appears we can be called with the lock along an
error path in flipping:

<4> [139.942851] WARN_ON(debug_locks && !lock_is_held(&(&((obj)->base.resv)->lock.base)->dep_map))
<4> [139.943242] WARNING: CPU: 0 PID: 1203 at drivers/gpu/drm/i915/gem/i915_gem_domain.c:405 i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943263] Modules linked in: snd_hda_intel i915 vgem snd_hda_codec_realtek snd_hda_codec_generic coretemp snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core r8169 lpc_ich snd_pcm realtek prime_numbers [last unloaded: i915]
<4> [139.943347] CPU: 0 PID: 1203 Comm: kms_flip Tainted: G U 5.6.0-gd0fda5c2cf3f1-drmtip_474+ #1
<4> [139.943363] Hardware name: /D510MO, BIOS MOPNV10J.86A.0311.2010.0802.2346 08/02/2010
<4> [139.943589] RIP: 0010:i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943589] Code: 85 28 01 00 00 be ff ff ff ff 48 8d 78 60 e8 d7 9b f0 e2 85 c0 75 b9 48 c7 c6 50 b9 38 c0 48 c7 c7 e9 48 3c c0 e8 20 d4 e9 e2 <0f> 0b eb a2 48 c7 c1 08 bb 38 c0 ba 0a 01 00 00 48 c7 c6 88 a3 35
<4> [139.943589] RSP: 0018:ffffb774c0603b48 EFLAGS: 00010282
<4> [139.943589] RAX: 0000000000000000 RBX: ffff9a142fa36e80 RCX: 0000000000000006
<4> [139.943589] RDX: 000000000000160d RSI: ffff9a142c1a88f8 RDI: ffffffffa434a64d
<4> [139.943589] RBP: ffff9a1410a513c0 R08: ffff9a142c1a88f8 R09: 0000000000000000
<4> [139.943589] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9a1436ee94b8
<4> [139.943589] R13: 0000000000000001 R14: 00000000ffffffff R15: ffff9a1410960000
<4> [139.943589] FS: 00007fc73a744e40(0000) GS:ffff9a143da00000(0000) knlGS:0000000000000000
<4> [139.943589] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [139.943589] CR2: 00007fc73997e098 CR3: 000000002f5fe000 CR4: 00000000000006f0
<4> [139.943589] Call Trace:
<4> [139.943589] intel_pin_and_fence_fb_obj+0x1c9/0x1f0 [i915]
<4> [139.943589] intel_plane_pin_fb+0x3f/0xd0 [i915]
<4> [139.943589] intel_prepare_plane_fb+0x13b/0x5c0 [i915]
<4> [139.943589] drm_atomic_helper_prepare_planes+0x85/0x110
<4> [139.943589] intel_atomic_commit+0xda/0x390 [i915]
<4> [139.943589] drm_atomic_helper_page_flip+0x9c/0xd0
<4> [139.943589] ? drm_event_reserve_init+0x46/0x60
<4> [139.943589] drm_mode_page_flip_ioctl+0x587/0x5d0

This completes the symmetry lost in commit 8b1c78e06e61 ("drm/i915: Avoid
calling i915_gem_object_unbind holding object lock").

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1743
Fixes: 8b1c78e06e61 ("drm/i915: Avoid calling i915_gem_object_unbind holding object lock")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Andi Shyti <andi.shyti@intel.com>
Cc: <stable@vger.kernel.org> # v5.6+
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200420125356.26614-1-chris@chris-wilson.co.uk
(cherry picked from commit a95f3ac21d64d62c746f836598d1467d5837fa28)
(cherry picked from commit 2208b85fa1766ee4821a9435d548578b67090531)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
diff 47bf7b7a Mon Apr 20 06:53:55 MDT 2020 Chris Wilson <chris@chris-wilson.co.uk> drm/i915/gem: Remove object_is_locked assertion from unpin_from_display_plane

Since moving the obj->vma.list to a spin_lock, and the vm->bound_list to
its vm->mutex, along with tracking shrinkable status under its own
spinlock, we no long require the object to be locked by the caller.

This is fortunate as it appears we can be called with the lock along an
error path in flipping:

<4> [139.942851] WARN_ON(debug_locks && !lock_is_held(&(&((obj)->base.resv)->lock.base)->dep_map))
<4> [139.943242] WARNING: CPU: 0 PID: 1203 at drivers/gpu/drm/i915/gem/i915_gem_domain.c:405 i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943263] Modules linked in: snd_hda_intel i915 vgem snd_hda_codec_realtek snd_hda_codec_generic coretemp snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core r8169 lpc_ich snd_pcm realtek prime_numbers [last unloaded: i915]
<4> [139.943347] CPU: 0 PID: 1203 Comm: kms_flip Tainted: G U 5.6.0-gd0fda5c2cf3f1-drmtip_474+ #1
<4> [139.943363] Hardware name: /D510MO, BIOS MOPNV10J.86A.0311.2010.0802.2346 08/02/2010
<4> [139.943589] RIP: 0010:i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943589] Code: 85 28 01 00 00 be ff ff ff ff 48 8d 78 60 e8 d7 9b f0 e2 85 c0 75 b9 48 c7 c6 50 b9 38 c0 48 c7 c7 e9 48 3c c0 e8 20 d4 e9 e2 <0f> 0b eb a2 48 c7 c1 08 bb 38 c0 ba 0a 01 00 00 48 c7 c6 88 a3 35
<4> [139.943589] RSP: 0018:ffffb774c0603b48 EFLAGS: 00010282
<4> [139.943589] RAX: 0000000000000000 RBX: ffff9a142fa36e80 RCX: 0000000000000006
<4> [139.943589] RDX: 000000000000160d RSI: ffff9a142c1a88f8 RDI: ffffffffa434a64d
<4> [139.943589] RBP: ffff9a1410a513c0 R08: ffff9a142c1a88f8 R09: 0000000000000000
<4> [139.943589] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9a1436ee94b8
<4> [139.943589] R13: 0000000000000001 R14: 00000000ffffffff R15: ffff9a1410960000
<4> [139.943589] FS: 00007fc73a744e40(0000) GS:ffff9a143da00000(0000) knlGS:0000000000000000
<4> [139.943589] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [139.943589] CR2: 00007fc73997e098 CR3: 000000002f5fe000 CR4: 00000000000006f0
<4> [139.943589] Call Trace:
<4> [139.943589] intel_pin_and_fence_fb_obj+0x1c9/0x1f0 [i915]
<4> [139.943589] intel_plane_pin_fb+0x3f/0xd0 [i915]
<4> [139.943589] intel_prepare_plane_fb+0x13b/0x5c0 [i915]
<4> [139.943589] drm_atomic_helper_prepare_planes+0x85/0x110
<4> [139.943589] intel_atomic_commit+0xda/0x390 [i915]
<4> [139.943589] drm_atomic_helper_page_flip+0x9c/0xd0
<4> [139.943589] ? drm_event_reserve_init+0x46/0x60
<4> [139.943589] drm_mode_page_flip_ioctl+0x587/0x5d0

This completes the symmetry lost in commit 8b1c78e06e61 ("drm/i915: Avoid
calling i915_gem_object_unbind holding object lock").

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1743
Fixes: 8b1c78e06e61 ("drm/i915: Avoid calling i915_gem_object_unbind holding object lock")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Andi Shyti <andi.shyti@intel.com>
Cc: <stable@vger.kernel.org> # v5.6+
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200420125356.26614-1-chris@chris-wilson.co.uk
(cherry picked from commit a95f3ac21d64d62c746f836598d1467d5837fa28)
(cherry picked from commit 2208b85fa1766ee4821a9435d548578b67090531)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
diff 47bf7b7a Mon Apr 20 06:53:55 MDT 2020 Chris Wilson <chris@chris-wilson.co.uk> drm/i915/gem: Remove object_is_locked assertion from unpin_from_display_plane

Since moving the obj->vma.list to a spin_lock, and the vm->bound_list to
its vm->mutex, along with tracking shrinkable status under its own
spinlock, we no long require the object to be locked by the caller.

This is fortunate as it appears we can be called with the lock along an
error path in flipping:

<4> [139.942851] WARN_ON(debug_locks && !lock_is_held(&(&((obj)->base.resv)->lock.base)->dep_map))
<4> [139.943242] WARNING: CPU: 0 PID: 1203 at drivers/gpu/drm/i915/gem/i915_gem_domain.c:405 i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943263] Modules linked in: snd_hda_intel i915 vgem snd_hda_codec_realtek snd_hda_codec_generic coretemp snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core r8169 lpc_ich snd_pcm realtek prime_numbers [last unloaded: i915]
<4> [139.943347] CPU: 0 PID: 1203 Comm: kms_flip Tainted: G U 5.6.0-gd0fda5c2cf3f1-drmtip_474+ #1
<4> [139.943363] Hardware name: /D510MO, BIOS MOPNV10J.86A.0311.2010.0802.2346 08/02/2010
<4> [139.943589] RIP: 0010:i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943589] Code: 85 28 01 00 00 be ff ff ff ff 48 8d 78 60 e8 d7 9b f0 e2 85 c0 75 b9 48 c7 c6 50 b9 38 c0 48 c7 c7 e9 48 3c c0 e8 20 d4 e9 e2 <0f> 0b eb a2 48 c7 c1 08 bb 38 c0 ba 0a 01 00 00 48 c7 c6 88 a3 35
<4> [139.943589] RSP: 0018:ffffb774c0603b48 EFLAGS: 00010282
<4> [139.943589] RAX: 0000000000000000 RBX: ffff9a142fa36e80 RCX: 0000000000000006
<4> [139.943589] RDX: 000000000000160d RSI: ffff9a142c1a88f8 RDI: ffffffffa434a64d
<4> [139.943589] RBP: ffff9a1410a513c0 R08: ffff9a142c1a88f8 R09: 0000000000000000
<4> [139.943589] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9a1436ee94b8
<4> [139.943589] R13: 0000000000000001 R14: 00000000ffffffff R15: ffff9a1410960000
<4> [139.943589] FS: 00007fc73a744e40(0000) GS:ffff9a143da00000(0000) knlGS:0000000000000000
<4> [139.943589] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [139.943589] CR2: 00007fc73997e098 CR3: 000000002f5fe000 CR4: 00000000000006f0
<4> [139.943589] Call Trace:
<4> [139.943589] intel_pin_and_fence_fb_obj+0x1c9/0x1f0 [i915]
<4> [139.943589] intel_plane_pin_fb+0x3f/0xd0 [i915]
<4> [139.943589] intel_prepare_plane_fb+0x13b/0x5c0 [i915]
<4> [139.943589] drm_atomic_helper_prepare_planes+0x85/0x110
<4> [139.943589] intel_atomic_commit+0xda/0x390 [i915]
<4> [139.943589] drm_atomic_helper_page_flip+0x9c/0xd0
<4> [139.943589] ? drm_event_reserve_init+0x46/0x60
<4> [139.943589] drm_mode_page_flip_ioctl+0x587/0x5d0

This completes the symmetry lost in commit 8b1c78e06e61 ("drm/i915: Avoid
calling i915_gem_object_unbind holding object lock").

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1743
Fixes: 8b1c78e06e61 ("drm/i915: Avoid calling i915_gem_object_unbind holding object lock")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Andi Shyti <andi.shyti@intel.com>
Cc: <stable@vger.kernel.org> # v5.6+
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200420125356.26614-1-chris@chris-wilson.co.uk
(cherry picked from commit a95f3ac21d64d62c746f836598d1467d5837fa28)
(cherry picked from commit 2208b85fa1766ee4821a9435d548578b67090531)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
diff 47bf7b7a Mon Apr 20 06:53:55 MDT 2020 Chris Wilson <chris@chris-wilson.co.uk> drm/i915/gem: Remove object_is_locked assertion from unpin_from_display_plane

Since moving the obj->vma.list to a spin_lock, and the vm->bound_list to
its vm->mutex, along with tracking shrinkable status under its own
spinlock, we no long require the object to be locked by the caller.

This is fortunate as it appears we can be called with the lock along an
error path in flipping:

<4> [139.942851] WARN_ON(debug_locks && !lock_is_held(&(&((obj)->base.resv)->lock.base)->dep_map))
<4> [139.943242] WARNING: CPU: 0 PID: 1203 at drivers/gpu/drm/i915/gem/i915_gem_domain.c:405 i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943263] Modules linked in: snd_hda_intel i915 vgem snd_hda_codec_realtek snd_hda_codec_generic coretemp snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core r8169 lpc_ich snd_pcm realtek prime_numbers [last unloaded: i915]
<4> [139.943347] CPU: 0 PID: 1203 Comm: kms_flip Tainted: G U 5.6.0-gd0fda5c2cf3f1-drmtip_474+ #1
<4> [139.943363] Hardware name: /D510MO, BIOS MOPNV10J.86A.0311.2010.0802.2346 08/02/2010
<4> [139.943589] RIP: 0010:i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943589] Code: 85 28 01 00 00 be ff ff ff ff 48 8d 78 60 e8 d7 9b f0 e2 85 c0 75 b9 48 c7 c6 50 b9 38 c0 48 c7 c7 e9 48 3c c0 e8 20 d4 e9 e2 <0f> 0b eb a2 48 c7 c1 08 bb 38 c0 ba 0a 01 00 00 48 c7 c6 88 a3 35
<4> [139.943589] RSP: 0018:ffffb774c0603b48 EFLAGS: 00010282
<4> [139.943589] RAX: 0000000000000000 RBX: ffff9a142fa36e80 RCX: 0000000000000006
<4> [139.943589] RDX: 000000000000160d RSI: ffff9a142c1a88f8 RDI: ffffffffa434a64d
<4> [139.943589] RBP: ffff9a1410a513c0 R08: ffff9a142c1a88f8 R09: 0000000000000000
<4> [139.943589] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9a1436ee94b8
<4> [139.943589] R13: 0000000000000001 R14: 00000000ffffffff R15: ffff9a1410960000
<4> [139.943589] FS: 00007fc73a744e40(0000) GS:ffff9a143da00000(0000) knlGS:0000000000000000
<4> [139.943589] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [139.943589] CR2: 00007fc73997e098 CR3: 000000002f5fe000 CR4: 00000000000006f0
<4> [139.943589] Call Trace:
<4> [139.943589] intel_pin_and_fence_fb_obj+0x1c9/0x1f0 [i915]
<4> [139.943589] intel_plane_pin_fb+0x3f/0xd0 [i915]
<4> [139.943589] intel_prepare_plane_fb+0x13b/0x5c0 [i915]
<4> [139.943589] drm_atomic_helper_prepare_planes+0x85/0x110
<4> [139.943589] intel_atomic_commit+0xda/0x390 [i915]
<4> [139.943589] drm_atomic_helper_page_flip+0x9c/0xd0
<4> [139.943589] ? drm_event_reserve_init+0x46/0x60
<4> [139.943589] drm_mode_page_flip_ioctl+0x587/0x5d0

This completes the symmetry lost in commit 8b1c78e06e61 ("drm/i915: Avoid
calling i915_gem_object_unbind holding object lock").

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1743
Fixes: 8b1c78e06e61 ("drm/i915: Avoid calling i915_gem_object_unbind holding object lock")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Andi Shyti <andi.shyti@intel.com>
Cc: <stable@vger.kernel.org> # v5.6+
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200420125356.26614-1-chris@chris-wilson.co.uk
(cherry picked from commit a95f3ac21d64d62c746f836598d1467d5837fa28)
(cherry picked from commit 2208b85fa1766ee4821a9435d548578b67090531)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
diff 47bf7b7a Mon Apr 20 06:53:55 MDT 2020 Chris Wilson <chris@chris-wilson.co.uk> drm/i915/gem: Remove object_is_locked assertion from unpin_from_display_plane

Since moving the obj->vma.list to a spin_lock, and the vm->bound_list to
its vm->mutex, along with tracking shrinkable status under its own
spinlock, we no long require the object to be locked by the caller.

This is fortunate as it appears we can be called with the lock along an
error path in flipping:

<4> [139.942851] WARN_ON(debug_locks && !lock_is_held(&(&((obj)->base.resv)->lock.base)->dep_map))
<4> [139.943242] WARNING: CPU: 0 PID: 1203 at drivers/gpu/drm/i915/gem/i915_gem_domain.c:405 i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943263] Modules linked in: snd_hda_intel i915 vgem snd_hda_codec_realtek snd_hda_codec_generic coretemp snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core r8169 lpc_ich snd_pcm realtek prime_numbers [last unloaded: i915]
<4> [139.943347] CPU: 0 PID: 1203 Comm: kms_flip Tainted: G U 5.6.0-gd0fda5c2cf3f1-drmtip_474+ #1
<4> [139.943363] Hardware name: /D510MO, BIOS MOPNV10J.86A.0311.2010.0802.2346 08/02/2010
<4> [139.943589] RIP: 0010:i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943589] Code: 85 28 01 00 00 be ff ff ff ff 48 8d 78 60 e8 d7 9b f0 e2 85 c0 75 b9 48 c7 c6 50 b9 38 c0 48 c7 c7 e9 48 3c c0 e8 20 d4 e9 e2 <0f> 0b eb a2 48 c7 c1 08 bb 38 c0 ba 0a 01 00 00 48 c7 c6 88 a3 35
<4> [139.943589] RSP: 0018:ffffb774c0603b48 EFLAGS: 00010282
<4> [139.943589] RAX: 0000000000000000 RBX: ffff9a142fa36e80 RCX: 0000000000000006
<4> [139.943589] RDX: 000000000000160d RSI: ffff9a142c1a88f8 RDI: ffffffffa434a64d
<4> [139.943589] RBP: ffff9a1410a513c0 R08: ffff9a142c1a88f8 R09: 0000000000000000
<4> [139.943589] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9a1436ee94b8
<4> [139.943589] R13: 0000000000000001 R14: 00000000ffffffff R15: ffff9a1410960000
<4> [139.943589] FS: 00007fc73a744e40(0000) GS:ffff9a143da00000(0000) knlGS:0000000000000000
<4> [139.943589] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [139.943589] CR2: 00007fc73997e098 CR3: 000000002f5fe000 CR4: 00000000000006f0
<4> [139.943589] Call Trace:
<4> [139.943589] intel_pin_and_fence_fb_obj+0x1c9/0x1f0 [i915]
<4> [139.943589] intel_plane_pin_fb+0x3f/0xd0 [i915]
<4> [139.943589] intel_prepare_plane_fb+0x13b/0x5c0 [i915]
<4> [139.943589] drm_atomic_helper_prepare_planes+0x85/0x110
<4> [139.943589] intel_atomic_commit+0xda/0x390 [i915]
<4> [139.943589] drm_atomic_helper_page_flip+0x9c/0xd0
<4> [139.943589] ? drm_event_reserve_init+0x46/0x60
<4> [139.943589] drm_mode_page_flip_ioctl+0x587/0x5d0

This completes the symmetry lost in commit 8b1c78e06e61 ("drm/i915: Avoid
calling i915_gem_object_unbind holding object lock").

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1743
Fixes: 8b1c78e06e61 ("drm/i915: Avoid calling i915_gem_object_unbind holding object lock")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Andi Shyti <andi.shyti@intel.com>
Cc: <stable@vger.kernel.org> # v5.6+
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200420125356.26614-1-chris@chris-wilson.co.uk
(cherry picked from commit a95f3ac21d64d62c746f836598d1467d5837fa28)
(cherry picked from commit 2208b85fa1766ee4821a9435d548578b67090531)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
diff 47bf7b7a Mon Apr 20 06:53:55 MDT 2020 Chris Wilson <chris@chris-wilson.co.uk> drm/i915/gem: Remove object_is_locked assertion from unpin_from_display_plane

Since moving the obj->vma.list to a spin_lock, and the vm->bound_list to
its vm->mutex, along with tracking shrinkable status under its own
spinlock, we no long require the object to be locked by the caller.

This is fortunate as it appears we can be called with the lock along an
error path in flipping:

<4> [139.942851] WARN_ON(debug_locks && !lock_is_held(&(&((obj)->base.resv)->lock.base)->dep_map))
<4> [139.943242] WARNING: CPU: 0 PID: 1203 at drivers/gpu/drm/i915/gem/i915_gem_domain.c:405 i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943263] Modules linked in: snd_hda_intel i915 vgem snd_hda_codec_realtek snd_hda_codec_generic coretemp snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core r8169 lpc_ich snd_pcm realtek prime_numbers [last unloaded: i915]
<4> [139.943347] CPU: 0 PID: 1203 Comm: kms_flip Tainted: G U 5.6.0-gd0fda5c2cf3f1-drmtip_474+ #1
<4> [139.943363] Hardware name: /D510MO, BIOS MOPNV10J.86A.0311.2010.0802.2346 08/02/2010
<4> [139.943589] RIP: 0010:i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943589] Code: 85 28 01 00 00 be ff ff ff ff 48 8d 78 60 e8 d7 9b f0 e2 85 c0 75 b9 48 c7 c6 50 b9 38 c0 48 c7 c7 e9 48 3c c0 e8 20 d4 e9 e2 <0f> 0b eb a2 48 c7 c1 08 bb 38 c0 ba 0a 01 00 00 48 c7 c6 88 a3 35
<4> [139.943589] RSP: 0018:ffffb774c0603b48 EFLAGS: 00010282
<4> [139.943589] RAX: 0000000000000000 RBX: ffff9a142fa36e80 RCX: 0000000000000006
<4> [139.943589] RDX: 000000000000160d RSI: ffff9a142c1a88f8 RDI: ffffffffa434a64d
<4> [139.943589] RBP: ffff9a1410a513c0 R08: ffff9a142c1a88f8 R09: 0000000000000000
<4> [139.943589] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9a1436ee94b8
<4> [139.943589] R13: 0000000000000001 R14: 00000000ffffffff R15: ffff9a1410960000
<4> [139.943589] FS: 00007fc73a744e40(0000) GS:ffff9a143da00000(0000) knlGS:0000000000000000
<4> [139.943589] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [139.943589] CR2: 00007fc73997e098 CR3: 000000002f5fe000 CR4: 00000000000006f0
<4> [139.943589] Call Trace:
<4> [139.943589] intel_pin_and_fence_fb_obj+0x1c9/0x1f0 [i915]
<4> [139.943589] intel_plane_pin_fb+0x3f/0xd0 [i915]
<4> [139.943589] intel_prepare_plane_fb+0x13b/0x5c0 [i915]
<4> [139.943589] drm_atomic_helper_prepare_planes+0x85/0x110
<4> [139.943589] intel_atomic_commit+0xda/0x390 [i915]
<4> [139.943589] drm_atomic_helper_page_flip+0x9c/0xd0
<4> [139.943589] ? drm_event_reserve_init+0x46/0x60
<4> [139.943589] drm_mode_page_flip_ioctl+0x587/0x5d0

This completes the symmetry lost in commit 8b1c78e06e61 ("drm/i915: Avoid
calling i915_gem_object_unbind holding object lock").

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1743
Fixes: 8b1c78e06e61 ("drm/i915: Avoid calling i915_gem_object_unbind holding object lock")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Andi Shyti <andi.shyti@intel.com>
Cc: <stable@vger.kernel.org> # v5.6+
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200420125356.26614-1-chris@chris-wilson.co.uk
(cherry picked from commit a95f3ac21d64d62c746f836598d1467d5837fa28)
(cherry picked from commit 2208b85fa1766ee4821a9435d548578b67090531)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
diff 47bf7b7a Mon Apr 20 06:53:55 MDT 2020 Chris Wilson <chris@chris-wilson.co.uk> drm/i915/gem: Remove object_is_locked assertion from unpin_from_display_plane

Since moving the obj->vma.list to a spin_lock, and the vm->bound_list to
its vm->mutex, along with tracking shrinkable status under its own
spinlock, we no long require the object to be locked by the caller.

This is fortunate as it appears we can be called with the lock along an
error path in flipping:

<4> [139.942851] WARN_ON(debug_locks && !lock_is_held(&(&((obj)->base.resv)->lock.base)->dep_map))
<4> [139.943242] WARNING: CPU: 0 PID: 1203 at drivers/gpu/drm/i915/gem/i915_gem_domain.c:405 i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943263] Modules linked in: snd_hda_intel i915 vgem snd_hda_codec_realtek snd_hda_codec_generic coretemp snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core r8169 lpc_ich snd_pcm realtek prime_numbers [last unloaded: i915]
<4> [139.943347] CPU: 0 PID: 1203 Comm: kms_flip Tainted: G U 5.6.0-gd0fda5c2cf3f1-drmtip_474+ #1
<4> [139.943363] Hardware name: /D510MO, BIOS MOPNV10J.86A.0311.2010.0802.2346 08/02/2010
<4> [139.943589] RIP: 0010:i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943589] Code: 85 28 01 00 00 be ff ff ff ff 48 8d 78 60 e8 d7 9b f0 e2 85 c0 75 b9 48 c7 c6 50 b9 38 c0 48 c7 c7 e9 48 3c c0 e8 20 d4 e9 e2 <0f> 0b eb a2 48 c7 c1 08 bb 38 c0 ba 0a 01 00 00 48 c7 c6 88 a3 35
<4> [139.943589] RSP: 0018:ffffb774c0603b48 EFLAGS: 00010282
<4> [139.943589] RAX: 0000000000000000 RBX: ffff9a142fa36e80 RCX: 0000000000000006
<4> [139.943589] RDX: 000000000000160d RSI: ffff9a142c1a88f8 RDI: ffffffffa434a64d
<4> [139.943589] RBP: ffff9a1410a513c0 R08: ffff9a142c1a88f8 R09: 0000000000000000
<4> [139.943589] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9a1436ee94b8
<4> [139.943589] R13: 0000000000000001 R14: 00000000ffffffff R15: ffff9a1410960000
<4> [139.943589] FS: 00007fc73a744e40(0000) GS:ffff9a143da00000(0000) knlGS:0000000000000000
<4> [139.943589] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [139.943589] CR2: 00007fc73997e098 CR3: 000000002f5fe000 CR4: 00000000000006f0
<4> [139.943589] Call Trace:
<4> [139.943589] intel_pin_and_fence_fb_obj+0x1c9/0x1f0 [i915]
<4> [139.943589] intel_plane_pin_fb+0x3f/0xd0 [i915]
<4> [139.943589] intel_prepare_plane_fb+0x13b/0x5c0 [i915]
<4> [139.943589] drm_atomic_helper_prepare_planes+0x85/0x110
<4> [139.943589] intel_atomic_commit+0xda/0x390 [i915]
<4> [139.943589] drm_atomic_helper_page_flip+0x9c/0xd0
<4> [139.943589] ? drm_event_reserve_init+0x46/0x60
<4> [139.943589] drm_mode_page_flip_ioctl+0x587/0x5d0

This completes the symmetry lost in commit 8b1c78e06e61 ("drm/i915: Avoid
calling i915_gem_object_unbind holding object lock").

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1743
Fixes: 8b1c78e06e61 ("drm/i915: Avoid calling i915_gem_object_unbind holding object lock")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Andi Shyti <andi.shyti@intel.com>
Cc: <stable@vger.kernel.org> # v5.6+
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200420125356.26614-1-chris@chris-wilson.co.uk
(cherry picked from commit a95f3ac21d64d62c746f836598d1467d5837fa28)
(cherry picked from commit 2208b85fa1766ee4821a9435d548578b67090531)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
diff 47bf7b7a Mon Apr 20 06:53:55 MDT 2020 Chris Wilson <chris@chris-wilson.co.uk> drm/i915/gem: Remove object_is_locked assertion from unpin_from_display_plane

Since moving the obj->vma.list to a spin_lock, and the vm->bound_list to
its vm->mutex, along with tracking shrinkable status under its own
spinlock, we no long require the object to be locked by the caller.

This is fortunate as it appears we can be called with the lock along an
error path in flipping:

<4> [139.942851] WARN_ON(debug_locks && !lock_is_held(&(&((obj)->base.resv)->lock.base)->dep_map))
<4> [139.943242] WARNING: CPU: 0 PID: 1203 at drivers/gpu/drm/i915/gem/i915_gem_domain.c:405 i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943263] Modules linked in: snd_hda_intel i915 vgem snd_hda_codec_realtek snd_hda_codec_generic coretemp snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core r8169 lpc_ich snd_pcm realtek prime_numbers [last unloaded: i915]
<4> [139.943347] CPU: 0 PID: 1203 Comm: kms_flip Tainted: G U 5.6.0-gd0fda5c2cf3f1-drmtip_474+ #1
<4> [139.943363] Hardware name: /D510MO, BIOS MOPNV10J.86A.0311.2010.0802.2346 08/02/2010
<4> [139.943589] RIP: 0010:i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943589] Code: 85 28 01 00 00 be ff ff ff ff 48 8d 78 60 e8 d7 9b f0 e2 85 c0 75 b9 48 c7 c6 50 b9 38 c0 48 c7 c7 e9 48 3c c0 e8 20 d4 e9 e2 <0f> 0b eb a2 48 c7 c1 08 bb 38 c0 ba 0a 01 00 00 48 c7 c6 88 a3 35
<4> [139.943589] RSP: 0018:ffffb774c0603b48 EFLAGS: 00010282
<4> [139.943589] RAX: 0000000000000000 RBX: ffff9a142fa36e80 RCX: 0000000000000006
<4> [139.943589] RDX: 000000000000160d RSI: ffff9a142c1a88f8 RDI: ffffffffa434a64d
<4> [139.943589] RBP: ffff9a1410a513c0 R08: ffff9a142c1a88f8 R09: 0000000000000000
<4> [139.943589] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9a1436ee94b8
<4> [139.943589] R13: 0000000000000001 R14: 00000000ffffffff R15: ffff9a1410960000
<4> [139.943589] FS: 00007fc73a744e40(0000) GS:ffff9a143da00000(0000) knlGS:0000000000000000
<4> [139.943589] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [139.943589] CR2: 00007fc73997e098 CR3: 000000002f5fe000 CR4: 00000000000006f0
<4> [139.943589] Call Trace:
<4> [139.943589] intel_pin_and_fence_fb_obj+0x1c9/0x1f0 [i915]
<4> [139.943589] intel_plane_pin_fb+0x3f/0xd0 [i915]
<4> [139.943589] intel_prepare_plane_fb+0x13b/0x5c0 [i915]
<4> [139.943589] drm_atomic_helper_prepare_planes+0x85/0x110
<4> [139.943589] intel_atomic_commit+0xda/0x390 [i915]
<4> [139.943589] drm_atomic_helper_page_flip+0x9c/0xd0
<4> [139.943589] ? drm_event_reserve_init+0x46/0x60
<4> [139.943589] drm_mode_page_flip_ioctl+0x587/0x5d0

This completes the symmetry lost in commit 8b1c78e06e61 ("drm/i915: Avoid
calling i915_gem_object_unbind holding object lock").

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1743
Fixes: 8b1c78e06e61 ("drm/i915: Avoid calling i915_gem_object_unbind holding object lock")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Andi Shyti <andi.shyti@intel.com>
Cc: <stable@vger.kernel.org> # v5.6+
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200420125356.26614-1-chris@chris-wilson.co.uk
(cherry picked from commit a95f3ac21d64d62c746f836598d1467d5837fa28)
(cherry picked from commit 2208b85fa1766ee4821a9435d548578b67090531)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
diff 47bf7b7a Mon Apr 20 06:53:55 MDT 2020 Chris Wilson <chris@chris-wilson.co.uk> drm/i915/gem: Remove object_is_locked assertion from unpin_from_display_plane

Since moving the obj->vma.list to a spin_lock, and the vm->bound_list to
its vm->mutex, along with tracking shrinkable status under its own
spinlock, we no long require the object to be locked by the caller.

This is fortunate as it appears we can be called with the lock along an
error path in flipping:

<4> [139.942851] WARN_ON(debug_locks && !lock_is_held(&(&((obj)->base.resv)->lock.base)->dep_map))
<4> [139.943242] WARNING: CPU: 0 PID: 1203 at drivers/gpu/drm/i915/gem/i915_gem_domain.c:405 i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943263] Modules linked in: snd_hda_intel i915 vgem snd_hda_codec_realtek snd_hda_codec_generic coretemp snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core r8169 lpc_ich snd_pcm realtek prime_numbers [last unloaded: i915]
<4> [139.943347] CPU: 0 PID: 1203 Comm: kms_flip Tainted: G U 5.6.0-gd0fda5c2cf3f1-drmtip_474+ #1
<4> [139.943363] Hardware name: /D510MO, BIOS MOPNV10J.86A.0311.2010.0802.2346 08/02/2010
<4> [139.943589] RIP: 0010:i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943589] Code: 85 28 01 00 00 be ff ff ff ff 48 8d 78 60 e8 d7 9b f0 e2 85 c0 75 b9 48 c7 c6 50 b9 38 c0 48 c7 c7 e9 48 3c c0 e8 20 d4 e9 e2 <0f> 0b eb a2 48 c7 c1 08 bb 38 c0 ba 0a 01 00 00 48 c7 c6 88 a3 35
<4> [139.943589] RSP: 0018:ffffb774c0603b48 EFLAGS: 00010282
<4> [139.943589] RAX: 0000000000000000 RBX: ffff9a142fa36e80 RCX: 0000000000000006
<4> [139.943589] RDX: 000000000000160d RSI: ffff9a142c1a88f8 RDI: ffffffffa434a64d
<4> [139.943589] RBP: ffff9a1410a513c0 R08: ffff9a142c1a88f8 R09: 0000000000000000
<4> [139.943589] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9a1436ee94b8
<4> [139.943589] R13: 0000000000000001 R14: 00000000ffffffff R15: ffff9a1410960000
<4> [139.943589] FS: 00007fc73a744e40(0000) GS:ffff9a143da00000(0000) knlGS:0000000000000000
<4> [139.943589] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [139.943589] CR2: 00007fc73997e098 CR3: 000000002f5fe000 CR4: 00000000000006f0
<4> [139.943589] Call Trace:
<4> [139.943589] intel_pin_and_fence_fb_obj+0x1c9/0x1f0 [i915]
<4> [139.943589] intel_plane_pin_fb+0x3f/0xd0 [i915]
<4> [139.943589] intel_prepare_plane_fb+0x13b/0x5c0 [i915]
<4> [139.943589] drm_atomic_helper_prepare_planes+0x85/0x110
<4> [139.943589] intel_atomic_commit+0xda/0x390 [i915]
<4> [139.943589] drm_atomic_helper_page_flip+0x9c/0xd0
<4> [139.943589] ? drm_event_reserve_init+0x46/0x60
<4> [139.943589] drm_mode_page_flip_ioctl+0x587/0x5d0

This completes the symmetry lost in commit 8b1c78e06e61 ("drm/i915: Avoid
calling i915_gem_object_unbind holding object lock").

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1743
Fixes: 8b1c78e06e61 ("drm/i915: Avoid calling i915_gem_object_unbind holding object lock")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Andi Shyti <andi.shyti@intel.com>
Cc: <stable@vger.kernel.org> # v5.6+
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200420125356.26614-1-chris@chris-wilson.co.uk
(cherry picked from commit a95f3ac21d64d62c746f836598d1467d5837fa28)
(cherry picked from commit 2208b85fa1766ee4821a9435d548578b67090531)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
diff 47bf7b7a Mon Apr 20 06:53:55 MDT 2020 Chris Wilson <chris@chris-wilson.co.uk> drm/i915/gem: Remove object_is_locked assertion from unpin_from_display_plane

Since moving the obj->vma.list to a spin_lock, and the vm->bound_list to
its vm->mutex, along with tracking shrinkable status under its own
spinlock, we no long require the object to be locked by the caller.

This is fortunate as it appears we can be called with the lock along an
error path in flipping:

<4> [139.942851] WARN_ON(debug_locks && !lock_is_held(&(&((obj)->base.resv)->lock.base)->dep_map))
<4> [139.943242] WARNING: CPU: 0 PID: 1203 at drivers/gpu/drm/i915/gem/i915_gem_domain.c:405 i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943263] Modules linked in: snd_hda_intel i915 vgem snd_hda_codec_realtek snd_hda_codec_generic coretemp snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core r8169 lpc_ich snd_pcm realtek prime_numbers [last unloaded: i915]
<4> [139.943347] CPU: 0 PID: 1203 Comm: kms_flip Tainted: G U 5.6.0-gd0fda5c2cf3f1-drmtip_474+ #1
<4> [139.943363] Hardware name: /D510MO, BIOS MOPNV10J.86A.0311.2010.0802.2346 08/02/2010
<4> [139.943589] RIP: 0010:i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943589] Code: 85 28 01 00 00 be ff ff ff ff 48 8d 78 60 e8 d7 9b f0 e2 85 c0 75 b9 48 c7 c6 50 b9 38 c0 48 c7 c7 e9 48 3c c0 e8 20 d4 e9 e2 <0f> 0b eb a2 48 c7 c1 08 bb 38 c0 ba 0a 01 00 00 48 c7 c6 88 a3 35
<4> [139.943589] RSP: 0018:ffffb774c0603b48 EFLAGS: 00010282
<4> [139.943589] RAX: 0000000000000000 RBX: ffff9a142fa36e80 RCX: 0000000000000006
<4> [139.943589] RDX: 000000000000160d RSI: ffff9a142c1a88f8 RDI: ffffffffa434a64d
<4> [139.943589] RBP: ffff9a1410a513c0 R08: ffff9a142c1a88f8 R09: 0000000000000000
<4> [139.943589] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9a1436ee94b8
<4> [139.943589] R13: 0000000000000001 R14: 00000000ffffffff R15: ffff9a1410960000
<4> [139.943589] FS: 00007fc73a744e40(0000) GS:ffff9a143da00000(0000) knlGS:0000000000000000
<4> [139.943589] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [139.943589] CR2: 00007fc73997e098 CR3: 000000002f5fe000 CR4: 00000000000006f0
<4> [139.943589] Call Trace:
<4> [139.943589] intel_pin_and_fence_fb_obj+0x1c9/0x1f0 [i915]
<4> [139.943589] intel_plane_pin_fb+0x3f/0xd0 [i915]
<4> [139.943589] intel_prepare_plane_fb+0x13b/0x5c0 [i915]
<4> [139.943589] drm_atomic_helper_prepare_planes+0x85/0x110
<4> [139.943589] intel_atomic_commit+0xda/0x390 [i915]
<4> [139.943589] drm_atomic_helper_page_flip+0x9c/0xd0
<4> [139.943589] ? drm_event_reserve_init+0x46/0x60
<4> [139.943589] drm_mode_page_flip_ioctl+0x587/0x5d0

This completes the symmetry lost in commit 8b1c78e06e61 ("drm/i915: Avoid
calling i915_gem_object_unbind holding object lock").

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1743
Fixes: 8b1c78e06e61 ("drm/i915: Avoid calling i915_gem_object_unbind holding object lock")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Andi Shyti <andi.shyti@intel.com>
Cc: <stable@vger.kernel.org> # v5.6+
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200420125356.26614-1-chris@chris-wilson.co.uk
(cherry picked from commit a95f3ac21d64d62c746f836598d1467d5837fa28)
(cherry picked from commit 2208b85fa1766ee4821a9435d548578b67090531)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
diff 47bf7b7a Mon Apr 20 06:53:55 MDT 2020 Chris Wilson <chris@chris-wilson.co.uk> drm/i915/gem: Remove object_is_locked assertion from unpin_from_display_plane

Since moving the obj->vma.list to a spin_lock, and the vm->bound_list to
its vm->mutex, along with tracking shrinkable status under its own
spinlock, we no long require the object to be locked by the caller.

This is fortunate as it appears we can be called with the lock along an
error path in flipping:

<4> [139.942851] WARN_ON(debug_locks && !lock_is_held(&(&((obj)->base.resv)->lock.base)->dep_map))
<4> [139.943242] WARNING: CPU: 0 PID: 1203 at drivers/gpu/drm/i915/gem/i915_gem_domain.c:405 i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943263] Modules linked in: snd_hda_intel i915 vgem snd_hda_codec_realtek snd_hda_codec_generic coretemp snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core r8169 lpc_ich snd_pcm realtek prime_numbers [last unloaded: i915]
<4> [139.943347] CPU: 0 PID: 1203 Comm: kms_flip Tainted: G U 5.6.0-gd0fda5c2cf3f1-drmtip_474+ #1
<4> [139.943363] Hardware name: /D510MO, BIOS MOPNV10J.86A.0311.2010.0802.2346 08/02/2010
<4> [139.943589] RIP: 0010:i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943589] Code: 85 28 01 00 00 be ff ff ff ff 48 8d 78 60 e8 d7 9b f0 e2 85 c0 75 b9 48 c7 c6 50 b9 38 c0 48 c7 c7 e9 48 3c c0 e8 20 d4 e9 e2 <0f> 0b eb a2 48 c7 c1 08 bb 38 c0 ba 0a 01 00 00 48 c7 c6 88 a3 35
<4> [139.943589] RSP: 0018:ffffb774c0603b48 EFLAGS: 00010282
<4> [139.943589] RAX: 0000000000000000 RBX: ffff9a142fa36e80 RCX: 0000000000000006
<4> [139.943589] RDX: 000000000000160d RSI: ffff9a142c1a88f8 RDI: ffffffffa434a64d
<4> [139.943589] RBP: ffff9a1410a513c0 R08: ffff9a142c1a88f8 R09: 0000000000000000
<4> [139.943589] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9a1436ee94b8
<4> [139.943589] R13: 0000000000000001 R14: 00000000ffffffff R15: ffff9a1410960000
<4> [139.943589] FS: 00007fc73a744e40(0000) GS:ffff9a143da00000(0000) knlGS:0000000000000000
<4> [139.943589] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [139.943589] CR2: 00007fc73997e098 CR3: 000000002f5fe000 CR4: 00000000000006f0
<4> [139.943589] Call Trace:
<4> [139.943589] intel_pin_and_fence_fb_obj+0x1c9/0x1f0 [i915]
<4> [139.943589] intel_plane_pin_fb+0x3f/0xd0 [i915]
<4> [139.943589] intel_prepare_plane_fb+0x13b/0x5c0 [i915]
<4> [139.943589] drm_atomic_helper_prepare_planes+0x85/0x110
<4> [139.943589] intel_atomic_commit+0xda/0x390 [i915]
<4> [139.943589] drm_atomic_helper_page_flip+0x9c/0xd0
<4> [139.943589] ? drm_event_reserve_init+0x46/0x60
<4> [139.943589] drm_mode_page_flip_ioctl+0x587/0x5d0

This completes the symmetry lost in commit 8b1c78e06e61 ("drm/i915: Avoid
calling i915_gem_object_unbind holding object lock").

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1743
Fixes: 8b1c78e06e61 ("drm/i915: Avoid calling i915_gem_object_unbind holding object lock")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Andi Shyti <andi.shyti@intel.com>
Cc: <stable@vger.kernel.org> # v5.6+
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200420125356.26614-1-chris@chris-wilson.co.uk
(cherry picked from commit a95f3ac21d64d62c746f836598d1467d5837fa28)
(cherry picked from commit 2208b85fa1766ee4821a9435d548578b67090531)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
diff 47bf7b7a Mon Apr 20 06:53:55 MDT 2020 Chris Wilson <chris@chris-wilson.co.uk> drm/i915/gem: Remove object_is_locked assertion from unpin_from_display_plane

Since moving the obj->vma.list to a spin_lock, and the vm->bound_list to
its vm->mutex, along with tracking shrinkable status under its own
spinlock, we no long require the object to be locked by the caller.

This is fortunate as it appears we can be called with the lock along an
error path in flipping:

<4> [139.942851] WARN_ON(debug_locks && !lock_is_held(&(&((obj)->base.resv)->lock.base)->dep_map))
<4> [139.943242] WARNING: CPU: 0 PID: 1203 at drivers/gpu/drm/i915/gem/i915_gem_domain.c:405 i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943263] Modules linked in: snd_hda_intel i915 vgem snd_hda_codec_realtek snd_hda_codec_generic coretemp snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core r8169 lpc_ich snd_pcm realtek prime_numbers [last unloaded: i915]
<4> [139.943347] CPU: 0 PID: 1203 Comm: kms_flip Tainted: G U 5.6.0-gd0fda5c2cf3f1-drmtip_474+ #1
<4> [139.943363] Hardware name: /D510MO, BIOS MOPNV10J.86A.0311.2010.0802.2346 08/02/2010
<4> [139.943589] RIP: 0010:i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943589] Code: 85 28 01 00 00 be ff ff ff ff 48 8d 78 60 e8 d7 9b f0 e2 85 c0 75 b9 48 c7 c6 50 b9 38 c0 48 c7 c7 e9 48 3c c0 e8 20 d4 e9 e2 <0f> 0b eb a2 48 c7 c1 08 bb 38 c0 ba 0a 01 00 00 48 c7 c6 88 a3 35
<4> [139.943589] RSP: 0018:ffffb774c0603b48 EFLAGS: 00010282
<4> [139.943589] RAX: 0000000000000000 RBX: ffff9a142fa36e80 RCX: 0000000000000006
<4> [139.943589] RDX: 000000000000160d RSI: ffff9a142c1a88f8 RDI: ffffffffa434a64d
<4> [139.943589] RBP: ffff9a1410a513c0 R08: ffff9a142c1a88f8 R09: 0000000000000000
<4> [139.943589] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9a1436ee94b8
<4> [139.943589] R13: 0000000000000001 R14: 00000000ffffffff R15: ffff9a1410960000
<4> [139.943589] FS: 00007fc73a744e40(0000) GS:ffff9a143da00000(0000) knlGS:0000000000000000
<4> [139.943589] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [139.943589] CR2: 00007fc73997e098 CR3: 000000002f5fe000 CR4: 00000000000006f0
<4> [139.943589] Call Trace:
<4> [139.943589] intel_pin_and_fence_fb_obj+0x1c9/0x1f0 [i915]
<4> [139.943589] intel_plane_pin_fb+0x3f/0xd0 [i915]
<4> [139.943589] intel_prepare_plane_fb+0x13b/0x5c0 [i915]
<4> [139.943589] drm_atomic_helper_prepare_planes+0x85/0x110
<4> [139.943589] intel_atomic_commit+0xda/0x390 [i915]
<4> [139.943589] drm_atomic_helper_page_flip+0x9c/0xd0
<4> [139.943589] ? drm_event_reserve_init+0x46/0x60
<4> [139.943589] drm_mode_page_flip_ioctl+0x587/0x5d0

This completes the symmetry lost in commit 8b1c78e06e61 ("drm/i915: Avoid
calling i915_gem_object_unbind holding object lock").

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1743
Fixes: 8b1c78e06e61 ("drm/i915: Avoid calling i915_gem_object_unbind holding object lock")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Andi Shyti <andi.shyti@intel.com>
Cc: <stable@vger.kernel.org> # v5.6+
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200420125356.26614-1-chris@chris-wilson.co.uk
(cherry picked from commit a95f3ac21d64d62c746f836598d1467d5837fa28)
(cherry picked from commit 2208b85fa1766ee4821a9435d548578b67090531)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
diff 47bf7b7a Mon Apr 20 06:53:55 MDT 2020 Chris Wilson <chris@chris-wilson.co.uk> drm/i915/gem: Remove object_is_locked assertion from unpin_from_display_plane

Since moving the obj->vma.list to a spin_lock, and the vm->bound_list to
its vm->mutex, along with tracking shrinkable status under its own
spinlock, we no long require the object to be locked by the caller.

This is fortunate as it appears we can be called with the lock along an
error path in flipping:

<4> [139.942851] WARN_ON(debug_locks && !lock_is_held(&(&((obj)->base.resv)->lock.base)->dep_map))
<4> [139.943242] WARNING: CPU: 0 PID: 1203 at drivers/gpu/drm/i915/gem/i915_gem_domain.c:405 i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943263] Modules linked in: snd_hda_intel i915 vgem snd_hda_codec_realtek snd_hda_codec_generic coretemp snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core r8169 lpc_ich snd_pcm realtek prime_numbers [last unloaded: i915]
<4> [139.943347] CPU: 0 PID: 1203 Comm: kms_flip Tainted: G U 5.6.0-gd0fda5c2cf3f1-drmtip_474+ #1
<4> [139.943363] Hardware name: /D510MO, BIOS MOPNV10J.86A.0311.2010.0802.2346 08/02/2010
<4> [139.943589] RIP: 0010:i915_gem_object_unpin_from_display_plane+0x70/0x130 [i915]
<4> [139.943589] Code: 85 28 01 00 00 be ff ff ff ff 48 8d 78 60 e8 d7 9b f0 e2 85 c0 75 b9 48 c7 c6 50 b9 38 c0 48 c7 c7 e9 48 3c c0 e8 20 d4 e9 e2 <0f> 0b eb a2 48 c7 c1 08 bb 38 c0 ba 0a 01 00 00 48 c7 c6 88 a3 35
<4> [139.943589] RSP: 0018:ffffb774c0603b48 EFLAGS: 00010282
<4> [139.943589] RAX: 0000000000000000 RBX: ffff9a142fa36e80 RCX: 0000000000000006
<4> [139.943589] RDX: 000000000000160d RSI: ffff9a142c1a88f8 RDI: ffffffffa434a64d
<4> [139.943589] RBP: ffff9a1410a513c0 R08: ffff9a142c1a88f8 R09: 0000000000000000
<4> [139.943589] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9a1436ee94b8
<4> [139.943589] R13: 0000000000000001 R14: 00000000ffffffff R15: ffff9a1410960000
<4> [139.943589] FS: 00007fc73a744e40(0000) GS:ffff9a143da00000(0000) knlGS:0000000000000000
<4> [139.943589] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [139.943589] CR2: 00007fc73997e098 CR3: 000000002f5fe000 CR4: 00000000000006f0
<4> [139.943589] Call Trace:
<4> [139.943589] intel_pin_and_fence_fb_obj+0x1c9/0x1f0 [i915]
<4> [139.943589] intel_plane_pin_fb+0x3f/0xd0 [i915]
<4> [139.943589] intel_prepare_plane_fb+0x13b/0x5c0 [i915]
<4> [139.943589] drm_atomic_helper_prepare_planes+0x85/0x110
<4> [139.943589] intel_atomic_commit+0xda/0x390 [i915]
<4> [139.943589] drm_atomic_helper_page_flip+0x9c/0xd0
<4> [139.943589] ? drm_event_reserve_init+0x46/0x60
<4> [139.943589] drm_mode_page_flip_ioctl+0x587/0x5d0

This completes the symmetry lost in commit 8b1c78e06e61 ("drm/i915: Avoid
calling i915_gem_object_unbind holding object lock").

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1743
Fixes: 8b1c78e06e61 ("drm/i915: Avoid calling i915_gem_object_unbind holding object lock")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Andi Shyti <andi.shyti@intel.com>
Cc: <stable@vger.kernel.org> # v5.6+
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200420125356.26614-1-chris@chris-wilson.co.uk
(cherry picked from commit a95f3ac21d64d62c746f836598d1467d5837fa28)
(cherry picked from commit 2208b85fa1766ee4821a9435d548578b67090531)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
