1\input texinfo @c -*-texinfo-*-
2
3@c ----------------------------------------------------------------------------
4@c This is the Texinfo source file for the GPROFNG manual.
5@c
6@c Author: Ruud van der Pas
7@c ----------------------------------------------------------------------------
8
9@c %**start of header
10
11@setfilename gprofng.info
12@settitle GNU gprofng
13
14@c -- Set the indent for the @example command to 1 space, not 5 ---------------
15@exampleindent 1
16
17@c %**end of header
18
19@c -- Start a new chapter on a new, odd numbered, page ------------------------
20@setchapternewpage odd
21
22@c -- Merge all index entries into the Concepts Index -------------------------
23@syncodeindex fn cp
24@syncodeindex ky cp
25@syncodeindex pg cp
26@syncodeindex vr cp
27
28@c -- Macro definitions -------------------------------------------------------
29@c
30@c Since only letters can be used, we use capitalization to distinguish
31@c different words.
32@c ----------------------------------------------------------------------------
33@macro CollectApp{}
34@command{gprofng collect app}
35@end macro
36
37@macro DisplayHTML{}
38@command{gprofng display html}
39@end macro
40
41@macro DisplayText{}
42@command{gprofng display text}
43@end macro
44
45@macro Driver{}
46@command{gprofng}
47@end macro
48
49@macro ProductName{}
50gprofng
51@end macro
52
53@macro ToolName{}
54@command{gprofng}
55@end macro
56
57@macro IndexSubentry{label, string}
58@c -- @cindex \label\ @subentry \string\
59@cindex \label\, \string\
60@end macro
61
62@c -- Get the version information ---------------------------------------------
63@include version.texi
64
65@c -- Entry for the Info dir structure ----------------------------------------
66@ifnottex
67@dircategory Software development
68@direntry
69* gprofng: (gprofng).                    The next generation profiling tool for Linux
70@end direntry
71@end ifnottex
72
73@c -- Copyright stuff ---------------------------------------------------------
74@copying
75This document is the manual for @ProductName{}, last updated @value{UPDATED}.
76
77Copyright @copyright{} 2022 Free Software Foundation, Inc.
78
79@c -- @quotation
80Permission is granted to copy, distribute and/or modify this document
81under the terms of the GNU Free Documentation License,
82Version 1.3 or any later version published by the Free Software
83Foundation; with no Invariant Sections, with no Front-Cover texts,
84and with no Back-Cover Texts.  A copy of the license is included in the
85section entitled ``GNU Free Documentation License.''
86
87@c -- @end quotation
88@end copying
89
90@finalout
91@smallbook
92
93@c -- Define the title page ---------------------------------------------------
94@titlepage
95@title GNU gprofng
96@subtitle The next generation profiling tool for Linux
97@subtitle version @value{VERSION} (last updated @value{UPDATED})
98@author Ruud van der Pas
99@page
100@vskip 0pt plus 1filll
101@insertcopying
102@end titlepage
103
104@c -- Generate the Table of Contents ------------------------------------------
105@contents
106
107@c -- The Top node ------------------------------------------------------------
108@c Should contain a short summary, copying permissions and a master menu.
109@c ----------------------------------------------------------------------------
110@ifnottex
111@node Top
112@top  GNU Gprofng
113
114@insertcopying
115@end ifnottex
116
117@ifinfo
118@c -- The menu entries --------------------------------------------------------
119
120@menu
121* Introduction::           About this manual.
122* Overview::               A brief overview of @ProductName{}.
123* A Mini Tutorial::        A short tutorial covering the key features.
124* Terminology::            Various concepts and some terminology explained.
125* Other Document Formats:: How to create this document in other formats.
126* Index::                  The index.
127
128@detailmenu
129
130--- The Detailed Node Listing ---
131
132Introduction
133
134Overview
135
136* Main Features::                     A high level overview.
137* Sampling versus Tracing::           The pros and cons of sampling versus tracing.
138* Steps Needed to Create a Profile::  How to create a profile.
139
140A Mini Tutorial
141
142* Getting Started::                 The basics of profiling with @ProductName().
143* Support for Multithreading::      Commands specific to multithreaded applications.
144* Viewing Multiple Experiments::    Analyze multiple experiments.
145* Profile Hardware Event Counters:: How to use hardware event counters.
146* Java Profiling::                  How to profile a Java application.
147
148Terminology
149
150* The Program Counter::                    What is a Program Counter?
151* Inclusive and Exclusive Metrics::        An explanation of inclusive and exclusive metrics.
152* Metric Definitions::                     Definitions associated with metrics.
153* The Viewmode::                           Select the way call stacks are presented.
154* The Selection List::                     How to define a selection.
155* Load Objects and Functions::             The components in an application.
156* The Concept of a CPU in @ProductName{}:: The definition of a CPU.
157* Hardware Event Counters Explained::      What are event counters?
158* apath::                                  Our generic definition of a path.
159
160@c -- Index
161
162@end detailmenu
163@end menu
164@end ifinfo
165
166@c -- A new node --------------------------------------------------------------
167@node    Introduction
168@chapter Introduction
169@c ----------------------------------------------------------------------------
170The @ProductName{} tool is the next generation profiler for Linux. It consists 
171of various commands to generate and display profile information.
172
173This manual starts with a tutorial how to create and interpret a profile. This
174part is highly practical and has the goal to get users up to speed as quickly
175as possible. As soon as possible, we would like to show you how to get your
176first profile on your screen.
177
178This is followed by more examples, covering many of the features. At the
179end of this tutorial, you should feel confident enough to tackle the more
180complex tasks.
181
182In a future update a more formal reference manual will be included as well.
183Since even in this tutorial we use certain terminology, we have included a
184chapter with descriptions at the end. In case you encounter unfamiliar 
185wordings or terminology, please check this chapter.
186
187One word of caution. In several cases we had to somewhat tweak the screen
188output in order to make it fit. This is why the output may look somewhat
189different when you try things yourself.
190
191For now, we wish you a smooth profiling experience with @ProductName{} and 
192good luck tackling performance bottlenecks.
193
194@c -- A new node --------------------------------------------------------------
195@c cccccc @node    A Brief Overview of @ProductName{}
196@node    Overview
197@chapter A Brief Overview of @ProductName{}
198@c ----------------------------------------------------------------------------
199
200@menu
201* Main Features::                     A high level overview.
202* Sampling versus Tracing::           The pros and cons of sampling versus tracing.
203* Steps Needed to Create a Profile::  How to create a profile.
204@end menu
205
206Before we cover this tool in quite some detail, we start with a brief overview
207of what it is, and the main features. Since we know that many of you would 
208like to get started rightaway, already in this first chapter we explain the
209basics of profiling with @ToolName{}.
210
211@c ----------------------------------------------------------------------------
212@c TBD Review this text. Probably be more specific on the gcc releases and
213@c processor specifics.
214@c ----------------------------------------------------------------------------
215
216@c -- A new node --------------------------------------------------------------
217@node    Main Features
218@section Main Features
219@c ----------------------------------------------------------------------------
220
221@noindent
222These are the main features of the @ProductName{} tool:
223
224@itemize @bullet
225
226@item
227Profiling is supported for an application written in C, C++, Java, or Scala.
228
229@c TBD Java: up to 1.8 full support, support other than for modules 
230
231@item
232Shared libraries are supported. The information is presented at the instruction
233level.
234
235@item
236The following multithreading programming models are supported: Pthreads,
237OpenMP, and Java threads.
238
239@item
240This tool works with unmodified production level executables. There is no need to 
241recompile the code, but if the @code{-g} option has been used when building
242the application, source line level information is available.
243
244@item
245The focus is on support for code generated with the @code{gcc} compiler, but 
246there is some limited support for the @code{icc} compiler as well. Future
247improvements and enhancements will focus on @code{gcc} though.
248
249@item
250Processors from Intel, AMD, and Arm are supported, but the level of support
251depends on the architectural details. In particular, hardware event counters
252may not be supported.
253
254@item 
255Several views into the data are supported. For example, a function overview
256where the time is spent, but also a source line, disassembly, call tree and 
257a caller-callees overview are available.
258
259@item
260Through filters, the user can zoom in on an area of interest.
261
262@item
263Two or more profiles can be aggregated, or used in a comparison. This comparison 
264can be obtained at the function, source line, and disassembly level.
265
266@item
267Through a scripting language, and customization of the metrics shown,
268the generation and creation of a profile can be fully automated and provide 
269tailored output.
270
271@end itemize
272
273@c -- A new node --------------------------------------------------------------
274@node    Sampling versus Tracing
275@section Sampling versus Tracing
276@c ----------------------------------------------------------------------------
277
278A key difference with some other profiling tools is that the main data 
279collection command @CollectApp{} mostly uses 
280@cindex Program Counter sampling
281@cindex PC sampling
282Program Counter (PC) sampling
283under the hood. 
284
285With @emph{sampling}, the executable is stopped at regular intervals. Each time
286it is halted, key information is gathered and stored. This includes the Program
287Counter that keeps track of where the execution is. Hence the name.
288
289Together with operational
290data, this information is stored in the experiment directory and can be
291viewed in the second phase.
292
293For example, the PC information is used to derive where the program was when
294it was halted. Since the sampling interval is known, it is relatively easy to 
295derive how much time was spent in the various parts of the program.
296
297The opposite technique is generally referred to as @emph{tracing}. With
298tracing, the target is instrumented with specific calls that collect the
299requested information.
300
301These are some of the pros and cons of PC sampling verus tracing:
302
303@itemize
304
305@item
306Since there is no need to recompile, existing executables can be used
307and the profile measures the behaviour of exactly the same executable that is
308used in production runs.
309
310With sampling, one inherently profiles a different executable because
311the calls to the instrumentation library may affect the compiler optimizations 
312and run time behaviour. 
313
314@item
315With sampling, there are very few restrictions on what can be profiled and even without
316access to the source code, a basic profile can be made.
317
318@item
319A downside of sampling is that, depending on the sampling frequency, small 
320functions may be missed or not captured accurately. Although this is rare, 
321this may happen and is the reason why the user has control over the sampling rate.
322
323@item
324While tracing produces precise information, sampling is statistical in nature.
325As a result, small variations may occur across seemingly identical runs. We
326have not observed more than a few percent deviation though. Especially if 
327the target job executed for a sufficiently long time.
328
329@item
330With sampling, it is not possible to get an accurate count how often
331functions are called.
332
333@end itemize
334
335@c -- A new node --------------------------------------------------------------
336@node    Steps Needed to Create a Profile
337@section Steps Needed to Create a Profile
338@c ----------------------------------------------------------------------------
339
340Creating a profile takes two steps. First the profile data needs to be 
341generated. This is followed by a viewing step to create a report from the
342information that has been gathered.
343
344Every @ProductName{} command starts with @ToolName{}, the name of the driver. This is followed
345by a keyword to define the high level functionality. Depending on this
346keyword, a third qualifier may be needed to further narrow down the request. 
347This combination is then followed by options that are specific to the functionality
348desired.
349
350The command to gather, or ``collect'', the performance data is called 
351@CollectApp{}. Aside from numerous options, this command takes the name
352of the target executable as an input parameter.
353
354Upon completion of the run, the performance data can be
355found in the newly created 
356@cindex Experiment directory
357experiment directory.
358
359Unless explicitly specified otherwise, a default
360name for this directory is chosen. The name is @code{test.<n>.er} where
361@code{n} is the first integer number not in use yet for such a name.
362
363For example, the first time @CollectApp{} is invoked, an experiment
364directory with the name @code{test.1.er} is created.
365
366Upon a subsequent invocation of @CollectApp{} in the same directory,
367an experiment directory with the name @code{test.2.er} will be created, 
368and so forth.
369
370Note that @CollectApp{} supports an option to explicitly name the experiment directory.
371Outside of the restriction that the name of this directory has to end
372with @code{.er}, any valid directory name can be used for this.
373
374Now that we have the performance data, the next step is to display it.
375
376@pindex @DisplayText{}
377The most commonly used command to view the performance information is 
378@DisplayText{}. This is a very extensive and customizable tool that 
379produces the information in ASCII format. 
380
381@pindex @DisplayHTML{}
382Another option is to use @DisplayHTML{}. This tool generates a directory with 
383files in html format. These can be viewed in a browser, allowing for easy 
384navigation through the profile data.
385
386@c -- A new node --------------------------------------------------------------
387@node    A Mini Tutorial 
388@chapter A Mini Tutorial 
389@c ----------------------------------------------------------------------------
390
391In this chapter we present and discuss the main functionality of @ToolName{}.
392This will be a practical approach, using an example code to generate profile
393data and show how to get various performance reports. 
394
395@menu
396* Getting Started::                 The basics of profiling with @ProductName().
397* Support for Multithreading::      Commands specific to multithreaded applications.
398* Viewing Multiple Experiments::    Analyze multiple experiments.
399* Profile Hardware Event Counters:: How to use hardware event counters.
400* Java Profiling::                  How to profile a Java application.
401@end menu
402
403@c -- A new node --------------------------------------------------------------
404@node    Getting Started
405@section Getting Started
406@c ----------------------------------------------------------------------------
407
408The information presented here provides a good and common basis for many 
409profiling tasks, but there are more features that you may want to leverage.
410
411These are covered in subsequent sections in this chapter.
412
413@menu
414* The Example Program::                        A description of the example program used.
415* A First Profile::                            How to get the first profile.
416* The Source Code View::                       Display the metrics in the source code.
417* The Disassembly View::                       Display the metrics at the instruction level.
418* Display and Define the Metrics::             An example how to customize the metrics.
419* A First Customization of the Output::        An example how to customize the output.
420* Name the Experiment Directory::              Change the name of the experiment directory.
421* Control the Number of Lines in the Output::  Change the number of lines in the tables.
422* Sorting the Performance Data::               How to set the metric to sort by.
423* Scripting::                                  Use a script to execute the commands.
424* A More Elaborate Example::                   An example of customization.
425* The Call Tree::                              Display the dynamic call tree.
426* More Information on the Experiment::         How to get additional statistics.
427* Control the Sampling Frequency::             How to control the sampling granularity.
428* Information on Load Objects::                How to get more information on load objects.
429@end menu
430
431@c -- A new node --------------------------------------------------------------
432@node       The Example Program
433@subsection The Example Program
434@c ----------------------------------------------------------------------------
435
436Throughout this guide we use the same example C code that implements the 
437multiplication of a vector of length @math{n} by an @math{m} by @math{n}
438matrix. The result is stored in a vector of length @math{m}. 
439@cindex Pthreads
440@cindex Posix Threads
441The algorithm has been parallelized using Posix Threads, or Pthreads for short.
442
443The code was built using the @code{gcc} compiler and the name of the executable 
444is
445@cindex mxv-pthreads.exe
446mxv-pthreads.exe.
447
448The matrix sizes can be set through the @code{-m} and @code{-n} options. The
449number of threads is set with the @code{-t} option. To increase the duration
450of the run, the multiplication is executed repeatedly. 
451
452This is an example that multiplies a @math{3000} by @math{2000} matrix with
453a vector of length @math{2000} using @math{2} threads:
454
455@smallexample
456@verbatim
457$ ./mxv-pthreads.exe -m 3000 -n 2000 -t 2
458mxv: error check passed - rows = 3000 columns = 2000 threads = 2
459$
460@end verbatim
461@end smallexample
462
463The program performs an internal check to verify the results are correct.
464The result of this check is printed, followed by the matrix sizes and the 
465number of threads used.
466
467@c -- A new node --------------------------------------------------------------
468@node       A First Profile
469@subsection A First Profile
470@c ----------------------------------------------------------------------------
471
472The first step is to collect the performance data. It is important to remember
473that much more information is gathered than may be shown by default. Often a
474single data collection run is sufficient to get a lot of insight.
475
476The @CollectApp{} command is used for the data collection. Nothing needs to be
477changed in the way the application is executed. The only difference is that it
478is now run under control of the tool, as shown below:
479
480@cartouche
481@smallexample
482$ gprofng collect app ./mxv.pthreads.exe -m 3000 -n 2000 -t 1
483@end smallexample
484@end cartouche
485
486This command produces the following output:
487
488@smallexample
489@verbatim
490Creating experiment database test.1.er (Process ID: 2416504) ...
491mxv: error check passed - rows = 3000 columns = 2000 threads = 1
492@end verbatim
493@end smallexample
494
495We see the message that a directory with the name @code{test.1.er} 
496has been created. 
497The application then completes as usual and we have our first experiment 
498directory that can be analyzed.
499
500The tool we use for this is called @DisplayText{}. It takes the name of
501the experiment directory as an argument.
502
503@cindex Interpreter mode 
504If invoked this way, the tool starts in the interactive @emph{interpreter} mode.
505While in this environment, commands can be given and the tool responds. This is
506illustrated below:
507
508@smallexample
509@verbatim
510$ gprofng display text test.1.er
511Warning: History and command editing is not supported on this system.
512(gp-display-text) quit
513$
514@end verbatim
515@end smallexample
516
517@cindex Command line mode 
518While useful in certain cases, we prefer to use this tool in command line mode,
519by specifying the commands to be issued when invoking the tool. The way to do
520this is to prepend the command with a hyphen (@code{-}) if used on the command
521line.
522
523For example,
524@IndexSubentry{Commands, @code{functions}}
525with the @code{functions} command we request a list of the functions that 
526have been executed and their respective CPU times:
527
528@cartouche
529@smallexample
530$ gprofng display text -functions test.1.er
531@end smallexample
532@end cartouche
533
534@smallexample
535@verbatim
536$ gprofng display text -functions test.1.er
537Functions sorted by metric: Exclusive Total CPU Time
538
539Excl.     Incl.      Name
540Total     Total
541CPU sec.  CPU sec.
5422.272     2.272      <Total>
5432.160     2.160      mxv_core
5440.047     0.103      init_data
5450.030     0.043      erand48_r
5460.013     0.013      __drand48_iterate
5470.013     0.056      drand48
5480.008     0.010      _int_malloc
5490.001     0.001      brk
5500.001     0.002      sysmalloc
5510.        0.001      __default_morecore
5520.        0.113      __libc_start_main
5530.        0.010      allocate_data
5540.        2.160      collector_root
5550.        2.160      driver_mxv
5560.        0.113      main
5570.        0.010      malloc
5580.        0.001      sbrk
559@end verbatim
560@end smallexample
561
562As easy and simple as these steps are, we do have a first profile of our program!
563There are three columns. The first two contain the 
564@cindex Total CPU time
565@emph{Total CPU Time}, 
566which 
567is the sum of the user and system time. @xref{Inclusive and Exclusive Metrics}
568for an explanation of ``exclusive'' and ``inclusive'' times.
569
570The first line echoes the metric that is used to sort the output. By default, this
571is the exclusive CPU time, but the sort metric can be changed by the user.
572
573We then see three columns with the exclusive and inclusive CPU times, plus the
574name of the function.
575
576@IndexSubentry{Miscellaneous, @code{<Total>}}
577The function with the name @code{<Total>} is not a user function, but is introduced
578by @ToolName{} and is used to display the accumulated metric values. In this case,
579we see that the total CPU time of this job was @code{2.272} seconds.
580
581With @code{2.160} seconds, function @code{mxv_core} is the most time 
582consuming function. It is also a leaf function.
583
584The next function in the list is @code{init_data}. Although the CPU time spent in
585this part is negligible, this is an interesting entry because the inclusive CPU
586time of @code{0.103} seconds is higher than the exclusive CPU time of @code{0.047}
587seconds. Clearly it is calling another function,
588or even more than one function. 
589@xref{The Call Tree} for the details how to get more information on this.
590
591The function @code{collector_root} does not look familiar. It is one of the internal
592functions used by @CollectApp{} and can be ignored. While the inclusive time is high,
593the exclusive time is zero. This means it doesn't contribute to the performance.
594
595The question is how we know where this function originates from? There is a very useful
596command to get more details on a function. @xref{Information on Load Objects}.
597
598@c -- A new node --------------------------------------------------------------
599@node       The Source Code View
600@subsection The Source Code View
601@c ----------------------------------------------------------------------------
602
603In general, you would like to focus the tuning efforts on the most time
604consuming part(s) of the program. In this case that is easy, since 2.160
605seconds on a total of 2.272 seconds is spent in function @code{mxv_core}. 
606That is 95% of the total and it is time to dig deeper and look
607@cindex Source level timings
608at the time distribution at the source code level.
609
610@IndexSubentry{Commands, @code{source}}
611The @code{source} command is used to accomplish this. It takes the name of the
612function, not the source filename, as an argument. This is demonstrated
613below, where the @DisplayText{} command is used to show the annotated
614source listing of function @code{mxv_core}.
615
616Please note that the source code has to be compiled with the @code{-g}
617option in order for the source code feature to work. Otherwise the
618location can not be determined.
619
620@cartouche
621@smallexample
622$ gprofng display text -source mxv_core test.1.er
623@end smallexample
624@end cartouche
625
626The slightly modified output is as follows:
627
628@smallexample
629@verbatim
630Source file: <apath>/mxv.c
631Object file: mxv-pthreads.exe (found as test.1.er/archives/...)
632Load Object: mxv-pthreads.exe (found as test.1.er/archives/...)
633
634   Excl.     Incl.
635   Total     Total
636   CPU sec.  CPU sec.
637
638   <lines deleted>
639                               <Function: mxv_core>
640   0.        0.             32. void __attribute__ ((noinline)) 
641                                mxv_core (
642                                uint64_t row_index_start, 
643                                uint64_t row_index_end,
644                                uint64_t m, uint64_t n, 
645                                double **restrict A,
646                                double *restrict b, 
647                                double *restrict c)
648   0.        0.             33. {
649   0.        0.             34.    for (uint64_t i=row_index_start; 
650                                        i<=row_index_end; i++) {
651   0.        0.             35.       double row_sum = 0.0;
652## 1.687     1.687          36.       for (int64_t j=0; j<n; j++)
653   0.473     0.473          37.          row_sum += A[i][j]*b[j];
654   0.        0.             38.       c[i] = row_sum;
655                            39.    }
656   0.        0.             40. }
657@end verbatim
658@end smallexample
659
660The first three lines provide information on the location of the source file,
661the object file and the load object (@xref{Load Objects and Functions}).
662
663Function @code{mxv_core} is part of a source file that has other functions
664as well. These functions will be shown, but without timing information. They
665have been removed in the output shown above.
666
667This is followed by the annotated source code listing. The selected metrics 
668are shown first, followed by a source line number, and the source code.
669@IndexSubentry{Miscellaneous ,@code{##}}
670The most time consuming line(s) are marked with the @code{##} symbol. In
671this way they are easier to find.
672
673What we see is that all of the time is spent in lines 36-37. 
674
675@IndexSubentry{Commands, @code{lines}}
676A related command sometimes comes handy as well. It is called @code{lines}
677and displays a list of the source lines and their metrics, ordered according
678to the current sort metric (@xref{Sorting the Performance Data}).
679
680Below the command and the output. For lay-out reasons, only the top 10 is 
681shown here and the last part of the text on some lines has been replaced
682by dots.
683
684@cartouche
685@smallexample
686$ gprofng display text -lines test.1.er
687@end smallexample
688@end cartouche
689
690@smallexample
691@verbatim
692Lines sorted by metric: Exclusive Total CPU Time
693
694Excl.     Incl.  Name
695Total     Total
696CPU sec.  CPU sec.
6972.272     2.272  <Total>
6981.687     1.687  mxv_core, line 36 in "mxv.c"
6990.473     0.473  mxv_core, line 37 in "mxv.c"
7000.032     0.088  init_data, line 72 in "manage_data.c"
7010.030     0.043  <Function: erand48_r, instructions without line numbers>
7020.013     0.013  <Function: __drand48_iterate, instructions without ...>
7030.013     0.056  <Function: drand48, instructions without line numbers>
7040.012     0.012  init_data, line 77 in "manage_data.c"
7050.008     0.010  <Function: _int_malloc, instructions without ...>
7060.003     0.003  init_data, line 71 in "manage_data.c"
707@end verbatim
708@end smallexample
709
710What this overview immediately highlights is that the next most time consuming
711source line takes 0.032 seconds only. With an inclusive time of 0.088 seconds,
712it is also clear that this branch of the code does not impact the performance.
713
714@c -- A new node --------------------------------------------------------------
715@node       The Disassembly View
716@subsection The Disassembly View
717@c ----------------------------------------------------------------------------
718
719The source view is very useful to obtain more insight where the time is spent,
720but sometimes this is not sufficient. This is when the disassembly view comes
721in. It is activated with the 
722@IndexSubentry{Commands, @code{disasm}}
723@code{disasm} 
724command and as with the source view, it displays an annotated listing. In this
725@cindex Instruction level timings
726case it shows the instructions with the metrics, interleaved with the
727source lines. The
728instructions have a reference in square brackets (@code{[} and @code{]})
729to the source line they correspond to.
730
731@noindent
732This is what we get for our example:
733
734@cartouche
735@smallexample
736$ gprofng display text -disasm mxv_core test.1.er
737@end smallexample
738@end cartouche
739
740@smallexample
741@verbatim
742Source file: <apath>/mxv.c
743Object file: mxv-pthreads.exe (found as test.1.er/archives/...)
744Load Object: mxv-pthreads.exe (found as test.1.er/archives/...)
745
746   Excl.     Incl.
747   Total     Total
748   CPU sec.  CPU sec.
749
750   <lines deleted>
751                        32. void __attribute__ ((noinline)) 
752                            mxv_core (
753                            uint64_t row_index_start, 
754                            uint64_t row_index_end,
755                            uint64_t m, uint64_t n, 
756                            double **restrict A,
757                            double *restrict b, 
758                            double *restrict c)
759                        33. {
760                            <Function: mxv_core>
761   0.        0.             [33]   4021ba:  mov    0x8(%rsp),%r10
762                        34.    for (uint64_t i=row_index_start;
763                                    i<=row_index_end; i++) {
764   0.        0.             [34]   4021bf:  cmp    %rsi,%rdi
765   0.        0.             [34]   4021c2:  jbe    0x37
766   0.        0.             [34]   4021c4:  ret
767                        35.        double row_sum = 0.0;
768                        36.        for (int64_t j=0; j<n; j++)
769                        37.           row_sum += A[i][j]*b[j];
770   0.        0.             [37]   4021c5:  mov    (%r8,%rdi,8),%rdx
771   0.        0.             [36]   4021c9:  mov    $0x0,%eax
772   0.        0.             [35]   4021ce:  pxor   %xmm1,%xmm1
773   0.002     0.002          [37]   4021d2:  movsd  (%rdx,%rax,8),%xmm0
774   0.096     0.096          [37]   4021d7:  mulsd  (%r9,%rax,8),%xmm0
775   0.375     0.375          [37]   4021dd:  addsd  %xmm0,%xmm1
776## 1.683     1.683          [36]   4021e1:  add    $0x1,%rax
777   0.004     0.004          [36]   4021e5:  cmp    %rax,%rcx
778   0.        0.             [36]   4021e8:  jne    0xffffffffffffffea
779                        38.        c[i] = row_sum;
780   0.        0.             [38]   4021ea:  movsd  %xmm1,(%r10,%rdi,8)
781   0.        0.             [34]   4021f0:  add    $0x1,%rdi
782   0.        0.             [34]   4021f4:  cmp    %rdi,%rsi
783   0.        0.             [34]   4021f7:  jb     0xd
784   0.        0.             [35]   4021f9:  pxor   %xmm1,%xmm1
785   0.        0.             [36]   4021fd:  test   %rcx,%rcx
786   0.        0.             [36]   402200:  jne    0xffffffffffffffc5
787   0.        0.             [36]   402202:  jmp    0xffffffffffffffe8
788                        39.    }
789                        40. }
790   0.        0.             [40]   402204:  ret
791@end verbatim
792@end smallexample
793
794For each instruction, the timing values are given and we can exactly which ones
795are the most expensive. As with the source level view, the most expensive 
796instructions are market with the @code{##} symbol.
797
798As illustrated below and similar to the @code{lines} command, we can get 
799an overview of the instructions executed by using the 
800@IndexSubentry{Commands, @code{pcs}}
801@code{pcs} 
802command. 
803
804@noindent
805Below the command and the output, which again has been restricted
806to 10 lines:
807
808@cartouche
809@smallexample
810$ gprofng display text -pcs test.1.er
811@end smallexample
812@end cartouche
813
814@smallexample
815@verbatim
816PCs sorted by metric: Exclusive Total CPU Time
817
818Excl.     Incl.      Name
819Total     Total
820CPU sec.  CPU sec.
8212.272     2.272  <Total>
8221.683     1.683  mxv_core + 0x00000027, line 36 in "mxv.c"
8230.375     0.375  mxv_core + 0x00000023, line 37 in "mxv.c"
8240.096     0.096  mxv_core + 0x0000001D, line 37 in "mxv.c"
8250.027     0.027  init_data + 0x000000BD, line 72 in "manage_data.c"
8260.012     0.012  init_data + 0x00000117, line 77 in "manage_data.c"
8270.008     0.008  _int_malloc + 0x00000A45
8280.007     0.007  erand48_r + 0x00000062
8290.006     0.006  drand48 + 0x00000000
8300.005     0.005  __drand48_iterate + 0x00000005
831@end verbatim
832@end smallexample
833
834@c -- A new node --------------------------------------------------------------
835@node       Display and Define the Metrics
836@subsection Display and Define the Metrics
837@c ----------------------------------------------------------------------------
838
839The default metrics shown by @DisplayText{} are useful, but there is more
840recorded than displayed. We can customize the values shown by defining the 
841metrics ourselves.
842
843@IndexSubentry{Commands, @code{metric_list}}
844There are two commands related to changing the metrics shown: @code{metric_list}
845and 
846@IndexSubentry{Commands, @code{metrics}}
847@code{metrics}.
848
849The first command shows the metrics in use, plus all the metrics that have 
850been stored as part of the experiment. The second command may be used to
851define the metric list.
852
853In our example we get the following values for the metrics:
854
855@IndexSubentry{Commands, @code{metric_list}}
856@cartouche
857@smallexample
858$ gprofng display text -metric_list test.1.er
859@end smallexample
860@end cartouche
861
862@smallexample
863@verbatim
864Current metrics: e.totalcpu:i.totalcpu:name
865Current Sort Metric: Exclusive Total CPU Time ( e.totalcpu )
866Available metrics:
867   Exclusive Total CPU Time: e.%totalcpu
868   Inclusive Total CPU Time: i.%totalcpu
869                       Size: size
870                 PC Address: address
871                       Name: name
872@end verbatim
873@end smallexample
874
875This shows the metrics currently in use, the metric that is used to sort
876the data and all the metrics that have been recorded, but are not necessarily
877shown.
878
879@cindex Default metrics
880In this case, the default metrics are set to the exclusive and inclusive
881total CPU times, plus the name of the function, or load object.
882
883@IndexSubentry{Commands, @code{metrics}}
884The @code{metrics} command is used to define the metrics that need to be
885displayed. 
886
887For example, to display the exclusive total CPU time, both as a number and a
888percentage, use the following metric definition: @code{e.%totalcpu}
889
890Since the metrics can be tailored for different views, there is a way
891to reset them to the default. This is done through the special keyword
892@code{default}.
893
894@c -- A new node --------------------------------------------------------------
895@node    A First Customization of the Output
896@subsection A First Customization of the Output
897@c ----------------------------------------------------------------------------
898
899With the information just given, we can customize the function overview. 
900For sake of the example, we would like to display the name of the function
901first, followed by the exclusive CPU time, given as an absolute number and 
902a percentage.
903
904Note that the commands are parsed in order of appearance. This is why we
905need to define the metrics @emph{before} requesting the function overview:
906
907@cartouche
908@smallexample
909$ gprofng display text -metrics name:e.%totalcpu -functions test.1.er
910@end smallexample
911@end cartouche
912
913@smallexample
914@verbatim
915Current metrics: name:e.%totalcpu
916Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
917Functions sorted by metric: Exclusive Total CPU Time
918
919Name                Excl. Total
920                    CPU
921                     sec.      %
922 <Total>            2.272 100.00
923 mxv_core           2.160  95.04
924 init_data          0.047   2.06
925 erand48_r          0.030   1.32
926 __drand48_iterate  0.013   0.57
927 drand48            0.013   0.57
928 _int_malloc        0.008   0.35
929 brk                0.001   0.04
930 sysmalloc          0.001   0.04
931 __default_morecore 0.      0.
932 __libc_start_main  0.      0.
933 allocate_data      0.      0.
934 collector_root     0.      0.
935 driver_mxv         0.      0.
936 main               0.      0.
937 malloc             0.      0.
938 sbrk               0.      0.
939@end verbatim
940@end smallexample
941
942This was a first and simple example how to customize the output. Note that we
943did not rerun our profiling job and merely modified the display settings.
944Below we will show other and also more advanced examples of customization.
945
946
947@c -- A new node --------------------------------------------------------------
948@node    Name the Experiment Directory
949@subsection Name the Experiment Directory
950@c ----------------------------------------------------------------------------
951
952When using @CollectApp{}, the default names for experiments work fine, but
953they are quite generic. It is often more convenient to select a more 
954descriptive name. For example, one that reflects conditions for the experiment 
955conducted.
956
957For this, the mutually exclusive @code{-o} and @code{-O} options come in handy. 
958Both may be used to provide a name for the experiment directory, but the
959behaviour of @CollectApp{} is different.
960
961With the 
962@IndexSubentry{Options, @code{-o}}
963@code{-o} 
964option, an existing experiment directory is not overwritten. You either
965need to explicitly remove an existing directory first, or use a name that is not
966in use yet.
967
968This is in contrast with the behaviour for the
969 @IndexSubentry{Options, @code{-O}}
970@code{-O} 
971option. Any existing (experiment) directory with the same name is silently 
972overwritten.
973
974Be aware that the name of the experiment directory has to end with @code{.er}.
975
976@c -- A new node --------------------------------------------------------------
977@node    Control the Number of Lines in the Output
978@subsection Control the Number of Lines in the Output
979@c ----------------------------------------------------------------------------
980
981@IndexSubentry{Commands, @code{limit}}
982The @code{limit <n>} command can be used to control the number of lines printed
983in various overviews, including the function view, but it also takes effect
984for other display commands, like @code{lines}.
985
986The argument @code{<n>} should be a positive integer number. It sets the number
987of lines in the function view. A value of zero resets the limit to the default.
988
989Be aware that the pseudo-function @code{<Total>} counts as a regular function.
990For example @code{limit 10} displays nine user level functions.
991
992@c -- A new node --------------------------------------------------------------
993@node    Sorting the Performance Data
994@subsection Sorting the Performance Data
995@c ----------------------------------------------------------------------------
996
997@IndexSubentry{Commands, @code{sort}}
998The @code{sort <key>} command sets the key to be used when sorting the 
999performance data.
1000
1001The key is a valid metric definition, but the
1002@cindex Visibility field
1003visibility field 
1004(@xref{Metric Definitions})
1005in the metric
1006definition is ignored since this does not affect the outcome of the sorting
1007operation.
1008For example if we set the sort key to @code{e.totalcpu}, the values
1009will be sorted in descending order with respect to the exclusive total
1010CPU time.
1011
1012The data can be sorted in reverse order by prepending the metric definition
1013with a minus (@code{-}) sign. For example @code{sort -e.totalcpu}.
1014
1015A default metric for the sort operation has been defined and since this is 
1016a persistent command, this default can be restored with @code{default} as 
1017the key.
1018
1019@c -- A new node --------------------------------------------------------------
1020@node    Scripting
1021@subsection Scripting
1022@c ----------------------------------------------------------------------------
1023
1024As is probably clear by now, the list with commands for @DisplayText{} can be
1025very long. This is tedious and also error prone. Luckily, there is an easier and 
1026more elegant way to control the behaviour of this tool.
1027
1028@IndexSubentry{Commands, @code{script}}
1029Through the @code{script} command, the name of a file with commands can be
1030passed in. These commands are parsed and executed as if they appeared on
1031the command line in the same order as encountered in the file. The commands
1032in this script file can actually be mixed with commands on the command line.
1033
1034The difference between the commands in the script file and those used on the
1035command line is that the latter require a leading dash (@code{-}) symbol.
1036
1037Comment lines are supported. They need to start with the @code{#} symbol.
1038
1039@c -- A new node --------------------------------------------------------------
1040@node       A More Elaborate Example
1041@subsection A More Elaborate Example
1042@c ----------------------------------------------------------------------------
1043
1044With the information presented so far, we can customize our data
1045gathering and display commands.
1046
1047As an example, to reflect the name of the algorithm and the number of threads 
1048that were used in the experiment, we select @code{mxv.1.thr.er} 
1049as the name of the experiment directory.
1050All we then need to 
1051do is to add the 
1052 @IndexSubentry{Options, @code{-O}}
1053@code{-O} 
1054option followed by this name on the command line when running @CollectApp{}:
1055
1056@cartouche
1057@smallexample
1058$ exe=mxv-pthreads.exe
1059$ m=3000
1060$ n=2000
1061$ gprofng collect app -O mxv.1.thr.er ./$exe -m $m -n $n -t 1
1062@end smallexample
1063@end cartouche
1064
1065The commands to generate the profile are put into a file that we simply call
1066@code{my-script}:
1067
1068@smallexample
1069@verbatim
1070$ cat my-script
1071# This is my first gprofng script
1072# Set the metrics
1073metrics i.%totalcpu:e.%totalcpu:name
1074# Use the exclusive time to sort
1075sort e.totalcpu
1076# Limit the function list to 5 lines
1077limit 5
1078# Show the function list
1079functions
1080@end verbatim
1081@end smallexample
1082
1083This script file is then specified as input to the @DisplayText{} command 
1084that is used to display the performance information stored in 
1085@code{mxv.1.thr.er}:
1086
1087@cartouche
1088@smallexample
1089$ gprofng display text -script my-script mxv.1.thr.er
1090@end smallexample
1091@end cartouche
1092
1093The command above produces the following output:
1094
1095@smallexample
1096@verbatim
1097# This is my first gprofng script
1098# Set the metrics
1099Current metrics: i.%totalcpu:e.%totalcpu:name
1100Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
1101# Use the exclusive time to sort
1102Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
1103# Limit the function list to 5 lines
1104Print limit set to 5
1105# Show the function list
1106Functions sorted by metric: Exclusive Total CPU Time
1107
1108Incl. Total   Excl. Total    Name
1109CPU           CPU
1110 sec.      %   sec.      %
11112.272 100.00  2.272 100.00   <Total>
11122.159  95.00  2.159  95.00   mxv_core
11130.102   4.48  0.054   2.37   init_data
11140.035   1.54  0.025   1.10   erand48_r
11150.048   2.11  0.013   0.57   drand48
1116@end verbatim
1117@end smallexample
1118
1119In the first part of the output, our comment lines in the script file are 
1120shown. These are interleaved with an acknowledgement message for the commands.
1121
1122This is followed by a profile consisting of 5 lines only. For both metrics,
1123the percentages plus the timings are given. The numbers are sorted with respect 
1124to the exclusive total CPU time.
1125
1126It is now immediately clear that function @code{mxv_core} is responsbile for
112795% of the CPU time and @code{init_data} takes 4.5% only.
1128
1129This is also where we see sampling in action. Although this is exactly the
1130same job we profiled before, the timings are somewhat different, but the
1131differences are very small.
1132
1133@c -- A new node --------------------------------------------------------------
1134@node       The Call Tree
1135@subsection The Call Tree
1136@c ----------------------------------------------------------------------------
1137
1138The call tree shows the dynamic hierarchy of the application by displaying the
1139functions executed and their parent. It helps to find the most expensive path
1140in the program.
1141
1142@IndexSubentry{Commands, @code{calltree}}
1143This feature is enabled through the @code{calltree} command. This is how to get
1144this tree for our current experiment:
1145
1146@cartouche
1147@smallexample
1148$ gprofng display text -calltree mxv.1.thr.er
1149@end smallexample
1150@end cartouche
1151
1152This displays the following structure:
1153
1154@smallexample
1155@verbatim
1156Functions Call Tree. Metric: Attributed Total CPU Time
1157
1158Attr.      Name
1159Total
1160CPU sec.
11612.272      +-<Total>
11622.159        +-collector_root
11632.159        |  +-driver_mxv
11642.159        |    +-mxv_core
11650.114        +-__libc_start_main
11660.114          +-main
11670.102            +-init_data
11680.048            |  +-drand48
11690.035            |    +-erand48_r
11700.010            |      +-__drand48_iterate
11710.011            +-allocate_data
11720.011            |  +-malloc
11730.011            |    +-_int_malloc
11740.001            |      +-sysmalloc
11750.001            +-check_results
11760.001              +-malloc
11770.001                +-_int_malloc
1178@end verbatim
1179@end smallexample
1180
1181At first sight this may not be what you expected and some explanation is in
1182place.
1183
1184@c ----------------------------------------------------------------------------
1185@c TBD: Revise this text when we have user and machine mode.
1186@c ----------------------------------------------------------------------------
1187First of all, function @code{collector_root} is internal to @ToolName{} and
1188should be hidden to the user. This is part of a planned future enhancement.
1189
1190Recall that the @code{objects} and @code{fsingle} commands are very useful
1191to find out more about load objects in general, but also to help identify
1192an unknown entry in the function overview. @xref{Load Objects and Functions}.
1193
1194Another thing to note is that there are two main branches. The one under
1195@code{collector_root} and the second one under @code{__libc_start_main}.
1196This reflects the fact that we are executing a parallel program. Even though
1197we only used one thread for this run, this is still executed in a separate
1198path.
1199
1200The main, sequential part of the program is displayed under @code{main} and
1201shows the functions called and the time they took.
1202
1203There are two things worth noting for the call tree feature:
1204
1205@itemize
1206
1207@item
1208This is a dynamic tree and since sampling is used, it most likely looks
1209slighlty different across seemingly identical profile runs. In case the
1210run times are short, it is worth considering to use a high resolution
1211through the 
1212@IndexSubentry{Options, @code{-p}}
1213@code{-p} 
1214option. For example to use @code{-p hi} to increase the sampling rate.
1215
1216@item
1217In case hardware event counters have been enabled 
1218(@xref{Profile Hardware Event Counters}), these values are also displayed
1219in the call tree view.
1220
1221@end itemize
1222
1223@c -- A new node --------------------------------------------------------------
1224@node       More Information on the Experiment
1225@subsection More Information on the Experiment
1226@c ----------------------------------------------------------------------------
1227
1228The experiment directory not only contains performance related data. Several
1229system characteristics, the actually command executed, and some global 
1230performance statistics can be displayed.
1231
1232@IndexSubentry{Commands, @code{header}}
1233The @code{header} command displays information about the experiment(s).
1234For example, this is the command to extract this data from for our experiment 
1235directory:
1236
1237@cartouche
1238@smallexample
1239$ gprofng display text -header mxv.1.thr.er
1240@end smallexample
1241@end cartouche
1242
1243The above command prints the following information. Note that some of the
1244lay-out and the information has been modified. The textual changes are 
1245marked with the @code{<} and @code{>} symbols.
1246
1247@smallexample
1248@verbatim
1249Experiment: mxv.1.thr.er
1250No errors
1251No warnings
1252Archive command `gp-archive -n -a on 
1253         --outfile <exp_dir>/archive.log <exp_dir>'
1254
1255Target command (64-bit): './mxv-pthreads.exe -m 3000 -n 2000 -t 1'
1256Process pid 30591, ppid 30589, pgrp 30551, sid 30468
1257Current working directory: <cwd>
1258Collector version: `2.36.50'; experiment version 12.4 (64-bit)
1259Host `<hostname>', OS `Linux <version>', page size 4096, 
1260     architecture `x86_64'
1261  16 CPUs, clock speed 1995 MHz.
1262  Memory: 30871514 pages @  4096 = 120591 MB.
1263Data collection parameters:
1264  Clock-profiling, interval = 997 microsecs.
1265  Periodic sampling, 1 secs.
1266  Follow descendant processes from: fork|exec|combo
1267
1268Experiment started <date and time>
1269
1270Experiment Ended: 2.293162658
1271Data Collection Duration: 2.293162658
1272@end verbatim
1273@end smallexample
1274
1275The output above may assist in troubleshooting, or to verify some of the
1276operational conditions and we recommand to include this command when 
1277generating a profile.
1278
1279@IndexSubentry{Options, @code{-C}}
1280Related to this command there is a useful option to record your own comment(s) in 
1281an experiment.
1282To this end, use the @code{-C} option on the @CollectApp{} tool to
1283specify a comment string. Up to ten comment lines can be included. 
1284These comments are displayed with the @code{header} command on
1285the @DisplayText{} tool.
1286
1287@IndexSubentry{Commands, @code{overview}}
1288The @code{overview} command displays information on the experiment(s) and also
1289shows a summary of the values for the metric(s) used. This is an example how to
1290use it on our newly created experiment directory:
1291
1292@cartouche
1293@smallexample
1294$ gprofng display text -overview mxv.1.thr.er
1295@end smallexample
1296@end cartouche
1297
1298@smallexample
1299@verbatim
1300Experiment(s):
1301
1302Experiment      :mxv.1.thr.er
1303  Target        : './mxv-pthreads.exe -m 3000 -n 2000 -t 1'
1304  Host          : <hostname> (<ISA>, Linux <version>)
1305  Start Time    : <date and time>
1306  Duration      : 2.293 Seconds
1307
1308Metrics:
1309
1310  Experiment Duration (Seconds): [2.293]
1311  Clock Profiling
1312    [X]Total CPU Time - totalcpu (Seconds): [*2.272]
1313
1314Notes: '*' indicates hot metrics, '[X]' indicates currently enabled 
1315       metrics.
1316       The metrics command can be used to change selections. The 
1317       metric_list command lists all available metrics.
1318@end verbatim
1319@end smallexample
1320
1321This command provides a dashboard overview that helps to easily identify
1322where the time is spent and in case hardware event counters are used, it
1323shows their total values.
1324
1325@c -- A new node --------------------------------------------------------------
1326@node       Control the Sampling Frequency
1327@subsection Control the Sampling Frequency
1328@c ----------------------------------------------------------------------------
1329
1330So far we did not talk about the frequency of the sampling process, but in
1331some cases it is useful to change the default of 10 milliseconds.
1332
1333The advantage of increasing the sampling frequency is that functions that
1334do not take much time per invocation are more accurately captured. The
1335downside is that more data is gathered. This has an impact on the overhead
1336of the collection process and more disk space is required. 
1337
1338In general this is not an immediate concern, but with heavily threaded
1339applications that run for an extended period of time, increasing the 
1340frequency may have a more noticeable impact.
1341
1342@IndexSubentry{Options, @code{-p}}
1343The @code{-p} option on the @CollectApp{} tool is used to enable or disable
1344clock based profiling, or to explicitly set the sampling rate. 
1345@cindex Sampling interval
1346This option takes one of the following keywords:
1347
1348@table @code
1349
1350@item off
1351Disable clock based profiling.
1352
1353@item on
1354Enable clock based profiling with a per thread sampling interval of 10 ms. This is the default.
1355
1356@item lo
1357Enable clock based profiling with a per thread sampling interval of 100 ms.
1358
1359@item hi
1360Enable clock based profiling with a per thread sampling interval of 1 ms.
1361
1362@item <value>
1363Enable clock based profiling with a per thread sampling interval of <value>. 
1364
1365@end table 
1366
1367One may wonder why there is an option to disable clock based profiling. This
1368is because by default, it is enabled when conducting hardware event counter
1369experiments (@xref{Profile Hardware Event Counters}).
1370With the @code{-p off} option, this can be disabled.
1371
1372If an explicit value is set for the sampling, the number can be an integer or a 
1373floating-point number.
1374A  suffix of @code{u} for microseconds, or @code{m} for milliseconds is supported. 
1375If no suffix is used, the value is assumed to be in milliseconds.
1376
1377If the value is smaller than the clock profiling minimum, a warning message is issued
1378and it is set to the minimum.
1379In case it is not a multiple of the clock profiling resolution, it is silently rounded 
1380down to the nearest multiple of the clock resolution. 
1381
1382If the value exceeds the clock profiling maximum, is negative, or zero, an error is 
1383reported.
1384
1385@IndexSubentry{Commands, @code{header}}
1386Note that the @code{header} command echoes the sampling rate used.
1387
1388@c -- A new node --------------------------------------------------------------
1389@node    Information on Load Objects
1390@subsection Information on Load Objects
1391@c ----------------------------------------------------------------------------
1392
1393It may happen that the function list contains a function that is not known to 
1394the user. This can easily happen with library functions for example.
1395Luckily there are three commands that come in handy then. 
1396
1397@IndexSubentry{Commands, @code{objects}}
1398@IndexSubentry{Commands, @code{fsingle}}
1399@IndexSubentry{Commands, @code{fsummary}}
1400These commands are @code{objects}, @code{fsingle}, and @code{fsummary}. 
1401They provide details on
1402@cindex Load objects
1403load objects (@xref{Load Objects and Functions}).
1404
1405The @code{objects} command lists all load objects that have been referenced 
1406during the performance experiment.
1407Below we show the command and the result for our profile job. Like before, 
1408the (long) path names in the output have been shortened and replaced by the 
1409@IndexSubentry{Miscellaneous, @code{<apath>}}
1410@code{<apath>} symbol that represents an absolute directory path.
1411
1412@cartouche
1413@smallexample
1414$ gprofng display text -objects mxv.1.thr.er
1415@end smallexample
1416@end cartouche
1417
1418The output includes the name and path of the target executable:
1419
1420@smallexample
1421@verbatim
1422 <Unknown> (<Unknown>)
1423 <mxv-pthreads.exe> (<apath>/mxv-pthreads.exe)
1424 <librt-2.17.so> (/usr/lib64/librt-2.17.so)
1425 <libdl-2.17.so> (/usr/lib64/libdl-2.17.so)
1426 <libbfd-2.36.50.20210505.so> (<apath>/libbfd-2.36.50 <etc>)
1427 <libopcodes-2.36.50.20210505.so> (<apath>/libopcodes-2. <etc>)
1428 <libc-2.17.so> (/usr/lib64/libc-2.17.so)
1429 <libpthread-2.17.so> (/usr/lib64/libpthread-2.17.so)
1430 <libm-2.17.so> (/usr/lib64/libm-2.17.so)
1431 <libgp-collector.so> (<apath>/libgp-collector.so)
1432 <ld-2.17.so> (/usr/lib64/ld-2.17.so)
1433 <DYNAMIC_FUNCTIONS> (DYNAMIC_FUNCTIONS)
1434@end verbatim
1435@end smallexample
1436
1437@IndexSubentry{Commands, @code{fsingle}}
1438The @code{fsingle} command may be used to get more details on a specific entry 
1439in the function view, say. For example, the command below provides additional
1440information on the @code{collector_root} function shown in the function overview.
1441
1442@cartouche
1443@smallexample
1444$ gprofng display text -fsingle collector_root mxv.1.thr.er
1445@end smallexample
1446@end cartouche
1447
1448Below the output from this command. It has been somewhat modified to match the
1449display requirements.
1450
1451@smallexample
1452@verbatim
1453collector_root
1454  Exclusive Total CPU Time: 0.    (  0. %)
1455  Inclusive Total CPU Time: 2.159 ( 95.0%)
1456            Size:   401
1457      PC Address: 10:0x0001db60
1458     Source File: <apath>/dispatcher.c
1459     Object File: mxv.1.thr.er/archives/libgp-collector.so_HpzZ6wMR-3b
1460     Load Object: <apath>/libgp-collector.so
1461    Mangled Name:
1462         Aliases:
1463@end verbatim
1464@end smallexample
1465
1466In this table we not only see how much time was spent in this function, we
1467also see where it originates from. In addition to this, the size and start
1468address are given as well. If the source code location is known it is also 
1469shown here.
1470
1471@IndexSubentry{Commands, @code{fsummary}}
1472The related @code{fsummary} command displays the same information as 
1473@code{fsingle}, but for all functions in the function overview, 
1474including @code{<Total>}:
1475
1476@cartouche
1477@smallexample
1478$ gprofng display text -fsummary mxv.1.thr.er
1479@end smallexample
1480@end cartouche
1481
1482@smallexample
1483@verbatim
1484Functions sorted by metric: Exclusive Total CPU Time
1485
1486<Total>
1487  Exclusive Total CPU Time: 2.272 (100.0%)
1488  Inclusive Total CPU Time: 2.272 (100.0%)
1489            Size:     0
1490      PC Address: 1:0x00000000
1491     Source File: (unknown)
1492     Object File: (unknown)
1493     Load Object: <Total>
1494    Mangled Name:
1495         Aliases:
1496
1497mxv_core
1498  Exclusive Total CPU Time: 2.159 ( 95.0%)
1499  Inclusive Total CPU Time: 2.159 ( 95.0%)
1500            Size:    75
1501      PC Address: 2:0x000021ba
1502     Source File: <apath>/mxv.c
1503     Object File: mxv.1.thr.er/archives/mxv-pthreads.exe_hRxWdccbJPc
1504     Load Object: <apath>/mxv-pthreads.exe
1505    Mangled Name:
1506         Aliases:
1507
1508          ... etc ...
1509@end verbatim
1510@end smallexample
1511
1512@c -- A new node --------------------------------------------------------------
1513@node    Support for Multithreading
1514@section Support for Multithreading
1515@c ----------------------------------------------------------------------------
1516
1517In this chapter we introduce and discuss the support for multithreading. As
1518is shown below, nothing needs to be changed when collecting the performance 
1519data.
1520
1521The difference is that additional commands are available to get more 
1522information on the parallel environment, plus that several filters allow
1523the user to zoom in on specific threads.
1524
1525@c -- A new node --------------------------------------------------------------
1526@node       Creating a Multithreading Experiment
1527@subsection Creating a Multithreading Experiment
1528@c ----------------------------------------------------------------------------
1529
1530We demonstrate the support for multithreading using the same code and settings
1531as before, but this time we use 2 threads:
1532
1533@cartouche
1534@smallexample
1535$ exe=mxv-pthreads.exe
1536$ m=3000
1537$ n=2000
1538$ gprofng collect app -O mxv.2.thr.er ./$exe -m $m -n $n -t 2
1539@end smallexample
1540@end cartouche
1541
1542First of all, note that we did not change anything, other than setting the 
1543number of threads to 2. Nothing special is needed to profile a multithreaded
1544job when using @ToolName{}.
1545
1546The same is true when displaying the performance results. The same commands
1547that we used before work unmodified. For example, this is all that is needed to 
1548get a function overview:
1549
1550@cartouche
1551@smallexample
1552$ gpprofng display text -limit 10 -functions mxv.2.thr.er
1553@end smallexample
1554@end cartouche
1555
1556This produces the following familiar looking output:
1557
1558@smallexample
1559@verbatim
1560Print limit set to 10
1561Functions sorted by metric: Exclusive Total CPU Time
1562
1563Excl.     Incl.      Name
1564Total     Total
1565CPU sec.  CPU sec.
15662.268     2.268      <Total>
15672.155     2.155      mxv_core
15680.044     0.103      init_data
15690.030     0.046      erand48_r
15700.016     0.016      __drand48_iterate
15710.013     0.059      drand48
15720.008     0.011      _int_malloc
15730.003     0.003      brk
15740.        0.003      __default_morecore
15750.        0.114      __libc_start_main
1576@end verbatim
1577@end smallexample
1578
1579@c -- A new node --------------------------------------------------------------
1580@node       Commands Specific to Multithreading
1581@subsection Commands Specific to Multithreading
1582@c ----------------------------------------------------------------------------
1583
1584The function overview shown above shows the results aggregated over all the 
1585threads. The interesting new element is that we can also look at the 
1586performance data for the individual threads.
1587
1588@IndexSubentry{Commands, @code{thread_list}}
1589The @code{thread_list} command displays how many threads have been used:
1590
1591@cartouche
1592@smallexample
1593$ gprofng display text -thread_list mxv.2.thr.er
1594@end smallexample
1595@end cartouche
1596
1597This produces the following output, showing that three threads have
1598been used:
1599
1600@smallexample
1601@verbatim
1602Exp Sel Total
1603=== === =====
1604  1 all     3
1605@end verbatim
1606@end smallexample
1607
1608The output confirms there is one experiment and that by default all
1609threads are selected.
1610
1611It may seem surprising to see three threads here, since we used the 
1612@code{-t 2} option, but it is common for a Pthreads program to use one 
1613additional thread. This is typically the thread that runs from start to 
1614finish and handles the sequential portions of the code, as well as takes
1615care of managing the threads. 
1616
1617It is no different in our example code. At some point, the main thread 
1618creates and activates the two threads that perform the multiplication 
1619of the matrix with the vector. Upon completion of this computation,
1620the main thread continues.
1621
1622@IndexSubentry{Commands, @code{threads}}
1623The @code{threads} command is simple, yet very powerful. It shows the
1624total value of the metrics for each thread. To make it easier to 
1625interpret the data, we modify the metrics to include percentages:
1626
1627@cartouche
1628@smallexample
1629$ gprofng display text -metrics e.%totalcpu -threads mxv.2.thr.er
1630@end smallexample
1631@end cartouche
1632
1633The command above produces the following overview:
1634
1635@smallexample
1636@verbatim
1637Current metrics: e.%totalcpu:name
1638Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
1639Objects sorted by metric: Exclusive Total CPU Time
1640
1641Excl. Total    Name
1642CPU
1643 sec.      %
16442.258 100.00   <Total>
16451.075  47.59   Process 1, Thread 3
16461.070  47.37   Process 1, Thread 2
16470.114   5.03   Process 1, Thread 1
1648@end verbatim
1649@end smallexample
1650
1651The first line gives the total CPU time accumulated over the threads
1652selected. This is followed by the metric value(s) for each thread.
1653
1654From this it is clear that the main thread is responsible for 5% of
1655the total CPU time, while the other two threads take 47% each.
1656
1657This view is ideally suited to verify if there any load balancing
1658issues and also to find the most time consuming thread(s).
1659
1660@IndexSubentry{Filters, Thread selection}
1661While useful, often more information than this is needed. This is
1662@IndexSubentry{Commands, @code{thread_select}}
1663where the thread selection filter comes in. Through the @code{thread_select}
1664command, one or more threads may be selected 
1665(@xref{The Selection List} how to define the selection list).
1666
1667Since it is most common to use this command in a script, we do so as
1668well here. Below the script we are using:
1669
1670@cartouche
1671@smallexample
1672# Define the metrics
1673metrics e.%totalcpu
1674# Limit the output to 10 lines
1675limit 10
1676# Get the function overview for thread 1
1677thread_select 1
1678functions
1679# Get the function overview for thread 2
1680thread_select 2
1681functions
1682# Get the function overview for thread 3
1683thread_select 3
1684functions
1685@end smallexample
1686@end cartouche
1687
1688The definition of the metrics and the output limiter has been shown and
1689explained before and will be ignored. The new command we focus on is 
1690@IndexSubentry{Commands, @code{thread_select}}
1691@code{thread_select}.
1692
1693This command takes a list (@xref{The Selection List}) to select specific
1694threads. In this case we simply use the individual thread numbers that we
1695obtained with the @code{thread_list} command earlier.
1696
1697This restricts the output of the @code{functions} command to the thread
1698number(s) specified. This means that the script above shows which 
1699function(s) each thread executes and how much CPU time they consumed.
1700Both the timings and their percentages are given.
1701
1702This is the relevant part of the output for the first thread:
1703
1704@smallexample
1705@verbatim
1706# Get the function overview for thread 1
1707Exp Sel Total
1708=== === =====
1709  1 1       3
1710Functions sorted by metric: Exclusive Total CPU Time
1711
1712Excl. Total    Name
1713CPU
1714 sec.      %
17150.114 100.00   <Total>
17160.051  44.74   init_data
17170.028  24.56   erand48_r
17180.017  14.91   __drand48_iterate
17190.010   8.77   _int_malloc
17200.008   7.02   drand48
17210.      0.     __libc_start_main
17220.      0.     allocate_data
17230.      0.     main
17240.      0.     malloc
1725@end verbatim
1726@end smallexample
1727
1728As usual, the comment lines are echoed. This is followed by a confirmation
1729of our selection. We see that indeed thread 1 has been selected. What is
1730displayed next is the function overview for this particular thread. Due to
1731the @code{limit 10} command, there are ten entries in this list.
1732
1733Below are the overviews for threads 2 and 3 respectively. We see that all
1734of the CPU time is spent in function @code{mxv_core} and that this time
1735is approximately the same for both threads.
1736
1737@smallexample
1738@verbatim
1739# Get the function overview for thread 2
1740Exp Sel Total
1741=== === =====
1742  1 2       3
1743Functions sorted by metric: Exclusive Total CPU Time
1744
1745Excl. Total    Name
1746CPU
1747 sec.      %
17481.072 100.00   <Total>
17491.072 100.00   mxv_core
17500.      0.     collector_root
17510.      0.     driver_mxv
1752
1753# Get the function overview for thread 3
1754Exp Sel Total
1755=== === =====
1756  1 3       3
1757Functions sorted by metric: Exclusive Total CPU Time
1758
1759Excl. Total    Name
1760CPU
1761 sec.      %
17621.076 100.00   <Total>
17631.076 100.00   mxv_core
17640.      0.     collector_root
17650.      0.     driver_mxv
1766@end verbatim
1767@end smallexample
1768
1769When analyzing the performance of a multithreaded application, it is sometimes
1770useful to know whether threads have mostly executed on the same core, say, or
1771if they have wandered across multiple cores. This sort of stickiness is usually 
1772referred to as
1773@cindex Thread affinity
1774@emph{thread affinity}.
1775
1776Similar to the commands for the threads, there are several commands related 
1777to the usage of the cores, or @emph{CPUs} as they are called in @ToolName{}
1778(@xref{The Concept of a CPU in @ProductName{}}).
1779
1780In order to have some more interesting data to look at, we created a new
1781experiment, this time using 8 threads:
1782
1783@cartouche
1784@smallexample
1785$ exe=mxv-pthreads.exe
1786$ m=3000
1787$ n=2000
1788$ gprofng collect app -O mxv.8.thr.er ./$exe -m $m -n $n -t 8
1789@end smallexample
1790@end cartouche
1791
1792@IndexSubentry{Commands, @code{cpu_list}}
1793Similar to the @code{thread_list} command, the @code{cpu_list} command 
1794displays how many CPUs have been used. 
1795@IndexSubentry{Commands, @code{cpus}}
1796The equivalent of the @code{threads} threads command, is the @code{cpus} 
1797command, which shows the CPU numbers that were used and how much time was 
1798spent on each of them. Both are demonstrated below.
1799
1800@cartouche
1801@smallexample
1802$ gprofng display text -metrics e.%totalcpu -cpu_list -cpus mxv.8.thr.er
1803@end smallexample
1804@end cartouche
1805
1806This command produces the following output:
1807
1808@smallexample
1809@verbatim
1810Current metrics: e.%totalcpu:name
1811Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
1812Exp Sel Total
1813=== === =====
1814  1 all    10
1815Objects sorted by metric: Exclusive Total CPU Time
1816
1817Excl. Total    Name
1818CPU
1819 sec.      %
18202.310 100.00   <Total>
18210.286  12.39   CPU 7
18220.284  12.30   CPU 13
18230.282  12.21   CPU 5
18240.280  12.13   CPU 14
18250.266  11.52   CPU 9
18260.265  11.48   CPU 2
18270.264  11.44   CPU 11
18280.194   8.42   CPU 0
18290.114   4.92   CPU 1
18300.074   3.19   CPU 15
1831@end verbatim
1832@end smallexample
1833
1834@c ----------------------------------------------------------------------------
1835@c TBD - Ruud
1836@c I'd like to improve this and have a way to see where a thread has executed.
1837@c ----------------------------------------------------------------------------
1838
1839What we see in this table is that a total of 10 CPUs have been used. This is
1840followed by a list with all the CPU numbers that have been used during the 
1841run. For each CPU it is shown how much time was spent on it.
1842
1843While the table with thread times shown earlier may point at a load imbalance
1844in the application, this overview has a different purpose.
1845
1846For example, we see that 10 CPUs have been used, but we know that the 
1847application uses 9 threads only.
1848This means that at least one thread has executed on more than one CPU. In 
1849itself this is not something to worry about, but warrants a deeper 
1850investigation.
1851
1852Honesty dictates that next we performed a pre-analysis to find out 
1853which thread(s) have been running on more than one CPU. We found this 
1854to be thread 7. It has executed on CPUs 0 and 15.
1855
1856With this knowledge, we wrote the script shown below. It zooms in on
1857the behaviour of thread 7.
1858
1859@cartouche
1860@smallexample
1861# Define the metrics
1862metrics e.%totalcpu
1863# Limit the output to 10 lines
1864limit 10
1865functions
1866# Get the function overview for CPU 0
1867cpu_select 0
1868functions
1869# Get the function overview for CPU 15
1870cpu_select 15
1871functions
1872@end smallexample
1873@end cartouche
1874
1875From the earlier shown threads overview, we know that thread 7 has
1876used @code{0.268} seconds of CPU time..
1877
1878By selecting CPUs 0 and 15, respectively, we get the following
1879function overviews:
1880
1881@smallexample
1882@verbatim
1883# Get the function overview for CPU 0
1884Exp Sel Total
1885=== === =====
1886  1 0      10
1887Functions sorted by metric: Exclusive Total CPU Time
1888
1889Excl. Total    Name
1890CPU
1891 sec.      %
18920.194 100.00   <Total>
18930.194 100.00   mxv_core
18940.      0.     collector_root
18950.      0.     driver_mxv
1896
1897# Get the function overview for CPU 15
1898Exp Sel Total
1899=== === =====
1900  1 15     10
1901Functions sorted by metric: Exclusive Total CPU Time
1902
1903Excl. Total    Name
1904CPU
1905 sec.      %
19060.074 100.00   <Total>
19070.074 100.00   mxv_core
19080.      0.     collector_root
19090.      0.     driver_mxv
1910@end verbatim
1911@end smallexample
1912
1913This shows that thread 7 spent @code{0.194} seconds on CPU 0 and 
1914@code{0.074} seconds on CPU 15.
1915
1916@c -- A new node --------------------------------------------------------------
1917@node    Viewing Multiple Experiments
1918@section Viewing Multiple Experiments
1919@c ----------------------------------------------------------------------------
1920
1921One thing we did not cover sofar is that @ToolName{} fully supports the analysis
1922of multiple experiments. The @DisplayText{} tool accepts a list of experiments.
1923The data can either be aggregated across the experiments, or used in a 
1924comparison.
1925
1926Mention @code{experiment_list}
1927
1928@c -- A new node --------------------------------------------------------------
1929@node    Aggregation of Experiments
1930@subsection Aggregation of Experiments
1931@c ----------------------------------------------------------------------------
1932
1933By default, the data for multiple experiments is aggregrated and the display 
1934commands shows these combined results.
1935
1936For example, we can aggregate the data for our single and dual thread
1937experiments. Below is the script we used for this:
1938
1939@cartouche
1940@smallexample
1941# Define the metrics
1942metrics e.%totalcpu
1943# Limit the output to 10 lines
1944limit 10
1945# Get the list with experiments
1946experiment_list
1947# Get the function overview
1948functions
1949@end smallexample
1950@end cartouche
1951
1952@IndexSubentry{Commands, @code{experiment_list}}
1953With the exception of the @code{experiment_list} command, all commands
1954used have been discussed earlier.
1955
1956The @code{experiment_list} command provides a list of the experiments
1957that have been loaded. This is is used to verify we are looking at the
1958experiments we intend to aggregate.
1959
1960@cartouche
1961@smallexample
1962$ gprofng display text -script my-script-agg mxv.1.thr.er mxv.2.thr.er
1963@end smallexample
1964@end cartouche
1965
1966With the command above, we get the following output:
1967
1968@smallexample
1969@verbatim
1970# Define the metrics
1971Current metrics: e.%totalcpu:name
1972Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
1973# Limit the output to 10 lines
1974Print limit set to 10
1975# Get the list with experiments
1976ID Sel   PID Experiment
1977== === ===== ============
1978 1 yes 30591 mxv.1.thr.er
1979 2 yes 11629 mxv.2.thr.er
1980# Get the function overview
1981Functions sorted by metric: Exclusive Total CPU Time
1982
1983Excl. Total    Name
1984CPU
1985 sec.      %
19864.533 100.00   <Total>
19874.306  94.99   mxv_core
19880.105   2.31   init_data
19890.053   1.17   erand48_r
19900.027   0.59   __drand48_iterate
19910.021   0.46   _int_malloc
19920.021   0.46   drand48
19930.001   0.02   sysmalloc
19940.      0.     __libc_start_main
19950.      0.     allocate_data
1996@end verbatim
1997@end smallexample
1998
1999The first five lines should look familiar. The five lines following, echo
2000the comment line in the script and show the overview of the experiments.
2001This confirms two experiments have been loaded and that both are active.
2002
2003This is followed by the function overview. The timings have been summed
2004up and the percentages are adjusted accordingly. For example, the total
2005accumulated time is indeed 2.272 + 2.261 = 4.533 seconds.
2006
2007@c -- A new node --------------------------------------------------------------
2008@node       Comparison of Experiments
2009@subsection Comparison of Experiments
2010@c ----------------------------------------------------------------------------
2011
2012The support for multiple experiments really shines in comparison mode. This
2013feature is enabled through the command 
2014@IndexSubentry{Commands, @code{compare on/off}}
2015@code{compare on} 
2016and is disabled
2017by setting 
2018@code{compare off}.
2019
2020@cindex Compare experiments
2021In comparison mode, the data for the various experiments is shown side by
2022side, as illustrated below where we compare the results for the multithreaded
2023experiments using one and two threads respectively:
2024
2025@cartouche
2026@smallexample
2027$ gprofng display text -compare on -functions mxv.1.thr.er mxv.2.thr.er
2028@end smallexample
2029@end cartouche
2030
2031@noindent
2032This produces the following output:
2033
2034@smallexample
2035@verbatim
2036Functions sorted by metric: Exclusive Total CPU Time
2037
2038mxv.1.thr.er  mxv.2.thr.er  mxv.1.thr.er  mxv.2.thr.er
2039Excl. Total   Excl. Total   Incl. Total   Incl. Total    Name
2040CPU           CPU           CPU           CPU
2041 sec.          sec.          sec.          sec.
20422.272         2.261         2.272         2.261          <Total>
20432.159         2.148         2.159         2.148          mxv_core
20440.054         0.051         0.102         0.104          init_data
20450.025         0.028         0.035         0.045          erand48_r
20460.013         0.008         0.048         0.053          drand48
20470.011         0.010         0.012         0.010          _int_malloc
20480.010         0.017         0.010         0.017          __drand48_iterate
20490.001         0.            0.001         0.             sysmalloc
20500.            0.            0.114         0.114          __libc_start_main
20510.            0.            0.011         0.010          allocate_data
20520.            0.            0.001         0.             check_results
20530.            0.            2.159         2.148          collector_root
20540.            0.            2.159         2.148          driver_mxv
20550.            0.            0.114         0.114          main
20560.            0.            0.012         0.010          malloc
2057@end verbatim
2058@end smallexample
2059
2060This table is already helpful to more easily compare (two) profiles, but 
2061there is more that we can do here. 
2062
2063By default, in comparison mode, all measured values are shown. Often 
2064profiling is about comparing performance data. It is therefore
2065more useful to look at differences, or ratios, using one experiment as 
2066a reference. 
2067
2068The values shown are relative to this difference. For example if a ratio
2069is below one, it means the reference value was higher. 
2070
2071@IndexSubentry{Commands, @code{compare on/off}}
2072This feature is supported on the @code{compare} command. In addition to @code{on},
2073or @code{off}, this command also supports 
2074@IndexSubentry{Commands, @code{compare delta}}
2075@code{delta}, or 
2076@IndexSubentry{Commands, @code{compare ratio}}
2077@code{ratio}.
2078
2079Usage of one of these two keywords enables the comparison feature and shows
2080either the difference, or the ratio, relative to the reference data.
2081
2082In the example below, we use the same two experiments used in the comparison
2083above, but as before, the number of lines is restricted to 10 and we focus on 
2084the exclusive timings plus percentages. For the comparison part we are 
2085interested in the differences.
2086
2087This is the script that produces such an overview:
2088
2089@cartouche
2090@smallexample
2091# Define the metrics
2092metrics e.%totalcpu
2093# Limit the output to 10 lines
2094limit 10
2095# Set the comparison mode to differences
2096compare delta
2097# Get the function overview
2098functions
2099@end smallexample
2100@end cartouche
2101
2102Assuming this script file is called @code{my-script-comp}, this is how we
2103get the table displayed on our screen:
2104
2105@cartouche
2106@smallexample
2107$ gprofng display text -script my-script-comp mxv.1.thr.er mxv.2.thr.er
2108@end smallexample
2109@end cartouche
2110
2111Leaving out some of the lines printed, but we have seen before, we get 
2112the following table:
2113
2114@smallexample
2115@verbatim
2116mxv.1.thr.er  mxv.2.thr.er
2117Excl. Total   Excl. Total     Name
2118CPU           CPU
2119 sec.      %   delta      %
21202.272 100.00  -0.011 100.00   <Total>
21212.159  95.00  -0.011  94.97   mxv_core
21220.054   2.37  -0.003   2.25   init_data
21230.025   1.10  +0.003   1.23   erand48_r
21240.013   0.57  -0.005   0.35   drand48
21250.011   0.48  -0.001   0.44   _int_malloc
21260.010   0.44  +0.007   0.75   __drand48_iterate
21270.001   0.04  -0.001   0.     sysmalloc
21280.      0.    +0.      0.     __libc_start_main
21290.      0.    +0.      0.     allocate_data
2130@end verbatim
2131@end smallexample
2132
2133It is now easy to see that the CPU times for the most time consuming
2134functions in this code are practically the same. 
2135
2136While in this case we used the delta as a comparison,
2137
2138Note that the comparison feature is supported at the function, source, and 
2139disassembly level. There is no practical limit on the number of experiments
2140that can be used in a comparison.
2141
2142
2143
2144@c -- A new node --------------------------------------------------------------
2145@node    Profile Hardware Event Counters
2146@section Profile Hardware Event Counters
2147@c ----------------------------------------------------------------------------
2148
2149Many processors provide a set of hardware event counters and @ToolName{}
2150provides support for this feature.
2151@xref{Hardware Event Counters Explained} for those readers that are not 
2152familiar with such counters and like to learn more.
2153
2154In this section we explain how to get the details on the event counter
2155support for the processor used in the experiment(s), and show several
2156examples.
2157
2158@c -- A new node --------------------------------------------------------------
2159@node       Getting Information on the Counters Supported
2160@subsection Getting Information on the Counters Supported
2161@c ----------------------------------------------------------------------------
2162
2163The first step is to check if the processor used for the experiments is 
2164supported by @ToolName{}.
2165
2166@IndexSubentry{Options, @code{-h}}
2167The @code{-h} option on @CollectApp{} will show the event counter
2168information:
2169
2170@cartouche
2171@smallexample
2172$ gprofng collect app -h
2173@end smallexample
2174@end cartouche
2175
2176In case the counters are supported, a list with the events is printed. 
2177Otherwise, a warning message will be issued. 
2178
2179For example, below we show this command and the output on an Intel Xeon 
2180Platinum 8167M (aka ``Skylake'') processor. The output has been split
2181into several sections and each section is commented upon separately.
2182
2183@smallexample
2184@verbatim
2185Run "gprofng collect app --help" for a usage message.
2186
2187Specifying HW counters on `Intel Arch PerfMon v2 on Family 6 Model 85' 
2188(cpuver=2499):
2189
2190  -h {auto|lo|on|hi}
2191	turn on default set of HW counters at the specified rate
2192  -h <ctr_def> [-h <ctr_def>]...
2193  -h <ctr_def>[,<ctr_def>]...
2194	specify HW counter profiling for up to 4 HW counters
2195@end verbatim
2196@end smallexample
2197
2198The first line shows how to get a usage overview. This is followed by
2199some information on the target processor.
2200
2201The next five lines explain in what ways the @code{-h} option can be 
2202used to define the events to be monitored.
2203
2204The first version shown above enables a default set of counters. This
2205default depends on the processor this command is executed on. The
2206keyword following the @code{-h} option defines the sampling rate:
2207
2208@table @code
2209
2210@item auto
2211Match the sample rate of used by clock profiling. If the latter is disabled,
2212Use a per thread sampling rate of approximately 100 samples per second.
2213This setting is the default and preferred.
2214
2215@item on
2216Use a per thread sampling rate of approximately 100 samples per second.
2217
2218@item lo
2219Use a per thread sampling rate of approximately 10 samples per second.
2220
2221@item hi
2222Use a per thread sampling rate of approximately 1000 samples per second.
2223
2224@end table 
2225
2226The second and third variant define the events to be monitored. Note
2227that the number of simultaneous events supported is printed. In this
2228case we can monitor four events in a single profiling job.
2229
2230It is a matter of preference whether you like to use the @code{-h}
2231option for each event, or use it once, followed by a comma separated
2232list.
2233
2234There is one slight catch though. The counter definition below has 
2235mandatory comma (@code{,}) between the event and the rate. While a 
2236default can be used for the rate, the comma cannot be omitted. 
2237This may result in a somewhat awkward counter definition in case
2238the default sampling rate is used.
2239
2240For example, the following two commands are equivalent. Note
2241the double comma in the second command. This is not a typo.
2242
2243@cartouche
2244@smallexample
2245$ gprofng collect app -h cycles -h insts ... 
2246$ gprofng collect app -h cycles,,insts ... 
2247@end smallexample
2248@end cartouche
2249
2250In the first command this comma is not needed, because a 
2251comma (``@code{,}'') immediately followed by white space may 
2252be omitted.
2253
2254This is why we prefer the this syntax and in the remainder will 
2255use the first version of this command.
2256
2257@IndexSubentry{Hardware event counters, counter definition}
2258The counter definition takes an event name, plus optionally one or
2259more attributes, followed by a comma, and optionally the sampling rate.
2260The output section below shows the formal definition.
2261
2262@cartouche
2263@smallexample
2264  <ctr_def> == <ctr>[[~<attr>=<val>]...],[<rate>]
2265@end smallexample
2266@end cartouche
2267
2268The printed help then explains this syntax. Below we have summarized
2269and expanded this output:
2270
2271@table @code
2272
2273@item <ctr>
2274The counter name must be selected from the available counters listed
2275as part of the output printed with the @code{-h} option.
2276On most systems, if a counter is not listed, it may still be specified 
2277by its numeric value.
2278
2279@item ~<attr>=<val>
2280This is an optional attribute that depends on the processor. The list
2281of supported attributes is printed in the output. Examples of 
2282attributes are ``user'', or ``system''. The value can given in decimal
2283or hexadecimal format.
2284Multiple attributes may be specified, and each must be preceded 
2285by a ~.
2286
2287@item <rate>
2288
2289The sampling rate is one of the following:
2290
2291@table @code
2292
2293@item auto
2294This is the default and matches the rate used by clock profiling.
2295If clock profiling is disabled, use @code{on}.
2296
2297@item on
2298Set the per thread maximum sampling rate to ~100 samples/second
2299
2300@item lo
2301Set the per thread maximum sampling rate to ~10 samples/second
2302
2303@item hi
2304Set the per thread maximum sampling rate to ~1000 samples/second
2305
2306@item <interval>
2307Define the sampling interval. 
2308@xref{Control the Sampling Frequency} how to define this.
2309
2310@end table
2311
2312@end table
2313
2314After the section with the formal definition of events and counters, a
2315processor specific list is displayed. This part starts with an overview
2316of the default set of counters and the aliased names supported 
2317@emph{on this specific processor}.
2318
2319@smallexample
2320@verbatim
2321Default set of HW counters:
2322
2323    -h cycles,,insts,,llm
2324
2325Aliases for most useful HW counters:
2326
2327 alias    raw name                   type units regs description
2328
2329 cycles   unhalted-core-cycles   CPU-cycles 0123 CPU Cycles
2330 insts    instruction-retired        events 0123 Instructions Executed
2331 llm      llc-misses                 events 0123 Last-Level Cache Misses
2332 br_msp   branch-misses-retired      events 0123 Branch Mispredict
2333 br_ins   branch-instruction-retired events 0123 Branch Instructions
2334@end verbatim
2335@end smallexample
2336
2337The definitions given above may or may not be available on other processors,
2338but we try to maximize the overlap across alias sets.
2339
2340The table above shows the default set of counters defined for this processor,
2341and the aliases. For each alias the full ``raw'' name is given, plus the
2342unit of the number returned by the counter (CPU cycles, or a raw count), 
2343the hardware counter the event is allowed to be mapped onto, and a short 
2344description.
2345
2346The last part of the output contains all the events that can be monitored:
2347
2348@smallexample
2349@verbatim
2350Raw HW counters:
2351
2352    name                                type      units regs description
2353
2354    unhalted-core-cycles                     CPU-cycles 0123
2355    unhalted-reference-cycles                    events 0123
2356    instruction-retired                          events 0123
2357    llc-reference                                events 0123
2358    llc-misses                                   events 0123
2359    branch-instruction-retired                   events 0123
2360    branch-misses-retired                        events 0123
2361    ld_blocks.store_forward                      events 0123
2362    ld_blocks.no_sr                              events 0123
2363    ld_blocks_partial.address_alias              events 0123
2364    dtlb_load_misses.miss_causes_a_walk          events 0123
2365    dtlb_load_misses.walk_completed_4k           events 0123
2366
2367    <many lines deleted>
2368
2369    l2_lines_out.silent                          events 0123
2370    l2_lines_out.non_silent                      events 0123
2371    l2_lines_out.useless_hwpf                    events 0123
2372    sq_misc.split_lock                           events 0123
2373
2374See Chapter 19 of the "Intel 64 and IA-32 Architectures Software
2375Developer's Manual Volume 3B: System Programming Guide"
2376@end verbatim
2377@end smallexample
2378
2379As can be seen, these names are not always easy to correlate to a specific
2380event of interest. The processor manual should provide more clarity on this.
2381
2382@c -- A new node --------------------------------------------------------------
2383@node       Examples Using Hardware Event Counters
2384@subsection Examples Using Hardware Event Counters
2385@c ----------------------------------------------------------------------------
2386
2387The previous section may give the impression that these counters are hard to
2388use, but as we will show now, in practice it is quite simple.
2389
2390With the information from the @code{-h} option, we can easily set up our first 
2391event counter experiment.
2392
2393We start by using the default set of counters defined for our processor and we
2394use 2 threads:
2395
2396@cartouche
2397@smallexample
2398$ exe=mxv-pthreads.exe
2399$ m=3000
2400$ n=2000
2401$ exp=mxv.hwc.def.2.thr.er
2402$ gprofng collect app -O $exp -h auto ./$exe -m $m -n $n -t 2
2403@end smallexample
2404@end cartouche
2405
2406@IndexSubentry{Options, @code{-h}}
2407@IndexSubentry{Hardware event counters, @code{auto} option}
2408The new option here is @code{-h auto}. The @code{auto} keyword enables 
2409hardware event counter profiling and selects the default set of counters 
2410defined for this processor.
2411
2412As before, we can display the information, but there is one practical hurdle
2413to take. Unless we like to view all metrics recorded, we would need to know
2414the names of the events that have been enabled. This is tedious and also not
2415portable in case we would like to repeat this experiment on another processor.
2416
2417@IndexSubentry{Hardware event counters, @code{hwc} metric}
2418This is where the special @code{hwc} metric comes very handy. It 
2419automatically expands to the active set of events used.
2420
2421With this, it is very easy to display the event counter values. Note that
2422although the regular clock based profiling was enabled, we only want to see 
2423the counter values. We also request to see the percentages and limit the
2424output to the first 5 lines:
2425
2426@cartouche
2427@smallexample
2428$ exp=mxv.hwc.def.2.thr.er
2429$ gprofng display text -metrics e.%hwc -limit 5 -functions $exp
2430@end smallexample
2431@end cartouche
2432
2433@smallexample
2434@verbatim
2435Current metrics: e.%cycles:e+%insts:e+%llm:name
2436Current Sort Metric: Exclusive CPU Cycles ( e.%cycles )
2437Print limit set to 5
2438Functions sorted by metric: Exclusive CPU Cycles
2439
2440Excl. CPU     Excl. Instructions  Excl. Last-Level   Name
2441Cycles        Executed            Cache Misses
2442 sec.      %                  %                 %
24432.691 100.00  7906475309 100.00   122658983 100.00   <Total>
24442.598  96.54  7432724378  94.01   121745696  99.26   mxv_core
24450.035   1.31   188860269   2.39       70084   0.06   erand48_r
24460.026   0.95    73623396   0.93      763116   0.62   init_data
24470.018   0.66    76824434   0.97       40040   0.03   drand48
2448@end verbatim
2449@end smallexample
2450
2451As we have seen before, the first few lines echo the settings.
2452This includes a list with the hardware event counters used by
2453default.
2454
2455The table that follows makes it very easy to get an overview where the 
2456time is spent and how many of the target events have occurred.
2457
2458As before, we can drill down deeper and see the same metrics at the source
2459line and instruction level. Other than using @code{hwc} in the metrics
2460definitions, nothing has changed compared to the previous examples:
2461
2462@cartouche
2463@smallexample
2464$ exp=mxv.hwc.def.2.thr.er
2465$ gprofng display text -metrics e.hwc -source mxv_core $exp
2466@end smallexample
2467@end cartouche
2468
2469This is the relevant part of the output. Since the lines get very long,
2470we have somewhat modified the lay-out:
2471
2472@smallexample
2473@verbatim
2474   Excl. CPU Excl.        Excl.
2475   Cycles    Instructions Last-Level
2476    sec.     Executed     Cache Misses
2477                                         <Function: mxv_core>
2478   0.                 0          0   32. void __attribute__ ((noinline)) 
2479                                         mxv_core(...)
2480   0.                 0          0   33. {
2481   0.                 0          0   34.   for (uint64_t i=...) {
2482   0.                 0          0   35.     double row_sum = 0.0;
2483## 1.872     7291879319   88150571   36.     for (int64_t j=0; j<n; j++)
2484   0.725      140845059   33595125   37.        row_sum += A[i][j]*b[j];
2485   0.                 0          0   38.     c[i] = row_sum;
2486                                     39.    }
2487   0.                 0          0   40. }
2488@end verbatim
2489@end smallexample
2490
2491In a smiliar way we can display the event counter values at the instruction
2492level. Again we have modified the lay-out due to page width limitations:
2493
2494@cartouche
2495@smallexample
2496$ exp=mxv.hwc.def.2.thr.er
2497$ gprofng display text -metrics e.hwc -disasm mxv_core $exp
2498@end smallexample
2499@end cartouche
2500
2501@smallexample
2502@verbatim
2503   Excl. CPU Excl.        Excl.
2504   Cycles    Instructions Last-Level
2505    sec.     Executed     Cache Misses
2506                                                <Function: mxv_core>
2507   0.                 0          0  [33] 4021ba: mov   0x8(%rsp),%r10
2508                                    34.   for (uint64_t i=...) {
2509   0.                 0          0  [34] 4021bf: cmp   %rsi,%rdi
2510   0.                 0          0  [34] 4021c2: jbe   0x37
2511   0.                 0          0  [34] 4021c4: ret
2512                                    35.       double row_sum = 0.0;
2513                                    36.       for (int64_t j=0; j<n; j++)
2514                                    37.         row_sum += A[i][j]*b[j];
2515   0.                 0          0  [37] 4021c5: mov   (%r8,%rdi,8),%rdx
2516   0.                 0          0  [36] 4021c9: mov   $0x0,%eax
2517   0.                 0          0  [35] 4021ce: pxor  %xmm1,%xmm1
2518   0.002       12804230     321394  [37] 4021d2: movsd (%rdx,%rax,8),%xmm0
2519   0.141       60819025    3866677  [37] 4021d7: mulsd (%r9,%rax,8),%xmm0
2520   0.582       67221804   29407054  [37] 4021dd: addsd %xmm0,%xmm1
2521## 1.871     7279075109   87989870  [36] 4021e1: add   $0x1,%rax
2522   0.002       12804210      80351  [36] 4021e5: cmp   %rax,%rcx
2523   0.                 0          0  [36] 4021e8: jne   0xffffffffffffffea
2524                                    38.       c[i] = row_sum;
2525   0.                 0          0  [38] 4021ea: movsd %xmm1,(%r10,%rdi,8)
2526   0.                 0          0  [34] 4021f0: add   $0x1,%rdi
2527   0.                 0          0  [34] 4021f4: cmp   %rdi,%rsi
2528   0.                 0          0  [34] 4021f7: jb    0xd
2529   0.                 0          0  [35] 4021f9: pxor  %xmm1,%xmm1
2530   0.                 0          0  [36] 4021fd: test  %rcx,%rcx
2531   0.                 0      80350  [36] 402200: jne   0xffffffffffffffc5
2532   0.                 0          0  [36] 402202: jmp   0xffffffffffffffe8
2533                                    39.   }
2534                                    40. }
2535   0.                 0          0  [40]  402204:  ret
2536@end verbatim
2537@end smallexample
2538
2539So far we have used the default settings for the event counters. It is
2540quite straightforward to select specific counters. For sake of the
2541example, let's assume we would like to count how many branch instructions
2542and retired memory load instructions that missed in the L1 cache have been
2543executed. We also want to count these events with a high resolution.
2544
2545This is the command to do so:
2546
2547@cartouche
2548@smallexample
2549$ exe=mxv-pthreads.exe
2550$ m=3000
2551$ n=2000
2552$ exp=mxv.hwc.sel.2.thr.er
2553$ hwc1=br_ins,hi
2554$ hwc2=mem_load_retired.l1_miss,hi
2555$ gprofng collect app -O $exp -h $hwc1 -h $hwc2 $exe -m $m -n $n -t 2
2556@end smallexample
2557@end cartouche
2558
2559As before, we get a table with the event counts. Due to the very
2560long name for the second counter, we have somewhat modified the
2561output.
2562
2563@cartouche
2564@smallexample
2565$ gprofng display text -limit 10 -functions mxv.hwc.sel.2.thr.er
2566@end smallexample
2567@end cartouche
2568
2569@smallexample
2570@verbatim
2571Functions sorted by metric: Exclusive Total CPU Time
2572Excl.     Incl.     Excl. Branch  Excl.                 Name
2573Total     Total     Instructions  mem_load_retired.l1_miss
2574CPU sec.  CPU sec.                Events
25752.597     2.597     1305305319    4021340               <Total>
25762.481     2.481     1233233242    3982327               mxv_core
25770.040     0.107       19019012       9003               init_data
25780.028     0.052       23023048      15006               erand48_r
25790.024     0.024       19019008       9004               __drand48_iterate
25800.015     0.067       11011009       2998               drand48
25810.008     0.010              0       3002               _int_malloc
25820.001     0.001              0          0               brk
25830.001     0.002              0          0               sysmalloc
25840.        0.001              0          0               __default_morecore
2585@end verbatim
2586@end smallexample
2587
2588@IndexSubentry{Commands, @code{compare ratio}}
2589When using event counters, the values could be very large and it is not easy
2590to compare the numbers. As we will show next, the @code{ratio} feature is
2591very useful when comparing such profiles.
2592
2593To demonstrate this, we have set up another event counter experiment where
2594we would like to compare the number of last level cache miss and the number
2595of branch instructions executed when using a single thread, or two threads.
2596
2597These are the commands used to generate the experiment directories:
2598
2599@cartouche
2600@smallexample
2601$ exe=./mxv-pthreads.exe
2602$ m=3000
2603$ n=2000
2604$ exp1=mxv.hwc.comp.1.thr.er
2605$ exp2=mxv.hwc.comp.2.thr.er
2606$ gprofng collect app -O $exp1 -h llm -h br_ins $exe -m $m -n $n -t 1
2607$ gprofng collect app -O $exp2 -h llm -h br_ins $exe -m $m -n $n -t 2
2608@end smallexample
2609@end cartouche
2610
2611The following script has been used to get the tables. Due to lay-out
2612restrictions, we have to create two tables, one for each counter.
2613
2614@cartouche
2615@smallexample
2616# Limit the output to 5 lines
2617limit 5
2618# Define the metrics
2619metrics name:e.llm
2620# Set the comparison to ratio
2621compare ratio
2622functions
2623# Define the metrics
2624metrics name:e.br_ins
2625# Set the comparison to ratio
2626compare ratio
2627functions
2628@end smallexample
2629@end cartouche
2630
2631Note that we print the name of the function first, followed by the counter 
2632data.
2633The new element is that we set the comparison mode to @code{ratio}. This
2634divides the data in a column by its counterpart in the reference experiment.
2635
2636This is the command using this script and the two experiment directories as 
2637input:
2638
2639@cartouche
2640@smallexample
2641$ gprofng display text -script my-script-comp-counters \
2642  mxv.hwc.comp.1.thr.er \
2643  mxv.hwc.comp.2.thr.er
2644@end smallexample
2645@end cartouche
2646
2647By design, we get two tables, one for each counter:
2648
2649@smallexample
2650@verbatim
2651Functions sorted by metric: Exclusive Last-Level Cache Misses
2652
2653                              mxv.hwc.comp.1.thr.er  mxv.hwc.comp.2.thr.er
2654Name                          Excl. Last-Level       Excl. Last-Level
2655                              Cache Misses           Cache Misses
2656                                                         ratio
2657 <Total>                      122709276              x   0.788
2658 mxv_core                     121796001              x   0.787
2659 init_data                       723064              x   1.055
2660 erand48_r                       100111              x   0.500
2661 drand48                          60065              x   1.167
2662
2663Functions sorted by metric: Exclusive Branch Instructions
2664
2665                              mxv.hwc.comp.1.thr.er  mxv.hwc.comp.2.thr.er
2666Name                          Excl. Branch           Excl. Branch
2667                              Instructions           Instructions
2668                                                       ratio
2669 <Total>                      1307307316             x 0.997
2670 mxv_core                     1235235239             x 0.997
2671 erand48_r                      23023033             x 0.957
2672 drand48                        20020009             x 0.600
2673 __drand48_iterate              17017028             x 0.882
2674@end verbatim
2675@end smallexample
2676
2677A ratio less than one in the second column, means that this counter
2678value was smaller than the value from the reference experiment shown
2679in the first column.
2680
2681This kind of presentation of the results makes it much easier to 
2682quickly interpret the data.
2683
2684We conclude this section with thread-level event counter overviews,
2685but before we go into this, there is an important metric we need to
2686mention.
2687
2688@IndexSubentry{Hardware event counters, IPC}
2689In case it is known how many instructions and CPU cycles have been executed,
2690the value for the IPC (``Instructions Per Clockycle'') can be computed. 
2691@xref{Hardware Event Counters Explained}.
2692This is a derived metric that gives an indication how well the processor
2693is utilized. The inverse of the IPC is called CPI.
2694
2695The @DisplayText{} command automatically computes the IPC and CPI values
2696if an experiment contains the event counter values for the instructions
2697and CPU cycles executed. These are part of the metric list and can be
2698displayed, just like any other metric.
2699
2700@IndexSubentry{Commands, @code{metric_list}}
2701This can be verified through the @code{metric_list} command. If we go
2702back to our earlier experiment with the default event counters, we get
2703the following result.
2704
2705@cartouche
2706@smallexample
2707$ gprofng display text -metric_list mxv.hwc.def.2.thr.er
2708@end smallexample
2709@end cartouche
2710
2711@smallexample
2712@verbatim
2713Current metrics: e.totalcpu:i.totalcpu:e.cycles:e+insts:e+llm:name
2714Current Sort Metric: Exclusive Total CPU Time ( e.totalcpu )
2715Available metrics:
2716         Exclusive Total CPU Time: e.%totalcpu
2717         Inclusive Total CPU Time: i.%totalcpu
2718             Exclusive CPU Cycles: e.+%cycles
2719             Inclusive CPU Cycles: i.+%cycles
2720  Exclusive Instructions Executed: e+%insts
2721  Inclusive Instructions Executed: i+%insts
2722Exclusive Last-Level Cache Misses: e+%llm
2723Inclusive Last-Level Cache Misses: i+%llm
2724 Exclusive Instructions Per Cycle: e+IPC
2725 Inclusive Instructions Per Cycle: i+IPC
2726 Exclusive Cycles Per Instruction: e+CPI
2727 Inclusive Cycles Per Instruction: i+CPI
2728                             Size: size
2729                       PC Address: address
2730                             Name: name
2731@end verbatim
2732@end smallexample
2733
2734Among the other metrics, we see the new metrics for the IPC and CPI 
2735listed.
2736
2737In the script below, we use this information and add the IPC and CPI 
2738to the metrics to be displayed. We also use a the thread filter to 
2739display these values for the individual threads.
2740
2741This is the complete script we have used. Other than a different selection
2742of the metrics, there are no new features.
2743
2744@cartouche
2745@smallexample
2746# Define the metrics
2747metrics e.insts:e.%cycles:e.IPC:e.CPI
2748# Sort with respect to cycles
2749sort e.cycles
2750# Limit the output to 5 lines
2751limit 5
2752# Get the function overview for all threads
2753functions
2754# Get the function overview for thread 1
2755thread_select 1
2756functions
2757# Get the function overview for thread 2
2758thread_select 2
2759functions
2760# Get the function overview for thread 3
2761thread_select 3
2762functions
2763@end smallexample
2764@end cartouche
2765
2766In the metrics definition on the second line, we explicitly request the 
2767counter values for the instructions (@code{e.insts}) and CPU cycles 
2768(@code{e.cycles}) executed. These names can be found in output from the
2769@code{metric_list} commad above.
2770In addition to these metrics, we also request the IPC and CPI to be shown.
2771
2772As before, we used the @code{limit} command to control the number of 
2773functions displayed. We then request an overview for all the threads,
2774followed by three sets of two commands to select a thread and display the 
2775function overview.
2776
2777The script above is used as follows:
2778
2779@cartouche
2780@smallexample
2781$ gprofng display text -script my-script-ipc mxv.hwc.def.2.thr.er
2782@end smallexample
2783@end cartouche
2784
2785This script produces four tables. We list them separately below,
2786and have left out the additional output.
2787
2788The first table shows the accumulated values across the three
2789threads that have been active.
2790
2791@smallexample
2792@verbatim
2793Functions sorted by metric: Exclusive CPU Cycles
2794
2795Excl.         Excl. CPU     Excl.  Excl.   Name
2796Instructions  Cycles        IPC    CPI
2797Executed       sec.      %
27987906475309    2.691 100.00  1.473  0.679   <Total>
27997432724378    2.598  96.54  1.434  0.697   mxv_core
2800 188860269    0.035   1.31  2.682  0.373   erand48_r
2801  73623396    0.026   0.95  1.438  0.696   init_data
2802  76824434    0.018   0.66  2.182  0.458   drand48
2803@end verbatim
2804@end smallexample
2805
2806This shows that IPC of this program is completely dominated
2807by function @code{mxv_core}. It has a fairly low IPC value
2808of 1.43.
2809
2810The next table is for thread 1 and shows the values for the 
2811main thread.
2812
2813@smallexample
2814@verbatim
2815Exp Sel Total
2816=== === =====
2817  1 1       3
2818Functions sorted by metric: Exclusive CPU Cycles
2819
2820Excl.         Excl. CPU     Excl.  Excl.   Name
2821Instructions  Cycles        IPC    CPI
2822Executed       sec.      %
2823473750931     0.093 100.00  2.552  0.392   <Total>
2824188860269     0.035  37.93  2.682  0.373   erand48_r
2825 73623396     0.026  27.59  1.438  0.696   init_data
2826 76824434     0.018  18.97  2.182  0.458   drand48
2827134442832     0.013  13.79  5.250  0.190   __drand48_iterate
2828@end verbatim
2829@end smallexample
2830
2831Although this thread hardly uses any CPU cycles, the overall IPC 
2832of 2.55 is not all that bad.
2833
2834Last, we show the tables for threads 2 and 3:
2835
2836@smallexample
2837@verbatim
2838Exp Sel Total
2839=== === =====
2840  1 2       3
2841Functions sorted by metric: Exclusive CPU Cycles
2842
2843Excl.         Excl. CPU     Excl.  Excl.   Name
2844Instructions  Cycles        IPC    CPI
2845Executed       sec.      %
28463716362189    1.298 100.00  1.435  0.697   <Total>
28473716362189    1.298 100.00  1.435  0.697   mxv_core
2848         0    0.      0.    0.     0.      collector_root
2849         0    0.      0.    0.     0.      driver_mxv
2850
2851Exp Sel Total
2852=== === =====
2853  1 3       3
2854Functions sorted by metric: Exclusive CPU Cycles
2855
2856Excl.         Excl. CPU     Excl.  Excl.   Name
2857Instructions  Cycles        IPC    CPI
2858Executed       sec.      %
28593716362189    1.300 100.00  1.433  0.698   <Total>
28603716362189    1.300 100.00  1.433  0.698   mxv_core
2861         0    0.      0.    0.     0.      collector_root
2862         0    0.      0.    0.     0.      driver_mxv
2863@end verbatim
2864@end smallexample
2865
2866It is seen that both execute the same number of instructions and
2867take about the same number of CPU cycles. As a result, the IPC is
2868the same for both threads.
2869
2870@c -- A new node --------------------------------------------------------------
2871@c TBD @node    Additional Features
2872@c TBD @section Additional Features
2873@c ----------------------------------------------------------------------------
2874
2875@c -- A new node --------------------------------------------------------------
2876@c TBD @node    More Filtering Capabilities
2877@c TBD @subsection More Filtering Capabilities
2878@c ----------------------------------------------------------------------------
2879
2880@c TBD Cover @code{samples} and @code{seconds}
2881
2882@c -- A new node --------------------------------------------------------------
2883@node    Java Profiling 
2884@section Java Profiling
2885@c ----------------------------------------------------------------------------
2886
2887@IndexSubentry{Java profiling, @code{-j on/off}}
2888The @CollectApp{} command supports Java profiling. The @code{-j on} option
2889can be used for this, but since this feature is enabled by default, there is 
2890no need to set this explicitly. Java profiling may be disabled through the 
2891@code{-j off} option.
2892
2893The program is compiled as usual and the experiment directory is created 
2894similar to what we have seen before. The only difference with a C/C++
2895application is that the program has to be explicitly executed by java.
2896
2897For example, this is how to generate the experiment data for a Java
2898program that has the source code stored in file @code{Pi.java}:
2899
2900@cartouche
2901@smallexample
2902$ javac Pi.java
2903$ gprofng collect app -j on -O pi.demo.er java Pi < pi.in
2904@end smallexample
2905@end cartouche
2906
2907Regarding which java is selected to generate the data, @ToolName{} 
2908first looks for the JDK in the path set in either the 
2909@IndexSubentry{Java profiling, @code{JDK_HOME}}
2910@code{JDK_HOME} environment variable, or in the
2911@IndexSubentry{Java profiling, @code{JAVA_PATH}}
2912@code{JAVA_PATH} environment variable. If neither of these variables is 
2913set, it checks for a JDK in the search path (set in the PATH
2914environment variable). If there is no JDK in this path, it checks for 
2915the java executable in @code{/usr/java/bin/java}.
2916
2917In case additional options need to be passed on to the JVM, the 
2918@IndexSubentry{Java profiling, @code{-J <string>}}
2919@code{-J <string>} option can be used. The string with the
2920option(s) has to be delimited by quotation marks in case
2921there is more than one argument.
2922
2923The @DisplayText{} command may be used to view the performance data. There is
2924no need for any special options and the same commands as previously discussed
2925are supported.
2926
2927@IndexSubentry{Commands, @code{viewmode}}
2928@IndexSubentry{Java profiling, different view modes}
2929The @code{viewmode} command 
2930@xref{The Viewmode}
2931is very useful to examine the call stacks. 
2932
2933For example, this is how one can see the native call stacks. For
2934lay-out purposes we have restricted the list to the first five entries:
2935
2936@cartouche
2937@smallexample
2938$ gprofng display text -limit 5 -viewmode machine -calltree pi.demo.er 
2939@end smallexample
2940@end cartouche
2941
2942@smallexample
2943@verbatim
2944Print limit set to 5
2945Viewmode set to machine
2946Functions Call Tree. Metric: Attributed Total CPU Time
2947
2948Attr.      Name
2949Total
2950CPU sec.
29511.381      +-<Total>
29521.171        +-Pi.calculatePi(double)
29530.110        +-collector_root
29540.110        |  +-JavaMain
29550.070        |    +-jni_CallStaticVoidMethod
2956@end verbatim
2957@end smallexample
2958
2959@noindent
2960Note that the selection of the viewmode is echoed in the output.
2961
2962@c -- A new node --------------------------------------------------------------
2963@c TBD @node Summary of Options and Commands
2964@c TBD @chapter Summary of Options and Commands
2965@c ----------------------------------------------------------------------------
2966
2967@c -- A new node --------------------------------------------------------------
2968@node    Terminology
2969@chapter Terminology
2970
2971Throughout this manual, certain terminology specific to profiling tools, 
2972or @ToolName{}, or even to this document only, is used. In this chapter we 
2973explain this terminology in detail.
2974
2975@menu
2976* The Program Counter::                    What is a Program Counter?
2977* Inclusive and Exclusive Metrics::        An explanation of inclusive and exclusive metrics.
2978* Metric Definitions::                     Definitions associated with metrics.
2979* The Viewmode::                           Select the way call stacks are presented.
2980* The Selection List::                     How to define a selection.
2981* Load Objects and Functions::             The components in an application.
2982* The Concept of a CPU in @ProductName{}:: The definition of a CPU.
2983* Hardware Event Counters Explained::      What are event counters?
2984* apath::                                  Our generic definition of a path.
2985@end menu
2986
2987@c ----------------------------------------------------------------------------
2988@node    The Program Counter
2989@section The Program Counter
2990@c ----------------------------------------------------------------------------
2991
2992@cindex PC
2993@cindex Program Counter
2994The @emph{Program Counter}, or PC for short, keeps track where program execution is.
2995The address of the next instruction to be executed is stored in a special
2996purpose register in the processor, or core.
2997
2998@cindex Instruction pointer
2999The PC is sometimes also referred to as the @emph{instruction pointer}, but
3000we will use Program Counter or PC throughout this document.
3001
3002@c ----------------------------------------------------------------------------
3003@node    Inclusive and Exclusive Metrics
3004@section Inclusive and Exclusive Metrics
3005@c ----------------------------------------------------------------------------
3006
3007In the remainder, these two concepts occur quite often and for lack of a better
3008place, they are explained here.
3009
3010@cindex Inclusive metric
3011The @emph{inclusive} value for a metric includes all values that are part of
3012the dynamic extent of the target function. For example if function @code{A}
3013calls functions @code{B} and @code{C}, the inclusive CPU time for @code{A} 
3014includes the CPU time spent in @code{B} and @code{C}.
3015
3016@cindex Exclusive metric
3017In contrast with this, the @emph{exclusive} value for a metric is computed
3018by excluding the metric values used by other functions called. In our imaginary
3019example, the exclusive CPU time for function @code{A} is the time spent outside
3020calling functions @code{B} and @code{C}.
3021
3022@cindex Leaf function
3023In case of a @emph{leaf function}, the inclusive and exclusive values for the 
3024metric are the same since by definition, it is not calling any other 
3025function(s).
3026
3027Why do we use these two different values? The inclusive metric shows the most
3028expensive path, in terms of this metric, in the application. For example, if
3029the metric is cache misses, the function with the highest inclusive metric
3030tells you where most of the cache misses come from.
3031
3032Within this branch of the application, the exclusive metric points to the
3033functions that contribute and help to identify which part(s) to consider
3034for further analysis.
3035
3036@c ----------------------------------------------------------------------------
3037@node    Metric Definitions
3038@section Metric Definitions
3039@c ----------------------------------------------------------------------------
3040The metrics to be shown are highly customizable. In this section we explain 
3041the definitions associated with metrics.
3042
3043@IndexSubentry{Commands, @code{metrics}}
3044The @code{metrics} command takes a colon (:) separated list with special
3045keywords. This keyword consists of the following three fields: 
3046@code{<flavor>}@code{<visibility>}@code{<metric_name>}.
3047
3048@cindex Flavor field
3049@cindex Visibility field
3050@cindex Metric name field
3051The @emph{<flavor>} field is either an @code{e} for ``exclusive'', or @code{i}
3052for ``inclusive''. The @code{<metric_name>} field is the name of the metric
3053request. The @emph{<visibility>} field consists of one ore more characters
3054from the following table:
3055
3056@table @code
3057
3058@item .
3059Show the metric as time. This applies to timing metrics and hardware event counters
3060that measure cycles. Interpret as @code{+} for other metrics.
3061
3062@item %
3063Show the metric as a percentage of the total value for this metric.
3064
3065@item +
3066Show the metric as an absolute value. For hardware event counters this is
3067the event count. Interpret as @code{.} for timing metrics.
3068
3069@item |
3070Do not show any metric value. Cannot be used with other visibility characters.
3071
3072@end table 
3073
3074@c ----------------------------------------------------------------------------
3075@node    The Viewmode
3076@section The Viewmode
3077
3078@cindex Viewmode
3079@IndexSubentry{Commands, @code{viewmode}}
3080
3081There are different ways to view a call stack in Java. In @ToolName{}, this
3082is called the @emph{viewmode} and the setting is controlled through a command
3083with the same name.
3084
3085The @code{viewmode} command takes one of the following keywords:
3086
3087@table @code
3088
3089@item user
3090This is the default and shows the Java call stacks for Java threads.
3091No call stacks for any housekeeping threads are shown. The function 
3092list contains a function 
3093@IndexSubentry{Java profiling, @code{<JVM-System>}}
3094@code{<JVM-System>} that represents the aggregated time from non-Java
3095threads.
3096When the JVM software does not report a Java call stack, time is reported
3097against the function 
3098@IndexSubentry{Java profiling, @code{<no Java callstack recorded>}}
3099@code{<no Java callstack recorded>}.
3100
3101
3102@item expert
3103Show the Java call stacks for Java threads when the Java code from the
3104user is executed and machine call stacks when JVM code is executed, or 
3105when the JVM software does not report a Java call stack. 
3106Show the machine call stacks for housekeeping threads.
3107
3108@item machine
3109Show the actual native call stacks for all threads.
3110
3111@end table
3112
3113@c ----------------------------------------------------------------------------
3114@c ----------------------------------------------------------------------------
3115@node    The Selection List
3116@section The Selection List
3117@c ----------------------------------------------------------------------------
3118
3119@cindex Selection list
3120@cindex List specification
3121Several commands allow the user to specify a subset of a list. For example,
3122to select specific threads from all the threads that have been used when 
3123conducting the experiment(s). 
3124
3125Such a selection list (or ``list'' in the remainder of this section) can be a 
3126single number, a contiguous range of numbers with the start and end numbers 
3127separated by a hyphen (@code{-}), a comma-separated list of numbers and 
3128ranges, or the @code{all} keyword. Lists must not contain spaces. 
3129
3130Each list can optionally be preceded by an experiment list with a similar 
3131format, separated from the list by a colon (:). 
3132If no experiment list is included, the list applies to all experiments.
3133
3134Multiple lists can be concatenated by separating the individual lists 
3135by a plus sign.
3136
3137These are some examples of various filters using a list:
3138
3139@table @code
3140
3141@item thread_select 1
3142Select thread 1 from all experiments.
3143
3144@item thread_select all:1
3145Select thread 1 from all experiments.
3146
3147@item thread_select 1:1+2:2
3148Select thread 1 from experiment 1 and thread 2 from experiment 2.
3149
3150@item cpu_select all:1,3,5
3151Selects cores 1, 3, and 5 from all experiments.
3152
3153@item cpu_select 1,2:all
3154Select all cores from experiments 1 and 2, as listed by the @code{by exp_list} command.
3155
3156@end table
3157
3158@c ----------------------------------------------------------------------------
3159@node    Load Objects and Functions
3160@section Load Objects and Functions
3161@c ----------------------------------------------------------------------------
3162
3163An application consists of various components. The source code files are 
3164compiled into object files. These are then glued together at link time to form
3165the executable. 
3166During execution, the program may also dynamically load objects.
3167
3168@cindex Load object 
3169A @emph{load object} is defined to be an executable, or shared object. A shared
3170library is an example of a load object in @ToolName{}.
3171
3172Each load object, contains a text section with the instructions generated by the 
3173compiler, a data section for data, and various symbol tables.
3174All load objects must contain an 
3175@cindex ELF
3176ELF 
3177symbol table, which gives the names and addresses of all the globally known 
3178functions in that object. 
3179
3180Load objects compiled with the -g option contain additional symbolic information 
3181that can augment the ELF symbol table and provide information about functions that 
3182are not global, additional information about object modules from which the functions 
3183came, and line number information relating addresses to source lines.
3184
3185The term
3186@cindex Function
3187@emph{function}
3188is used to describe a set of instructions that represent a high-level operation 
3189described in the source code. The term also covers methods as used in C++ and in
3190the Java programming language. 
3191
3192In the @ToolName{} context, functions are provided in source code format. 
3193Normally their names appear in the symbol table representing a set of addresses. 
3194@cindex Program Counter
3195@cindex PC
3196If the Program Counter (PC) is within that set, the program is executing within that function.
3197
3198In principle, any address within the text segment of a load object can be mapped to a 
3199function. Exactly the same mapping is used for the leaf PC and all the other PCs on the 
3200call stack. 
3201
3202Most of the functions correspond directly to the source model of the program, but 
3203there are exceptions. This topic is however outside of the scope of this guide.
3204
3205@c ----------------------------------------------------------------------------
3206@node    The Concept of a CPU in @ProductName{}
3207@section The Concept of a CPU in @ProductName{}
3208@c ----------------------------------------------------------------------------
3209
3210@cindex CPU
3211In @ProductName{}, there is the concept of a CPU. Admittedly, this is not the
3212best word to describe what is meant here and may be replaced in the future.
3213
3214The word CPU is used in many of the displays.
3215In the context of @ProductName{}, it is meant to denote a part of the 
3216processor that is capable of executing instructions and with its own state, 
3217like the program counter.
3218
3219For example, on a contemporary processor, a CPU could be a core. In case
3220hardware threads are supported within a core, it could be one of those
3221hardware threads.
3222
3223@c ----------------------------------------------------------------------------
3224@node    Hardware Event Counters Explained
3225@section Hardware Event Counters Explained
3226@c ----------------------------------------------------------------------------
3227
3228@IndexSubentry{Hardware event counters, description}
3229For quite a number of years now, many microprocessors have supported hardware 
3230event counters. 
3231
3232On the hardware side, this means that in the processor there are one or more 
3233registers dedicated to count certain activities, or ``events''.
3234Examples of such events are the number of instructions executed, or the number 
3235of cache misses at level 2 in the memory hierarchy.
3236
3237While there is a limited set of such registers, the user can map events onto
3238them. In case more than one register is available, this allows for the 
3239simultaenous measurement of various events.
3240
3241A simple, yet powerful, example is to simultaneously count the number of CPU 
3242cycles and the number of instructions excuted. These two numbers can then be
3243used to compute the 
3244@cindex IPC
3245@emph{IPC} value. IPC stands for ``Instructions Per Clockcycle'' and each processor 
3246has a maximum. For example, if this maximum number is 2, it means the 
3247processor is capable of executing two instructions every clock cycle.
3248
3249Whether this is actually achieved, depends on several factors, including the
3250instruction characteristics.
3251However, in case the IPC value is well below this maximum in a time critical 
3252part of the application and this cannot be easily explained, further 
3253investigation is probably warranted.
3254
3255@cindex CPI
3256A related metric is called @emph{CPI}, or ``Clockcycles Per Instruction''.
3257It is the inverse of the CPI and can be compared against the theoretical
3258value(s) of the target instruction(s). A significant difference may point
3259at a bottleneck.
3260
3261One thing to keep in mind is that the value returned by a counter can either
3262be the number of times the event occured, or a CPU cycle count. In case of 
3263the latter it is possible to convert this number to time.
3264
3265@IndexSubentry{Hardware event counters, variable CPU frequency}
3266This is often easier to interpret than a simple count, but there is one
3267caveat to keep in mind. The CPU frequency may not have been constant while
3268the experimen was recorded and this impacts the time reported.
3269
3270These event counters, or ``counters'' for short, provide great insight into
3271what happens deep inside the processor. In case higher level information does
3272not provide the insight needed, the counters provide the information to get 
3273to the bottom of a performance problem.
3274
3275There are some things to consider though. 
3276
3277@itemize @bullet
3278
3279@item
3280The event definitions and names vary across processors and it may even happen 
3281that some events change with an update.
3282Unfortunately and this is luckily rare, there are sometimes bugs causing the 
3283wrong count to be returned.
3284
3285@IndexSubentry{Hardware event counters, alias name}
3286In @ToolName{}, some of the processor specific event names have an alias 
3287name. For example @code{insts} measures the instructions executed. 
3288These aliases not only makes it easier to identify the functionality, but also 
3289provide portability of certain events across processors.
3290
3291@item
3292Another complexity is that there are typically many events one can monitor.
3293There may up to hundreds of events available and it could require several
3294experiments to zoom in on the root cause of a performance problem.
3295
3296@item
3297There may be restrictions regarding the mapping of event(s) onto the 
3298counters. For example, certain events may be restricted to specific 
3299counters only. As a result, one may have to conduct additional experiments
3300to cover all the events of interest.
3301
3302@item
3303The names of the events may also not be easy to interpret. In such cases,
3304the description can be found in the architecture manual for the processor.
3305
3306@end itemize
3307
3308Despite these drawbacks, hardware event counters are extremely useful and
3309may even turn out to be indispensable.
3310
3311@c ----------------------------------------------------------------------------
3312@node    apath
3313@section What is <apath>?
3314@c ----------------------------------------------------------------------------
3315
3316In most cases, @ToolName{} shows the absolute pathnames of directories. These
3317tend to be rather long, causing display issues in this document. 
3318
3319Instead of wrapping these long pathnames over multiple lines, we decided to 
3320represent them by the @code{<apath>} symbol, which stands for ``an absolute 
3321pathname''.
3322
3323Note that different occurrences of @code{<apath>} may represent different
3324absolute pathnames.
3325
3326@c -- A new node --------------------------------------------------------------
3327@node    Other Document Formats
3328@chapter Other Document Formats
3329@c ----------------------------------------------------------------------------
3330
3331This document is written in Texinfo and the source text is made available as
3332part of the binutils distribution. The file name is @code{gprofng.texi} and
3333can be found in subdirectory @code{doc} under directory @code{gprofng} in the 
3334top level directory.
3335
3336This file can be used to generate the document in the @code{info}, @code{html}, 
3337and @code{pdf} formats.
3338The default installation procedure creates a file in the @code{info} format and 
3339stores it in the documentation section of binutils.
3340
3341The probably easiest way to generate a different format from this Texinfo 
3342document is to go to the distribution directory that was created when the 
3343tools were built.
3344This is either the default distribution directory, or the one that has been set
3345with the @code{--prefix} option as part of the @code{configure} command.
3346In this example we symbolize this location with @code{<dist>}.
3347
3348The make file called @code{Makefile} in directory @code{<dist>/gprofng/doc}
3349supports several commands to generate this document in different formats. 
3350We recommend to use these commands.
3351
3352They create the file(s) and install it in the documentation directory of binutils,
3353which is @code{<dist>/share/doc} in case @code{html} or @code{pdf} is selected and
3354@code{<dist>/share/info} for the file in the @code{info} format.
3355
3356To generate this document in the requested format and install it in the documentation 
3357directory, the commands below should be executed. In this notation, @code{<format>} 
3358is one of @code{info}, @code{html}, or @code{pdf}:
3359
3360@smallexample
3361@verbatim
3362$ cd <dist>/gprofng/doc
3363$ make install-<format>
3364@end verbatim
3365@end smallexample
3366
3367@noindent
3368Some things to note:
3369
3370@itemize
3371
3372@item
3373For the @code{pdf} file to be generated, the 
3374@cindex TeX
3375TeX document formatting software is required and the relevant commmands need
3376to be included in the search path. An example of a popular TeX implementation 
3377is @emph{TexLive}. It is beyond the scope of this document to go into the
3378details of installing and using TeX, but it is well documented elsewhere.
3379
3380@item
3381Instead of generating a single file in the @code{html} format, it is also 
3382possible to create a directory with individual files for the various chapters. 
3383To do so, remove the use of @code{--no-split} in variable @code{MAKEINFOHTML}
3384in the make file in the @code{doc} directory.
3385
3386@item
3387The make file also supports commands to only generate the file in the desired
3388format and not move them to the documentation directory. This is
3389accomplished through the @code{make <format>} command.
3390
3391@end itemize
3392
3393@ifnothtml
3394@node       Index
3395@unnumbered Index
3396@printindex cp
3397@end ifnothtml
3398
3399@bye
3400