<?xml version='1.0'?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
 "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"
[ ]>
<!-- profile_mode.xml revision 1.1 -->

<chapter id="manual.ext.profile_mode" xreflabel="Profile Mode">
<?dbhtml filename="profile_mode.html"?>

<chapterinfo>
  <keywordset>
    <keyword>
      C++
    </keyword>
    <keyword>
      library
    </keyword>
    <keyword>
      profile
    </keyword>
  </keywordset>
</chapterinfo>

<title>Profile Mode</title>


<sect1 id="manual.ext.profile_mode.intro" xreflabel="Intro">
  <title>Intro</title>
  <para>
  <emphasis>Goal: </emphasis>Give performance improvement advice based on
  recognition of suboptimal usage patterns of the standard library.
  </para>

  <para>
  <emphasis>Method: </emphasis>Wrap the standard library code.  Insert
  calls to an instrumentation library to record the internal state of
  various components at interesting entry/exit points to/from the standard
  library.  Process the trace, recognize suboptimal patterns, and give advice.
  For details, see the
  <ulink url="http://dx.doi.org/10.1109/CGO.2009.36">paper presented at
   CGO 2009</ulink>.
  </para>
  <para>
  <emphasis>Strengths: </emphasis>
<itemizedlist>
  <listitem><para>
  Unintrusive solution.  The application code does not require any
  modification.
  </para></listitem>
  <listitem><para> The advice is call context sensitive, and can thus
  pinpoint interesting dynamic performance behavior precisely.
  </para></listitem>
  <listitem><para>
  The overhead model is pay-per-use: when you turn off a diagnostic class
  at compile time, its overhead disappears.
  </para></listitem>
</itemizedlist>
  </para>
  <para>
  <emphasis>Drawbacks: </emphasis>
<itemizedlist>
  <listitem><para>
  You must recompile the application code with custom options.
  </para></listitem>
  <listitem><para>You must run the application on representative input.
  The advice is input dependent.
  </para></listitem>
  <listitem><para>
  The execution time will increase, in some cases severalfold.
  </para></listitem>
</itemizedlist>
  </para>


<sect2 id="manual.ext.profile_mode.using" xreflabel="Using">
  <title>Using the Profile Mode</title>

  <para>
  This is the anticipated common workflow for program <code>foo.cc</code>:
<programlisting>
$ cat foo.cc
#include &lt;vector&gt;
int main() {
  std::vector&lt;int&gt; v;
  for (int k = 0; k &lt; 1024; ++k) v.insert(v.begin(), k);
}

$ g++ -D_GLIBCXX_PROFILE foo.cc
$ ./a.out
$ cat libstdcxx-profile.txt
vector-to-list: improvement = 5: call stack = 0x804842c ...
    : advice = change std::vector to std::list
vector-size: improvement = 3: call stack = 0x804842c ...
    : advice = change initial container size from 0 to 1024
</programlisting>
  </para>

  <para>
  Anatomy of a warning:
  <itemizedlist>
  <listitem>
  <para>
  Warning id.  This is a short descriptive string for the class
  that this warning belongs to.  E.g., "vector-to-list".
  </para>
  </listitem>
  <listitem>
  <para>
  Estimated improvement.  This is an approximation of the benefit expected
  from implementing the change suggested by the warning.  It is given on
  a log10 scale.  Negative values mean that the alternative would actually
  do worse than the current choice.
  In the example above, 5 comes from the fact that the overhead of
  inserting at the beginning of a vector vs. a list is around 1024 * 1024 / 2,
  which is about 10^5.  The improvement from setting the initial size to
  1024 is in the range of 10^3, since the overhead of dynamic resizing is
  linear in this case.
  </para>
  </listitem>
  <listitem>
  <para>
  Call stack.  Currently, the addresses are printed without
  symbol name or code location attribution.
  Users are expected to postprocess the output using, for instance, addr2line.
  </para>
  </listitem>
  <listitem>
  <para>
  The warning message.  For some warnings, this is static text, e.g.,
  "change vector to list".  For other warnings, such as the one above,
  the message contains numeric advice, e.g., the suggested initial size
  of the vector.
  </para>
  </listitem>
  </itemizedlist>
  </para>
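  <para>
  As a sketch of acting on the second warning above, reserving the final
  size up front avoids the repeated reallocation that the "vector-size"
  diagnostic detects.  This is an illustration written for this manual, not
  output of the profile mode; the value 1024 is taken from the advice text:
<programlisting>
#include &lt;vector&gt;
#include &lt;cstddef&gt;
#include &lt;cassert&gt;

int main() {
  std::vector&lt;int&gt; v;
  v.reserve(1024);              // apply "change initial container size to 1024"
  std::size_t cap = v.capacity();
  for (int k = 0; k &lt; 1024; ++k)
    v.push_back(k);
  assert(v.capacity() == cap);  // no reallocation took place
  return 0;
}
</programlisting>
  </para>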

  <para>Three files are generated.  <code>libstdcxx-profile.txt</code>
   contains human readable advice.  <code>libstdcxx-profile.raw</code>
   contains implementation specific data about each diagnostic.
   The raw format is not documented, but it is sufficient to generate
   all the advice given in <code>libstdcxx-profile.txt</code>.  The advantage
   of keeping this raw format is that traces from multiple executions can
   be aggregated simply by concatenating the raw traces.  We intend to
   offer an external utility program that can issue advice from a trace.
   <code>libstdcxx-profile.conf.out</code> lists the actual diagnostic
   parameters used.  To alter parameters, edit this file and rename it to
   <code>libstdcxx-profile.conf</code>.
  </para>

  <para>Advice is given regardless of whether the transformation is valid.
  For instance, we advise changing a <code>map</code> to an
  <code>unordered_map</code> even if the
  application semantics require that data be ordered.
  We believe such warnings can help users understand the performance
  behavior of their application better, which can lead to changes
  at a higher abstraction level.
  </para>
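  <para>
  A minimal illustration of why such advice can be semantically invalid:
  the code below relies on <code>std::map</code>'s sorted iteration order,
  so mechanically switching to <code>std::unordered_map</code> would change
  the program's behavior even if lookups became faster.  (This example is
  written for this manual, not generated by the tool.)
<programlisting>
#include &lt;map&gt;
#include &lt;string&gt;
#include &lt;cassert&gt;

int main() {
  std::map&lt;std::string, int&gt; m{{"b", 2}, {"a", 1}, {"c", 3}};
  // Sorted traversal: guaranteed by map, but not by unordered_map.
  assert(m.begin()-&gt;first == "a");
  return 0;
}
</programlisting>
  </para>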

</sect2>

<sect2 id="manual.ext.profile_mode.tuning" xreflabel="Tuning">
  <title>Tuning the Profile Mode</title>

  <para>Compile time switches and environment variables (see also file
   <code>profiler.h</code>).  Unless specified otherwise, they can be set at
   compile time using <code>-D_&lt;name&gt;</code> or by setting variable
   <code>&lt;name&gt;</code>
   in the environment where the program is run, before starting execution.
  <itemizedlist>
  <listitem><para>
   <code>_GLIBCXX_PROFILE_NO_&lt;diagnostic&gt;</code>:
   disable specific diagnostics.
   See section Diagnostics for possible values.
   (Environment variables not supported.)
   </para></listitem>
  <listitem><para>
   <code>_GLIBCXX_PROFILE_TRACE_PATH_ROOT</code>: set an alternative root
   path for the output files.
   </para></listitem>
  <listitem><para><code>_GLIBCXX_PROFILE_MAX_WARN_COUNT</code>: set it to
   the maximum number of warnings desired.  The default value is
   10.</para></listitem>
  <listitem><para>
   <code>_GLIBCXX_PROFILE_MAX_STACK_DEPTH</code>: if set to 0,
   the advice will
   be collected and reported for the program as a whole, and not for each
   call context.
   This could also be used in continuous regression tests, where you
   just need to know whether there is a regression or not.
   The default value is 32.
   </para></listitem>
  <listitem><para>
   <code>_GLIBCXX_PROFILE_MEM_PER_DIAGNOSTIC</code>:
   set a limit on how much memory to use for the accounting tables for each
   diagnostic type.  When this limit is reached, new events are ignored
   until the memory usage drops below the limit.  Generally, this means
   that newly created containers will not be instrumented until some
   live containers are deleted.  The default is 128 MB.
   </para></listitem>
  <listitem><para>
   <code>_GLIBCXX_PROFILE_NO_THREADS</code>:
   Make the library not use threads.  If thread local storage (TLS) is not
   available, you will get a preprocessor error asking you to set
   <code>-D_GLIBCXX_PROFILE_NO_THREADS</code> if your program is
   single-threaded.  Multithreaded execution without TLS is not supported.
   (Environment variable not supported.)
   </para></listitem>
  <listitem><para>
   <code>_GLIBCXX_HAVE_EXECINFO_H</code>:
   This name should be defined automatically at library configuration time.
   If your library was configured without <code>execinfo.h</code>, but
   you have it in your include path, you can define it explicitly.  Without
   it, advice is collected for the program as a whole, and not for each
   call context.
   (Environment variable not supported.)
   </para></listitem>
  </itemizedlist>
  </para>

</sect2>

</sect1>


<sect1 id="manual.ext.profile_mode.design" xreflabel="Design">
  <title>Design</title>

<para>
</para>
<table frame='all'>
<title>Profile Code Location</title>
<tgroup cols='2' align='left' colsep='1' rowsep='1'>
<colspec colname='c1'></colspec>
<colspec colname='c2'></colspec>

<thead>
  <row>
    <entry>Code Location</entry>
    <entry>Use</entry>
  </row>
</thead>
<tbody>
  <row>
    <entry><code>libstdc++-v3/include/std/*</code></entry>
    <entry>Preprocessor code to redirect to profile extension headers.</entry>
  </row>
  <row>
    <entry><code>libstdc++-v3/include/profile/*</code></entry>
    <entry>Profile extension public headers (map, vector, ...).</entry>
  </row>
  <row>
    <entry><code>libstdc++-v3/include/profile/impl/*</code></entry>
    <entry>Profile extension internals.  Implementation files are
     only included from <code>impl/profiler.h</code>, which is the only
     file included from the public headers.</entry>
  </row>
</tbody>
</tgroup>
</table>

<para>
</para>

<sect2 id="manual.ext.profile_mode.design.wrapper"
 xreflabel="Wrapper">
<title>Wrapper Model</title>
  <para>
  In order to get our instrumented library version included instead of the
  release one,
  we use the same wrapper model as the debug mode.
  We subclass entities from the release version.  Wherever
  <code>_GLIBCXX_PROFILE</code> is defined, the release namespace is
  <code>std::__norm</code>, whereas the profile namespace is
  <code>std::__profile</code>.  Using plain <code>std</code> translates
  into <code>std::__profile</code>.
  </para>
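  <para>
  A much simplified, self-contained sketch of this wrapper pattern follows.
  The <code>demo</code> namespace and the hook comment are illustrative; the
  real wrappers live in <code>include/profile/*</code>:
<programlisting>
#include &lt;vector&gt;
#include &lt;cassert&gt;

namespace demo {
namespace __norm { using std::vector; }     // stands in for the release version
namespace __profile {
  template&lt;typename _Tp&gt;
  class vector : public __norm::vector&lt;_Tp&gt; {
  public:
    void push_back(const _Tp&amp; __x) {
      // An instrumentation hook would be invoked here.
      __norm::vector&lt;_Tp&gt;::push_back(__x);
    }
  };
}
}

int main() {
  demo::__profile::vector&lt;int&gt; v;   // used as if it were the plain container
  v.push_back(42);
  assert(v[0] == 42);
  return 0;
}
</programlisting>
  </para>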
  <para>
  Whenever possible, we try to wrap at the public interface level, e.g.,
  in <code>unordered_set</code> rather than in <code>hashtable</code>,
  in order not to depend on the implementation.
  </para>
  <para>
  Mixing object files built with and without the profile mode must
  not affect the program execution.  However, there are no guarantees of
  the accuracy of diagnostics when even a single object is not built with
  <code>-D_GLIBCXX_PROFILE</code>.
  Currently, mixing the profile mode with the debug and parallel extensions is
  not allowed.  Mixing them at compile time will result in preprocessor errors.
  Mixing them at link time is undefined.
  </para>
</sect2>


<sect2 id="manual.ext.profile_mode.design.instrumentation"
 xreflabel="Instrumentation">
<title>Instrumentation</title>
  <para>
  Instead of instrumenting every public entry and exit point,
  we chose to add instrumentation on demand, as needed
  by individual diagnostics.
  The main reason is that some diagnostics require us to extract bits of
  internal state that are specific to that diagnostic.
  We plan to formalize this later, after we learn more about the requirements
  of several diagnostics.
  </para>
304  All the instrumentation points can be switched on and off using
305  <code>-D[_NO]_GLIBCXX_PROFILE_&lt;diagnostic&gt;</code> options.
306  With all the instrumentation calls off, there should be negligible
307  overhead over the release version.  This property is needed to support
308  diagnostics based on timing of internal operations.  For such diagnostics,
309  we anticipate turning most of the instrumentation off in order to prevent
310  profiling overhead from polluting time measurements, and thus diagnostics.
311  </para>
312  <para>
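  <para>
  The on/off mechanism boils down to conditional macro definitions of the
  following shape.  This is a sketch only: <code>__profcxx_example_hook</code>
  is a made-up name for illustration, not one of the real hooks in
  <code>profiler.h</code>:
<programlisting>
#include &lt;cstdio&gt;

#ifndef _GLIBCXX_PROFILE_NO_EXAMPLE
# define __profcxx_example_hook(__n) std::printf("event %d\n", (__n))
#else
# define __profcxx_example_hook(__n)  // expands to nothing: zero overhead
#endif

int main() {
  __profcxx_example_hook(7);   // compiled out entirely when disabled
  return 0;
}
</programlisting>
  </para>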
313  All the instrumentation on/off compile time switches live in
314  <code>include/profile/profiler.h</code>.
315  </para>
316</sect2>
317
318
319<sect2 id="manual.ext.profile_mode.design.rtlib"
320 xreflabel="Run Time Behavior">
321<title>Run Time Behavior</title>
322  <para>
323  For practical reasons, the instrumentation library processes the trace
324  partially
325  rather than dumping it to disk in raw form.  Each event is processed when
326  it occurs.  It is usually attached a cost and it is aggregated into
327  the database of a specific diagnostic class.  The cost model
328  is based largely on the standard performance guarantees, but in some
329  cases we use knowledge about GCC's standard library implementation.
330  </para>
331  <para>
332  Information is indexed by (1) call stack and (2) instance id or address
333  to be able to understand and summarize precise creation-use-destruction
334  dynamic chains.  Although the analysis is sensitive to dynamic instances,
335  the reports are only sensitive to call context.  Whenever a dynamic instance
336  is destroyed, we accumulate its effect to the corresponding entry for the
337  call stack of its constructor location.
338  </para>
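  <para>
  This indexing scheme can be pictured with the following toy model (all
  names are illustrative, not the library's internals): per-instance costs
  are kept while an object is alive and folded into the entry for its
  construction call stack when it dies:
<programlisting>
#include &lt;map&gt;
#include &lt;utility&gt;
#include &lt;cassert&gt;

using stack_id = int;                             // stands in for a hashed call stack
std::map&lt;int, std::pair&lt;stack_id, long&gt; &gt; live;   // instance -&gt; (stack, cost)
std::map&lt;stack_id, long&gt; per_stack;               // reporting granularity

void on_construct(int inst, stack_id s) { live[inst] = std::make_pair(s, 0L); }
void on_op(int inst, long cost)         { live[inst].second += cost; }
void on_destroy(int inst) {
  per_stack[live[inst].first] += live[inst].second;  // accumulate at ctor stack
  live.erase(inst);
}

int main() {
  on_construct(1, 42); on_op(1, 10);
  on_construct(2, 42); on_op(2, 5);   // same construction site, new instance
  on_destroy(1); on_destroy(2);
  assert(per_stack[42] == 15);        // report is per call context
  return 0;
}
</programlisting>
  </para>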

  <para>
  For details, see
   <ulink url="http://dx.doi.org/10.1109/CGO.2009.36">paper presented at
   CGO 2009</ulink>.
  </para>
</sect2>


<sect2 id="manual.ext.profile_mode.design.analysis"
 xreflabel="Analysis and Diagnostics">
<title>Analysis and Diagnostics</title>
  <para>
  Final analysis takes place offline, and it is based entirely on the
  generated trace and debugging info in the application binary.
  See section Diagnostics for a list of analysis types that we plan to support.
  </para>
  <para>
  The input to the analysis is a table indexed by profile type and call stack.
  The data type for each entry depends on the profile type.
  </para>
</sect2>


<sect2 id="manual.ext.profile_mode.design.cost-model"
 xreflabel="Cost Model">
<title>Cost Model</title>
  <para>
  While cost models are likely to become complex as we get into
  more sophisticated analysis, we will try to follow a simple set of rules
  at the beginning.
  </para>
<itemizedlist>
  <listitem><para><emphasis>Relative benefit estimation:</emphasis>
  The idea is to estimate or measure the cost of all operations
  in the original scenario versus the scenario we advise switching to.
  For instance, when advising to change a vector to a list, an occurrence
  of the <code>insert</code> method will generally count as a benefit.
  Its magnitude depends on (1) the number of elements that get shifted
  and (2) whether it triggers a reallocation.
  </para></listitem>
  <listitem><para><emphasis>Synthetic measurements:</emphasis>
  We will measure the relative difference between similar operations on
  different containers.  We plan to write a battery of small tests that
  compare the execution times of similar methods on different
  containers.  The idea is to run these tests on the target machine.
  If this training phase is very quick, we may decide to perform it at
  library initialization time.  The results can be cached on disk and reused
  across runs.
  </para></listitem>
  <listitem><para><emphasis>Timers:</emphasis>
  We plan to use timers for operations of larger granularity, such as sort.
  For instance, we can switch between different sort methods on the fly
  and report the one that performs best for each call context.
  </para></listitem>
  <listitem><para><emphasis>Show stoppers:</emphasis>
  We may decide that the presence of an operation nullifies the advice.
  For instance, when considering switching from <code>set</code> to
  <code>unordered_set</code>, if we detect use of operator <code>++</code>,
  we will simply not issue the advice, since this could signal that the use
  case requires a sorted container.</para></listitem>
</itemizedlist>
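  <para>
  The relative benefit estimate feeds directly into the log10 improvement
  figure shown in the reports.  The arithmetic behind the
  <code>improvement = 5</code> value from the introductory example can be
  reproduced as follows:
<programlisting>
#include &lt;cmath&gt;
#include &lt;cassert&gt;

int main() {
  // Inserting 1024 elements at the front of a vector shifts about
  // 1024 * 1024 / 2 elements in total; a list insert shifts none.
  double shifted = 1024.0 * 1024.0 / 2.0;
  int improvement = static_cast&lt;int&gt;(std::log10(shifted));
  assert(improvement == 5);
  return 0;
}
</programlisting>
  </para>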

</sect2>


<sect2 id="manual.ext.profile_mode.design.reports"
 xreflabel="Reports">
<title>Reports</title>
  <para>
There are two types of reports.  First, if we recognize a pattern for which
we have a substitute that is likely to give better performance, we print
the advice and the estimated performance gain.  The advice is usually
associated with a code position and possibly a call stack.
  </para>
  <para>
Second, we report performance characteristics for which we do not have
a clear solution for improvement.  For instance, we can point the user
to the top 10 <code>multimap</code> locations
which have the worst data locality in actual traversals.
Although this does not offer a solution,
it helps the user focus on the key problems and ignore the uninteresting ones.
  </para>
</sect2>


<sect2 id="manual.ext.profile_mode.design.testing"
 xreflabel="Testing">
<title>Testing</title>
  <para>
  First, we want to make sure we preserve the behavior of the release mode.
  You can just type <code>make check-profile</code>, which
  builds and runs the whole test suite in profile mode.
  </para>
  <para>
  Second, we want to test the correctness of each diagnostic.
  We created a <code>profile</code> directory in the test suite.
  Each diagnostic must come with at least two tests, one for false positives
  and one for false negatives.
  </para>
</sect2>

</sect1>

<sect1 id="manual.ext.profile_mode.api"
 xreflabel="API">
<title>Extensions for Custom Containers</title>

  <para>
  Many large projects use their own data structures instead of the ones in the
  standard library.  If these data structures are similar in functionality
  to the standard library, they can be instrumented with the same hooks
  that are used to instrument the standard library.
  The instrumentation API is exposed in file
  <code>profiler.h</code> (look for "Instrumentation hooks").
  </para>

</sect1>


<sect1 id="manual.ext.profile_mode.cost_model"
 xreflabel="Cost Model">
<title>Empirical Cost Model</title>

  <para>
  Currently, the cost model uses formulas with predefined relative weights
  for alternative containers or container implementations.  For instance,
  iterating through a vector is X times faster than iterating through a list.
  </para>
  <para>
  (Under development.)
  We are working on customizing this to a particular machine by providing
  an automated way to compute the actual relative weights for operations
  on the given machine.
  </para>
  <para>
  (Under development.)
  We plan to provide a performance parameter database format that can be
  filled in either by hand or by an automated training mechanism.
  The analysis module will then use this database instead of the built-in
  generic parameters.
  </para>

</sect1>


<sect1 id="manual.ext.profile_mode.implementation"
 xreflabel="Implementation">
<title>Implementation Issues</title>


<sect2 id="manual.ext.profile_mode.implementation.stack"
 xreflabel="Stack Traces">
<title>Stack Traces</title>
  <para>
  Accurate stack traces are needed during profiling since we group events by
  call context and dynamic instance.  Without accurate traces, diagnostics
  may be hard to interpret.  For instance, when giving advice to the user
  it is imperative to reference application code, not library code.
  </para>
  <para>
  Currently we are using the libc <code>backtrace</code> routine to get
  stack traces.
  <code>_GLIBCXX_PROFILE_MAX_STACK_DEPTH</code> can be set
  to 0 if you are willing to give up call context information, or to a small
  positive value to reduce run time overhead.
  </para>
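  <para>
  For reference, a minimal standalone use of the same glibc facility looks
  like this (requires <code>execinfo.h</code>; the depth of 32 mirrors the
  default <code>_GLIBCXX_PROFILE_MAX_STACK_DEPTH</code>):
<programlisting>
#include &lt;execinfo.h&gt;
#include &lt;cassert&gt;

int main() {
  void* stack[32];
  int depth = backtrace(stack, 32);  // raw return addresses only
  assert(depth &gt; 0);                 // symbolize later, e.g. with addr2line
  return 0;
}
</programlisting>
  </para>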
</sect2>


<sect2 id="manual.ext.profile_mode.implementation.symbols"
 xreflabel="Symbolization">
<title>Symbolization of Instruction Addresses</title>
  <para>
  The profiling and analysis phases use only instruction addresses.
  An external utility such as addr2line is needed to postprocess the result.
  We do not plan to add symbolization support in the profile extension.
  This would require access to symbol tables, debug information tables,
  external programs or libraries and other system dependent information.
  </para>
</sect2>


<sect2 id="manual.ext.profile_mode.implementation.concurrency"
 xreflabel="Concurrency">
<title>Concurrency</title>
  <para>
  Our current model is simplistic, but precise.
  We cannot afford to approximate because some of our diagnostics require
  precise matching of operations to container instance and call context.
  During profiling, we keep a single information table per diagnostic.
  There is a single lock per information table.
  </para>
</sect2>


<sect2 id="manual.ext.profile_mode.implementation.stdlib-in-proflib"
 xreflabel="Using the Standard Library in the Runtime Library">
<title>Using the Standard Library in the Instrumentation Implementation</title>
  <para>
  As much as we would like to avoid uses of libstdc++ within our
  instrumentation library, containers such as unordered_map are very
  appealing.  We plan to use them as long as they are named properly
  to avoid ambiguity.
  </para>
</sect2>


<sect2 id="manual.ext.profile_mode.implementation.malloc-hooks"
 xreflabel="Malloc Hooks">
<title>Malloc Hooks</title>
  <para>
  User applications/libraries can provide malloc hooks.
  When the implementation of the malloc hooks uses libstdc++, there can
  be an infinite cycle between the profile mode instrumentation and the
  malloc hook code.
  </para>
  <para>
  We protect against reentrance to the profile mode instrumentation code,
  which should avoid this problem in most cases.
  The protection mechanism is thread safe and exception safe.
  This mechanism does not prevent reentrance to the malloc hook itself,
  which could still result in deadlock if, for instance, the malloc hook
  uses non-recursive locks.
  A definitive solution to this problem would be for the profile extension
  to use a custom allocator internally, and perhaps not to use libstdc++.
  </para>
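  <para>
  Such a guard can be sketched as a thread-local flag managed by an RAII
  object.  The names below are illustrative; the library's internal
  mechanism differs in detail:
<programlisting>
#include &lt;cassert&gt;

static thread_local bool __inside = false;
static int __events = 0;

struct __reentrance_guard {
  bool __entered;
  __reentrance_guard() : __entered(!__inside) { if (__entered) __inside = true; }
  ~__reentrance_guard()                       { if (__entered) __inside = false; }
};

void __trace_event(int depth) {
  __reentrance_guard __g;
  if (!__g.__entered) return;   // nested call (e.g. via a malloc hook): dropped
  ++__events;
  if (depth &gt; 0) __trace_event(depth - 1);
}

int main() {
  __trace_event(3);
  assert(__events == 1);        // the reentrant calls were ignored
  return 0;
}
</programlisting>
  </para>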
</sect2>


<sect2 id="manual.ext.profile_mode.implementation.construction-destruction"
 xreflabel="Construction and Destruction of Global Objects">
<title>Construction and Destruction of Global Objects</title>
  <para>
  The profiling library state is initialized at the first call to a profiling
  method.  This allows us to record the construction of all global objects.
  However, we cannot do the same at destruction time.  The trace is written
  by a function registered by <code>atexit</code>, thus invoked by
  <code>exit</code>.
  </para>
</sect2>

</sect1>


<sect1 id="manual.ext.profile_mode.developer"
 xreflabel="Developer Information">
<title>Developer Information</title>

<sect2 id="manual.ext.profile_mode.developer.bigpic"
 xreflabel="Big Picture">
<title>Big Picture</title>

  <para>The profile mode headers are included with
   <code>-D_GLIBCXX_PROFILE</code> through preprocessor directives in
   <code>include/std/*</code>.
  </para>

  <para>Instrumented implementations are provided in
   <code>include/profile/*</code>.  All instrumentation hooks are macros
   defined in <code>include/profile/profiler.h</code>.
  </para>

  <para>All the implementation of the instrumentation hooks is in
   <code>include/profile/impl/*</code>.  Although all the code is included,
   and thus publicly visible, only a small number of functions are called from
   outside this directory.  All calls to hook implementations must be
   made through macros defined in <code>profiler.h</code>.  The macro
   must ensure (1) that the call is guarded against reentrance and
   (2) that the call can be turned off at compile time using a
   <code>-D_GLIBCXX_PROFILE_...</code> compiler option.
  </para>

</sect2>

<sect2 id="manual.ext.profile_mode.developer.howto"
 xreflabel="How To Add A Diagnostic">
<title>How To Add A Diagnostic</title>

  <para>Let's say the diagnostic name is "magic".
  </para>

  <para>If you need to instrument a header not already under
   <code>include/profile/*</code>, first edit the corresponding header
   under <code>include/std/</code> and add a preprocessor directive such
   as the one in <code>include/std/vector</code>:
<programlisting>
#ifdef _GLIBCXX_PROFILE
# include &lt;profile/vector&gt;
#endif
</programlisting>
  </para>

  <para>If the file you need to instrument is not yet under
   <code>include/profile/</code>, make a copy of the one in
   <code>include/debug</code>, or the main implementation.
   You'll need to include the main implementation and inherit the classes
   you want to instrument.  Then define the methods you want to instrument,
   define the instrumentation hooks and add calls to them.
   Look at <code>include/profile/vector</code> for an example.
  </para>

  <para>Add macros for the instrumentation hooks in
   <code>include/profile/impl/profiler.h</code>.
   Hook names must start with <code>__profcxx_</code>.
   Make sure they expand to no code with
   <code>-D_NO_GLIBCXX_PROFILE_MAGIC</code>.
   Make sure all calls to any method in namespace <code>__gnu_profile</code>
   are protected against reentrance using the macro
   <code>_GLIBCXX_PROFILE_REENTRANCE_GUARD</code>.
   All names of methods in namespace <code>__gnu_profile</code> called from
   <code>profiler.h</code> must start with <code>__trace_magic_</code>.
  </para>

  <para>Add the implementation of the diagnostic.
   <itemizedlist>
     <listitem><para>
      Create a new file <code>include/profile/impl/profiler_magic.h</code>.
     </para></listitem>
     <listitem><para>
      Define class <code>__magic_info: public __object_info_base</code>.
      This is the representation of a line in the object table.
      The <code>__merge</code> method is used to aggregate information
      across all dynamic instances created at the same call context.
      The <code>__magnitude</code> method must return an estimate of the
      benefit as a number of small operations, e.g., number of words copied.
      The <code>__write</code> method is used to produce the raw trace.
      The <code>__advice</code> method is used to produce the advice string.
     </para></listitem>
     <listitem><para>
      Define class <code>__magic_stack_info: public __magic_info</code>.
      This defines the content of a line in the stack table.
     </para></listitem>
     <listitem><para>
      Define class <code>__trace_magic: public __trace_base&lt;__magic_info,
      __magic_stack_info&gt;</code>.
      It defines the content of the trace associated with this diagnostic.
     </para></listitem>
    </itemizedlist>
  </para>
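  <para>
  The skeleton of such a diagnostic class might look as follows.  This is a
  compilable sketch only: <code>__object_info_base</code> is replaced here by
  a stand-in, and the real interface in <code>include/profile/impl/*</code>
  has more members:
<programlisting>
#include &lt;cstddef&gt;
#include &lt;string&gt;
#include &lt;cassert&gt;

struct __object_info_base { };      // stand-in for the real base class

struct __magic_info : public __object_info_base {
  std::size_t __count = 0;          // diagnostic-specific state

  void __merge(const __magic_info&amp; __o) { __count += __o.__count; }
  std::size_t __magnitude() const { return __count; }  // benefit in small ops
  std::string __advice() const { return "perform the magic transformation"; }
};

int main() {
  __magic_info a, b;
  a.__count = 3;
  b.__count = 4;
  a.__merge(b);                     // aggregate instances from one call context
  assert(a.__magnitude() == 7);
  return 0;
}
</programlisting>
  </para>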

  <para>Add initialization and reporting calls in
   <code>include/profile/impl/profiler_trace.h</code>.  Use
   <code>__trace_vector_to_list</code> as an example.
  </para>

  <para>Add documentation in file <code>doc/xml/manual/profile_mode.xml</code>.
  </para>
</sect2>
</sect1>

<sect1 id="manual.ext.profile_mode.diagnostics">
<title>Diagnostics</title>

  <para>
  The table below presents all the diagnostics we intend to implement.
  Each diagnostic has a corresponding compile time switch
  <code>-D_GLIBCXX_PROFILE_&lt;diagnostic&gt;</code>.
  Groups of related diagnostics can be turned on with a single switch.
  For instance, <code>-D_GLIBCXX_PROFILE_LOCALITY</code> is equivalent to
  <code>-D_GLIBCXX_PROFILE_SOFTWARE_PREFETCH
  -D_GLIBCXX_PROFILE_RBTREE_LOCALITY</code>.
  </para>

  <para>
  The benefit, cost, expected frequency and accuracy of each diagnostic
  were each given a grade from 1 to 10, where 10 is highest.
  A high benefit means that, if the diagnostic is accurate, the expected
  performance improvement is high.
  A high cost means that turning this diagnostic on leads to high slowdown.
  A high frequency means that we expect this to occur relatively often.
  A high accuracy means that the diagnostic is unlikely to be wrong.
  These grades are not perfect.  They are just meant to guide users with
  specific needs or time budgets.
  </para>

<table frame='all'>
<title>Profile Diagnostics</title>
<tgroup cols='7' align='left' colsep='1' rowsep='1'>
<colspec colname='c1'></colspec>
<colspec colname='c2'></colspec>
<colspec colname='c3'></colspec>
<colspec colname='c4'></colspec>
<colspec colname='c5'></colspec>
<colspec colname='c6'></colspec>
<colspec colname='c7'></colspec>

<thead>
  <row>
    <entry>Group</entry>
    <entry>Flag</entry>
    <entry>Benefit</entry>
    <entry>Cost</entry>
    <entry>Freq.</entry>
    <entry>Accuracy</entry>
    <entry>Implemented</entry>
  </row>
</thead>
<tbody>
  <row>
    <entry><ulink url="#manual.ext.profile_mode.analysis.containers">
    CONTAINERS</ulink></entry>
    <entry><ulink url="#manual.ext.profile_mode.analysis.hashtable_too_small">
    HASHTABLE_TOO_SMALL</ulink></entry>
    <entry>10</entry>
    <entry>1</entry>
    <entry></entry>
    <entry>10</entry>
    <entry>yes</entry>
  </row>
  <row>
    <entry></entry>
    <entry><ulink url="#manual.ext.profile_mode.analysis.hashtable_too_large">
    HASHTABLE_TOO_LARGE</ulink></entry>
    <entry>5</entry>
    <entry>1</entry>
    <entry></entry>
    <entry>10</entry>
    <entry>yes</entry>
  </row>
  <row>
    <entry></entry>
    <entry><ulink url="#manual.ext.profile_mode.analysis.inefficient_hash">
    INEFFICIENT_HASH</ulink></entry>
    <entry>7</entry>
    <entry>3</entry>
    <entry></entry>
    <entry>10</entry>
    <entry>yes</entry>
  </row>
  <row>
    <entry></entry>
    <entry><ulink url="#manual.ext.profile_mode.analysis.vector_too_small">
    VECTOR_TOO_SMALL</ulink></entry>
    <entry>8</entry>
    <entry>1</entry>
    <entry></entry>
    <entry>10</entry>
    <entry>yes</entry>
  </row>
  <row>
    <entry></entry>
    <entry><ulink url="#manual.ext.profile_mode.analysis.vector_too_large">
    VECTOR_TOO_LARGE</ulink></entry>
    <entry>5</entry>
    <entry>1</entry>
    <entry></entry>
    <entry>10</entry>
    <entry>yes</entry>
  </row>
  <row>
    <entry></entry>
    <entry><ulink url="#manual.ext.profile_mode.analysis.vector_to_hashtable">
    VECTOR_TO_HASHTABLE</ulink></entry>
    <entry>7</entry>
    <entry>7</entry>
    <entry></entry>
    <entry>10</entry>
    <entry>no</entry>
  </row>
  <row>
    <entry></entry>
    <entry><ulink url="#manual.ext.profile_mode.analysis.hashtable_to_vector">
    HASHTABLE_TO_VECTOR</ulink></entry>
    <entry>7</entry>
    <entry>7</entry>
    <entry></entry>
    <entry>10</entry>
    <entry>no</entry>
  </row>
  <row>
    <entry></entry>
    <entry><ulink url="#manual.ext.profile_mode.analysis.vector_to_list">
    VECTOR_TO_LIST</ulink></entry>
    <entry>8</entry>
    <entry>5</entry>
    <entry></entry>
    <entry>10</entry>
    <entry>yes</entry>
  </row>
  <row>
    <entry></entry>
    <entry><ulink url="#manual.ext.profile_mode.analysis.list_to_vector">
    LIST_TO_VECTOR</ulink></entry>
    <entry>10</entry>
    <entry>5</entry>
    <entry></entry>
    <entry>10</entry>
    <entry>no</entry>
  </row>
  <row>
    <entry></entry>
    <entry><ulink url="#manual.ext.profile_mode.analysis.assoc_ord_to_unord">
    ORDERED_TO_UNORDERED</ulink></entry>
    <entry>10</entry>
    <entry>5</entry>
    <entry></entry>
    <entry>10</entry>
    <entry>only map/unordered_map</entry>
  </row>
  <row>
    <entry><ulink url="#manual.ext.profile_mode.analysis.algorithms">
    ALGORITHMS</ulink></entry>
    <entry><ulink url="#manual.ext.profile_mode.analysis.algorithms.sort">
    SORT</ulink></entry>
    <entry>7</entry>
    <entry>8</entry>
    <entry></entry>
    <entry>7</entry>
    <entry>no</entry>
  </row>
  <row>
    <entry><ulink url="#manual.ext.profile_mode.analysis.locality">
    LOCALITY</ulink></entry>
    <entry><ulink url="#manual.ext.profile_mode.analysis.locality.sw_prefetch">
    SOFTWARE_PREFETCH</ulink></entry>
    <entry>8</entry>
    <entry>8</entry>
    <entry></entry>
    <entry>5</entry>
    <entry>no</entry>
  </row>
  <row>
    <entry></entry>
    <entry><ulink url="#manual.ext.profile_mode.analysis.locality.linked">
    RBTREE_LOCALITY</ulink></entry>
    <entry>4</entry>
    <entry>8</entry>
    <entry></entry>
    <entry>5</entry>
    <entry>no</entry>
  </row>
  <row>
    <entry></entry>
    <entry><ulink url="#manual.ext.profile_mode.analysis.mthread.false_share">
    FALSE_SHARING</ulink></entry>
    <entry>8</entry>
    <entry>10</entry>
    <entry></entry>
    <entry>10</entry>
    <entry>no</entry>
  </row>
</tbody>
</tgroup>
</table>
883
884<sect2 id="manual.ext.profile_mode.analysis.template"
885 xreflabel="Template">
886<title>Diagnostic Template</title>
887<itemizedlist>
888  <listitem><para><emphasis>Switch:</emphasis>
889  <code>_GLIBCXX_PROFILE_&lt;diagnostic&gt;</code>.
890  </para></listitem>
891  <listitem><para><emphasis>Goal:</emphasis>  What problem will it diagnose?
892  </para></listitem>
893  <listitem><para><emphasis>Fundamentals:</emphasis>
894  What is the fundamental reason why this is a problem?</para></listitem>
895  <listitem><para><emphasis>Sample runtime reduction:</emphasis>
896  Percentage reduction in execution time.  When reduction is more than
897  a constant factor, describe the reduction rate formula.
898  </para></listitem>
899  <listitem><para><emphasis>Recommendation:</emphasis>
900  What will the advice look like?</para></listitem>
901  <listitem><para><emphasis>To instrument:</emphasis>
902  What libstdc++ components need to be instrumented?</para></listitem>
903  <listitem><para><emphasis>Analysis:</emphasis>
904  How do we decide when to issue the advice?</para></listitem>
905  <listitem><para><emphasis>Cost model:</emphasis>
906  How do we measure benefits?  Math goes here.</para></listitem>
907  <listitem><para><emphasis>Example:</emphasis>
908<programlisting>
909program code
910...
911advice sample
912</programlisting>
913</para></listitem>
914</itemizedlist>
915</sect2>
916
917
918<sect2 id="manual.ext.profile_mode.analysis.containers"
919 xreflabel="Containers">
920<title>Containers</title>
921
922<para>
923<emphasis>Switch:</emphasis>
924  <code>_GLIBCXX_PROFILE_CONTAINERS</code>.
925</para>
926
927<sect3 id="manual.ext.profile_mode.analysis.hashtable_too_small"
928 xreflabel="Hashtable Too Small">
929<title>Hashtable Too Small</title>
930<itemizedlist>
931  <listitem><para><emphasis>Switch:</emphasis>
932  <code>_GLIBCXX_PROFILE_HASHTABLE_TOO_SMALL</code>.
933  </para></listitem>
934  <listitem><para><emphasis>Goal:</emphasis> Detect hashtables with many
935  rehash operations, small construction size and large destruction size.
936  </para></listitem>
937  <listitem><para><emphasis>Fundamentals:</emphasis> Rehashing is very
938  expensive: read the entire content, follow the chains within each bucket,
939  evaluate the hash function and place each element at a new location, in a different order.</para></listitem>
940  <listitem><para><emphasis>Sample runtime reduction:</emphasis> 36%
941  for code similar to the example below.
942  </para></listitem>
943  <listitem><para><emphasis>Recommendation:</emphasis>
944  Set initial size to N at construction site S.
945  </para></listitem>
946  <listitem><para><emphasis>To instrument:</emphasis>
947  <code>unordered_set, unordered_map</code> constructor, destructor, rehash.
948  </para></listitem>
949  <listitem><para><emphasis>Analysis:</emphasis>
950  For each dynamic instance of <code>unordered_[multi]set|map</code>,
951  record initial size and call context of the constructor.
952  Record size increase, if any, after each relevant operation such as insert.
953  Record the estimated rehash cost.</para></listitem>
954  <listitem><para><emphasis>Cost model:</emphasis>
955  Number of individual rehash operations * cost per rehash.</para></listitem>
956  <listitem><para><emphasis>Example:</emphasis>
957<programlisting>
9581 unordered_set&lt;int&gt; us;
9592 for (int k = 0; k &lt; 1000000; ++k) {
9603   us.insert(k);
9614 }
962
963foo.cc:1: advice: Changing initial unordered_set size from 10 to 1000000 saves 1025530 rehash operations.
964</programlisting>
965</para></listitem>
966</itemizedlist>
967</sect3>
968
969
970<sect3 id="manual.ext.profile_mode.analysis.hashtable_too_large"
971 xreflabel="Hashtable Too Large">
972<title>Hashtable Too Large</title>
973<itemizedlist>
974  <listitem><para><emphasis>Switch:</emphasis>
975  <code>_GLIBCXX_PROFILE_HASHTABLE_TOO_LARGE</code>.
976  </para></listitem>
977  <listitem><para><emphasis>Goal:</emphasis> Detect hashtables which are
978  never filled up because fewer elements than reserved are ever
979  inserted.
980  </para></listitem>
981  <listitem><para><emphasis>Fundamentals:</emphasis> Save memory, which
982  is good in itself and may also improve memory reference performance through
983  fewer cache and TLB misses.</para></listitem>
984  <listitem><para><emphasis>Sample runtime reduction:</emphasis> unknown.
985  </para></listitem>
986  <listitem><para><emphasis>Recommendation:</emphasis>
987  Set initial size to N at construction site S.
988  </para></listitem>
989  <listitem><para><emphasis>To instrument:</emphasis>
990  <code>unordered_set, unordered_map</code> constructor, destructor, rehash.
991  </para></listitem>
992  <listitem><para><emphasis>Analysis:</emphasis>
993  For each dynamic instance of <code>unordered_[multi]set|map</code>,
994  record initial size and call context of the constructor, and correlate it
995  with its size at destruction time.
996  </para></listitem>
997  <listitem><para><emphasis>Cost model:</emphasis>
998  Number of iteration operations + memory saved.</para></listitem>
999  <listitem><para><emphasis>Example:</emphasis>
1000<programlisting>
10011 vector&lt;unordered_set&lt;int&gt;&gt; v(100000, unordered_set&lt;int&gt;(100)) ;
10022 for (int k = 0; k &lt; 100000; ++k) {
10033   for (int j = 0; j &lt; 10; ++j) {
10044     v[k].insert(k + j);
10055   }
10066 }
1007
1008foo.cc:1: advice: Changing initial unordered_set size from 100 to 10 saves N
1009bytes of memory and M iteration steps.
1010</programlisting>
1011</para></listitem>
1012</itemizedlist>
1013</sect3>
1014
1015<sect3 id="manual.ext.profile_mode.analysis.inefficient_hash"
1016 xreflabel="Inefficient Hash">
1017<title>Inefficient Hash</title>
1018<itemizedlist>
1019  <listitem><para><emphasis>Switch:</emphasis>
1020  <code>_GLIBCXX_PROFILE_INEFFICIENT_HASH</code>.
1021  </para></listitem>
1022  <listitem><para><emphasis>Goal:</emphasis> Detect hashtables with polarized
1023  distribution.
1024  </para></listitem>
1025  <listitem><para><emphasis>Fundamentals:</emphasis> A non-uniform
1026  distribution may lead to long chains, thus possibly increasing complexity
1027  by a factor up to the number of elements.
1028  </para></listitem>
1029  <listitem><para><emphasis>Sample runtime reduction:</emphasis> factor up
1030   to container size.
1031  </para></listitem>
1032  <listitem><para><emphasis>Recommendation:</emphasis> Change hash function
1033  for container built at site S.  Distribution score = N.  Access score = S.
1034  Longest chain = C, in bucket B.
1035  </para></listitem>
1036  <listitem><para><emphasis>To instrument:</emphasis>
1037  <code>unordered_set, unordered_map</code> constructor, destructor, [],
1038  insert, iterator.
1039  </para></listitem>
1040  <listitem><para><emphasis>Analysis:</emphasis>
1041  Count the exact number of link traversals.
1042  </para></listitem>
1043  <listitem><para><emphasis>Cost model:</emphasis>
1044  Total number of links traversed.</para></listitem>
1045  <listitem><para><emphasis>Example:</emphasis>
1046<programlisting>
1047class dumb_hash {
1048 public:
1049  size_t operator() (int i) const { return 0; }
1050};
1051...
1052  unordered_set&lt;int, dumb_hash&gt; hs;
1053  ...
1054  for (int i = 0; i &lt; COUNT; ++i) {
1055    hs.find(i);
1056  }
1057</programlisting>
1058</para></listitem>
1059</itemizedlist>
1060</sect3>
1061
1062<sect3 id="manual.ext.profile_mode.analysis.vector_too_small"
1063 xreflabel="Vector Too Small">
1064<title>Vector Too Small</title>
1065<itemizedlist>
1066  <listitem><para><emphasis>Switch:</emphasis>
1067  <code>_GLIBCXX_PROFILE_VECTOR_TOO_SMALL</code>.
1068  </para></listitem>
1069  <listitem><para><emphasis>Goal:</emphasis> Detect vectors with many
1070  resize operations, small construction size and large destruction size.
1071  </para></listitem>
1072  <listitem><para><emphasis>Fundamentals:</emphasis> Resizing can be expensive.
1073  Copying large amounts of data takes time.  Resizing many small vectors may
1074  have allocation overhead and affect locality.</para></listitem>
1075  <listitem><para><emphasis>Sample runtime reduction:</emphasis> unknown.
1076  </para></listitem>
1077  <listitem><para><emphasis>Recommendation:</emphasis>
1078  Set initial size to N at construction site S.</para></listitem>
1079  <listitem><para><emphasis>To instrument:</emphasis> <code>vector</code>.
1080  </para></listitem>
1081  <listitem><para><emphasis>Analysis:</emphasis>
1082  For each dynamic instance of <code>vector</code>,
1083  record initial size and call context of the constructor.
1084  Record size increase, if any, after each relevant operation such as
1085  <code>push_back</code>.  Record the estimated resize cost.
1086  </para></listitem>
1087  <listitem><para><emphasis>Cost model:</emphasis>
1088  Total number of words copied * time to copy a word.</para></listitem>
1089  <listitem><para><emphasis>Example:</emphasis>
1090<programlisting>
10911 vector&lt;int&gt; v;
10922 for (int k = 0; k &lt; 1000000; ++k) {
10933   v.push_back(k);
10944 }
1095
1096foo.cc:1: advice: Changing initial vector size from 10 to 1000000 saves
1097copying 4000000 bytes and 20 memory allocations and deallocations.
1098</programlisting>
1099</para></listitem>
1100</itemizedlist>
1101</sect3>
1102
1103<sect3 id="manual.ext.profile_mode.analysis.vector_too_large"
1104 xreflabel="Vector Too Large">
1105<title>Vector Too Large</title>
1106<itemizedlist>
1107  <listitem><para><emphasis>Switch:</emphasis>
1108  <code>_GLIBCXX_PROFILE_VECTOR_TOO_LARGE</code>
1109  </para></listitem>
1110  <listitem><para><emphasis>Goal:</emphasis> Detect vectors which are
1111  never filled up because fewer elements than reserved are ever
1112  inserted.
1113  </para></listitem>
1114  <listitem><para><emphasis>Fundamentals:</emphasis> Save memory, which
1115  is good in itself and may also improve memory reference performance through
1116  fewer cache and TLB misses.</para></listitem>
1117  <listitem><para><emphasis>Sample runtime reduction:</emphasis> unknown.
1118  </para></listitem>
1119  <listitem><para><emphasis>Recommendation:</emphasis>
1120  Set initial size to N at construction site S.</para></listitem>
1121  <listitem><para><emphasis>To instrument:</emphasis> <code>vector</code>.
1122  </para></listitem>
1123  <listitem><para><emphasis>Analysis:</emphasis>
1124  For each dynamic instance of <code>vector</code>,
1125  record initial size and call context of the constructor, and correlate it
1126  with its size at destruction time.</para></listitem>
1127  <listitem><para><emphasis>Cost model:</emphasis>
1128  Total amount of memory saved.</para></listitem>
1129  <listitem><para><emphasis>Example:</emphasis>
1130<programlisting>
11311 vector&lt;vector&lt;int&gt;&gt; v(100000, vector&lt;int&gt;(100)) ;
11322 for (int k = 0; k &lt; 100000; ++k) {
11333   for (int j = 0; j &lt; 10; ++j) {
11344     v[k].push_back(k + j);
11355   }
11366 }
1137
1138foo.cc:1: advice: Changing initial vector size from 100 to 10 saves N
1139bytes of memory and may reduce the number of cache and TLB misses.
1140</programlisting>
1141</para></listitem>
1142</itemizedlist>
1143</sect3>
1144
1145<sect3 id="manual.ext.profile_mode.analysis.vector_to_hashtable"
1146 xreflabel="Vector to Hashtable">
1147<title>Vector to Hashtable</title>
1148<itemizedlist>
1149  <listitem><para><emphasis>Switch:</emphasis>
1150  <code>_GLIBCXX_PROFILE_VECTOR_TO_HASHTABLE</code>.
1151  </para></listitem>
1152  <listitem><para><emphasis>Goal:</emphasis> Detect uses of
1153  <code>vector</code> that can be substituted with <code>unordered_set</code>
1154  to reduce execution time.
1155  </para></listitem>
1156  <listitem><para><emphasis>Fundamentals:</emphasis>
1157  Linear search in a vector is very expensive, whereas searching in a hashtable
1158  is very quick.</para></listitem>
1159  <listitem><para><emphasis>Sample runtime reduction:</emphasis> factor up
1160   to container size.
1161  </para></listitem>
1162  <listitem><para><emphasis>Recommendation:</emphasis> Replace
1163  <code>vector</code> with <code>unordered_set</code> at site S.
1164  </para></listitem>
1165  <listitem><para><emphasis>To instrument:</emphasis> <code>vector</code>
1166  operations and access methods.</para></listitem>
1167  <listitem><para><emphasis>Analysis:</emphasis>
1168  For each dynamic instance of <code>vector</code>,
1169  record call context of the constructor.  Issue the advice only if the
1170  only methods called on this <code>vector</code> are <code>push_back</code>,
1171  <code>insert</code> and <code>find</code>.
1172  </para></listitem>
1173  <listitem><para><emphasis>Cost model:</emphasis>
1174  (cost(vector::push_back) + cost(vector::insert) + cost(find, vector)) -
1175  (cost(unordered_set::insert) + cost(unordered_set::find)).
1176  </para></listitem>
1177  <listitem><para><emphasis>Example:</emphasis>
1178<programlisting>
11791  vector&lt;int&gt; v;
1180...
11812  for (int i = 0; i &lt; 1000; ++i) {
11823    find(v.begin(), v.end(), i);
11834  }
1184
1185foo.cc:1: advice: Changing "vector" to "unordered_set" will save about 500,000
1186comparisons.
1187</programlisting>
1188</para></listitem>
1189</itemizedlist>
1190</sect3>
1191
1192<sect3 id="manual.ext.profile_mode.analysis.hashtable_to_vector"
1193 xreflabel="Hashtable to Vector">
1194<title>Hashtable to Vector</title>
1195<itemizedlist>
1196  <listitem><para><emphasis>Switch:</emphasis>
1197  <code>_GLIBCXX_PROFILE_HASHTABLE_TO_VECTOR</code>.
1198  </para></listitem>
1199  <listitem><para><emphasis>Goal:</emphasis> Detect uses of
1200  <code>unordered_set</code> that can be substituted with <code>vector</code>
1201  to reduce execution time.
1202  </para></listitem>
1203  <listitem><para><emphasis>Fundamentals:</emphasis>
1204  Hashtable iterator is slower than vector iterator.</para></listitem>
1205  <listitem><para><emphasis>Sample runtime reduction:</emphasis> 95%.
1206  </para></listitem>
1207  <listitem><para><emphasis>Recommendation:</emphasis> Replace
1208  <code>unordered_set</code> with <code>vector</code> at site S.
1209  </para></listitem>
1210  <listitem><para><emphasis>To instrument:</emphasis> <code>unordered_set</code>
1211  operations and access methods.</para></listitem>
1212  <listitem><para><emphasis>Analysis:</emphasis>
1213  For each dynamic instance of <code>unordered_set</code>,
1214  record call context of the constructor.  Issue the advice only if the
1215  number of <code>find</code>, <code>insert</code> and <code>[]</code>
1216  operations on this <code>unordered_set</code> is small relative to the
1217  number of elements, and methods <code>begin</code> or <code>end</code>
1218  are invoked (suggesting iteration).</para></listitem>
1219  <listitem><para><emphasis>Cost model:</emphasis>
1220  Number of indirections saved by iterating contiguous storage.</para></listitem>
1221  <listitem><para><emphasis>Example:</emphasis>
1222<programlisting>
12231  unordered_set&lt;int&gt; us;
1224...
12252  int s = 0;
12263  for (unordered_set&lt;int&gt;::iterator it = us.begin(); it != us.end(); ++it) {
12274    s += *it;
12285  }
1229
1230foo.cc:1: advice: Changing "unordered_set" to "vector" will save about N
1231indirections and may achieve better data locality.
1232</programlisting>
1233</para></listitem>
1234</itemizedlist>
1235</sect3>
1236
1237<sect3 id="manual.ext.profile_mode.analysis.vector_to_list"
1238 xreflabel="Vector to List">
1239<title>Vector to List</title>
1240<itemizedlist>
1241  <listitem><para><emphasis>Switch:</emphasis>
1242  <code>_GLIBCXX_PROFILE_VECTOR_TO_LIST</code>.
1243  </para></listitem>
1244  <listitem><para><emphasis>Goal:</emphasis> Detect cases where
1245  <code>vector</code> could be substituted with <code>list</code> for
1246  better performance.
1247  </para></listitem>
1248  <listitem><para><emphasis>Fundamentals:</emphasis>
1249  Inserting in the middle of a vector is expensive compared to inserting in a
1250  list.
1251  </para></listitem>
1252  <listitem><para><emphasis>Sample runtime reduction:</emphasis> factor up to
1253   container size.
1254  </para></listitem>
1255  <listitem><para><emphasis>Recommendation:</emphasis> Replace vector with list
1256  at site S.</para></listitem>
1257  <listitem><para><emphasis>To instrument:</emphasis> <code>vector</code>
1258  operations and access methods.</para></listitem>
1259  <listitem><para><emphasis>Analysis:</emphasis>
1260  For each dynamic instance of <code>vector</code>,
1261  record the call context of the constructor.  Record the overhead of each
1262  <code>insert</code> operation based on current size and insert position.
1263  Report instances with high insertion overhead.</para></listitem>
1264  </para></listitem>
1265  <listitem><para><emphasis>Cost model:</emphasis>
1266  (Sum(cost(vector::method)) - Sum(cost(list::method)), for
1267  method in [push_back, insert, erase])
1268  + (Cost(iterate vector) - Cost(iterate list))</para></listitem>
1269  <listitem><para><emphasis>Example:</emphasis>
1270<programlisting>
12711  vector&lt;int&gt; v;
12722  for (int i = 0; i &lt; 10000; ++i) {
12733    v.insert(v.begin(), i);
12744  }
1275
1276foo.cc:1: advice: Changing "vector" to "list" will save about 5,000,000
1277operations.
1278</programlisting>
1279</para></listitem>
1280</itemizedlist>
1281</sect3>
1282
1283<sect3 id="manual.ext.profile_mode.analysis.list_to_vector"
1284 xreflabel="List to Vector">
1285<title>List to Vector</title>
1286<itemizedlist>
1287  <listitem><para><emphasis>Switch:</emphasis>
1288  <code>_GLIBCXX_PROFILE_LIST_TO_VECTOR</code>.
1289  </para></listitem>
1290  <listitem><para><emphasis>Goal:</emphasis> Detect cases where
1291  <code>list</code> could be substituted with <code>vector</code> for
1292  better performance.
1293  </para></listitem>
1294  <listitem><para><emphasis>Fundamentals:</emphasis>
1295  Iterating through a vector is faster than through a list.
1296  </para></listitem>
1297  <listitem><para><emphasis>Sample runtime reduction:</emphasis> 64%.
1298  </para></listitem>
1299  <listitem><para><emphasis>Recommendation:</emphasis> Replace list with vector
1300  at site S.</para></listitem>
1301  <listitem><para><emphasis>To instrument:</emphasis> <code>list</code>
1302  operations and access methods.</para></listitem>
1303  <listitem><para><emphasis>Analysis:</emphasis>
1304  Issue the advice if there are no <code>insert</code> operations.
1305  </para></listitem>
1306  <listitem><para><emphasis>Cost model:</emphasis>
1307    (Sum(cost(vector::method)) - Sum(cost(list::method)), for
1308  method in [push_back, insert, erase])
1309  + (Cost(iterate vector) - Cost(iterate list))</para></listitem>
1310  <listitem><para><emphasis>Example:</emphasis>
1311<programlisting>
13121  list&lt;int&gt; l;
1313...
13142  int sum = 0;
13153  for (list&lt;int&gt;::iterator it = l.begin(); it != l.end(); ++it) {
13164    sum += *it;
13175  }
1318
1319foo.cc:1: advice: Changing "list" to "vector" will save about 1000000 indirect
1320memory references.
1321</programlisting>
1322</para></listitem>
1323</itemizedlist>
1324</sect3>
1325
1326<sect3 id="manual.ext.profile_mode.analysis.list_to_slist"
1327 xreflabel="List to Forward List">
1328<title>List to Forward List (Slist)</title>
1329<itemizedlist>
1330  <listitem><para><emphasis>Switch:</emphasis>
1331  <code>_GLIBCXX_PROFILE_LIST_TO_SLIST</code>.
1332  </para></listitem>
1333  <listitem><para><emphasis>Goal:</emphasis> Detect cases where
1334  <code>list</code> could be substituted with <code>forward_list</code> for
1335  better performance.
1336  </para></listitem>
1337  <listitem><para><emphasis>Fundamentals:</emphasis>
1338  The memory footprint of a forward_list is smaller than that of a list.
1339  This has beneficial effects on the memory subsystem, e.g., fewer cache misses.
1340  </para></listitem>
1341  <listitem><para><emphasis>Sample runtime reduction:</emphasis> 40%.
1342  Note that the reduction is only noticeable if the size of the forward_list
1343  node is in fact smaller than that of the list node.  For memory allocators
1344  with size classes, you will only notice an effect when the two node sizes
1345  belong to different allocator size classes.
1346  </para></listitem>
1347  <listitem><para><emphasis>Recommendation:</emphasis> Replace list with
1348  forward_list at site S.</para></listitem>
1349  <listitem><para><emphasis>To instrument:</emphasis> <code>list</code>
1350  operations and iteration methods.</para></listitem>
1351  <listitem><para><emphasis>Analysis:</emphasis>
1352  Issue the advice if there are no backward traversals
1353  or insertions before a given node.
1354  </para></listitem>
1355  <listitem><para><emphasis>Cost model:</emphasis>
1356  Always true.</para></listitem>
1357  <listitem><para><emphasis>Example:</emphasis>
1358<programlisting>
13591  list&lt;int&gt; l;
1360...
13612  int sum = 0;
13623  for (list&lt;int&gt;::iterator it = l.begin(); it != l.end(); ++it) {
13634    sum += *it;
13645  }
1365
1366foo.cc:1: advice: Change "list" to "forward_list".
1367</programlisting>
1368</para></listitem>
1369</itemizedlist>
1370</sect3>
1371
1372<sect3 id="manual.ext.profile_mode.analysis.assoc_ord_to_unord"
1373 xreflabel="Ordered to Unordered Associative Container">
1374<title>Ordered to Unordered Associative Container</title>
1375<itemizedlist>
1376  <listitem><para><emphasis>Switch:</emphasis>
1377  <code>_GLIBCXX_PROFILE_ORDERED_TO_UNORDERED</code>.
1378  </para></listitem>
1379  <listitem><para><emphasis>Goal:</emphasis>  Detect cases where ordered
1380  associative containers can be replaced with unordered ones.
1381  </para></listitem>
1382  <listitem><para><emphasis>Fundamentals:</emphasis>
1383  Insert and search are quicker in a hashtable than in
1384  a red-black tree.</para></listitem>
1385  <listitem><para><emphasis>Sample runtime reduction:</emphasis> 52%.
1386  </para></listitem>
1387  <listitem><para><emphasis>Recommendation:</emphasis>
1388  Replace set with unordered_set at site S.</para></listitem>
1389  <listitem><para><emphasis>To instrument:</emphasis>
1390  <code>set</code>, <code>multiset</code>, <code>map</code>,
1391  <code>multimap</code> methods.</para></listitem>
1392  <listitem><para><emphasis>Analysis:</emphasis>
1393  Issue the advice only if we are not using operator <code>++</code> on any
1394  iterator on a particular <code>[multi]set|map</code>.
1395  </para></listitem>
1396  <listitem><para><emphasis>Cost model:</emphasis>
1397  (Sum(cost(hashtable::method)) - Sum(cost(rbtree::method)), for
1398  method in [insert, erase, find])
1399  + (Cost(iterate hashtable) - Cost(iterate rbtree))</para></listitem>
1400  <listitem><para><emphasis>Example:</emphasis>
1401<programlisting>
14021  set&lt;int&gt; s;
14032  for (int i = 0; i &lt; 100000; ++i) {
14043    s.insert(i);
14054  }
14065  int sum = 0;
14076  for (int i = 0; i &lt; 100000; ++i) {
14087    sum += *s.find(i);
14098  }
1410</programlisting>
1411</para></listitem>
1412</itemizedlist>
1413</sect3>
1414
1415</sect2>
1416
1417
1418
1419<sect2 id="manual.ext.profile_mode.analysis.algorithms"
1420 xreflabel="Algorithms">
1421<title>Algorithms</title>
1422
1423  <para><emphasis>Switch:</emphasis>
1424  <code>_GLIBCXX_PROFILE_ALGORITHMS</code>.
1425  </para>
1426
1427<sect3 id="manual.ext.profile_mode.analysis.algorithms.sort"
1428 xreflabel="Sorting">
1429<title>Sort Algorithm Performance</title>
1430<itemizedlist>
1431  <listitem><para><emphasis>Switch:</emphasis>
1432  <code>_GLIBCXX_PROFILE_SORT</code>.
1433  </para></listitem>
1434  <listitem><para><emphasis>Goal:</emphasis> Give measure of sort algorithm
1435  performance based on actual input.  For instance, advise Radix Sort over
1436  Quick Sort for a particular call context.
1437  </para></listitem>
1438  <listitem><para><emphasis>Fundamentals:</emphasis>
1439  See papers:
1440  <ulink url="http://portal.acm.org/citation.cfm?doid=1065944.1065981">
1441  A framework for adaptive algorithm selection in STAPL</ulink> and
1442  <ulink url="http://ieeexplore.ieee.org/search/wrapper.jsp?arnumber=4228227">
1443  Optimizing Sorting with Machine Learning Algorithms</ulink>.
1444  </para></listitem>
1445  <listitem><para><emphasis>Sample runtime reduction:</emphasis> 60%.
1446  </para></listitem>
1447  <listitem><para><emphasis>Recommendation:</emphasis> Change sort algorithm
1448  at site S from X Sort to Y Sort.</para></listitem>
1449  <listitem><para><emphasis>To instrument:</emphasis> <code>sort</code>
1450  algorithm.</para></listitem>
1451  <listitem><para><emphasis>Analysis:</emphasis>
1452  Issue the advice if the cost model tells us that another sort algorithm
1453  would do better on this input.  Requires us to know what algorithm we
1454  are using in our sort implementation in release mode.</para></listitem>
1455  <listitem><para><emphasis>Cost model:</emphasis>
1456  Runtime(algo) for algo in [radix, quick, merge, ...]</para></listitem>
1457  <listitem><para><emphasis>Example:</emphasis>
1458<programlisting>
1459</programlisting>
1460</para></listitem>
1461</itemizedlist>
1462</sect3>
1463
1464</sect2>
1465
1466
1467<sect2 id="manual.ext.profile_mode.analysis.locality"
1468 xreflabel="Data Locality">
1469<title>Data Locality</title>
1470
1471  <para><emphasis>Switch:</emphasis>
1472  <code>_GLIBCXX_PROFILE_LOCALITY</code>.
1473  </para>
1474
1475<sect3 id="manual.ext.profile_mode.analysis.locality.sw_prefetch"
1476 xreflabel="Need Software Prefetch">
1477<title>Need Software Prefetch</title>
1478<itemizedlist>
1479  <listitem><para><emphasis>Switch:</emphasis>
1480  <code>_GLIBCXX_PROFILE_SOFTWARE_PREFETCH</code>.
1481  </para></listitem>
1482  <listitem><para><emphasis>Goal:</emphasis> Discover sequences of indirect
1483  memory accesses that are not regular and thus cannot be predicted by
1484  hardware prefetchers.
1485  </para></listitem>
1486  <listitem><para><emphasis>Fundamentals:</emphasis>
1487  Indirect references are hard to predict and are very expensive when they
1488  miss in caches.</para></listitem>
1489  <listitem><para><emphasis>Sample runtime reduction:</emphasis> 25%.
1490  </para></listitem>
1491  <listitem><para><emphasis>Recommendation:</emphasis> Insert prefetch
1492  instruction.</para></listitem>
1493  <listitem><para><emphasis>To instrument:</emphasis> Vector iterator and
1494  access operator [].
1495  </para></listitem>
1496  <listitem><para><emphasis>Analysis:</emphasis>
1497  First, get cache line size and page size from system.
1498  Then record iterator dereference sequences for which the value is a pointer.
1499  For each sequence within a container, issue a warning if successive pointer
1500  addresses are not within cache lines and do not form a linear pattern
1501  (otherwise they may be prefetched by hardware).
1502  If they also step across page boundaries, make the warning stronger.
1503  </para>
1504  <para>The same analysis applies to containers other than vector.
1505  However, we cannot give the same advice for linked structures, such as list,
1506  as there is no random access to the n-th element.  The user may still be
1507  able to benefit from this information, for instance by employing frays
1508  (user-level lightweight threads) to hide the latency of chasing pointers.
1509  </para>
1510  <para>
1511  This analysis is a little oversimplified.  A better cost model could be
1512  created by understanding the capability of the hardware prefetcher.
1513  This model could be trained automatically by running a set of synthetic
1514  cases.
1515  </para>
1516  </listitem>
1517  <listitem><para><emphasis>Cost model:</emphasis>
1518  Total distance between pointer values of successive elements in vectors
1519  of pointers.</para></listitem>
1520  <listitem><para><emphasis>Example:</emphasis>
1521<programlisting>
15221 int zero = 0;
15232 vector&lt;int*&gt; v(10000000, &amp;zero);
15243 for (int k = 0; k &lt; 10000000; ++k) {
15254   v[random() % 10000000] = new int(k);
15265 }
15276 for (int j = 0; j &lt; 10000000; ++j) {
15287   count += (*v[j] == 0 ? 0 : 1);
15298 }
1530
1531foo.cc:7: advice: Insert prefetch instruction.
1532</programlisting>
1533</para></listitem>
1534</itemizedlist>
1535</sect3>
1536
1537<sect3 id="manual.ext.profile_mode.analysis.locality.linked"
1538 xreflabel="Linked Structure Locality">
1539<title>Linked Structure Locality</title>
1540<itemizedlist>
1541  <listitem><para><emphasis>Switch:</emphasis>
1542  <code>_GLIBCXX_PROFILE_RBTREE_LOCALITY</code>.
1543  </para></listitem>
1544  <listitem><para><emphasis>Goal:</emphasis> Give measure of locality of
1545  objects stored in linked structures (lists, red-black trees and hashtables)
1546  with respect to their actual traversal patterns.
1547  </para></listitem>
1548  <listitem><para><emphasis>Fundamentals:</emphasis> Allocation can be tuned
1549  to a specific traversal pattern, resulting in better data locality.
1550  See paper:
1551  <ulink url="http://www.springerlink.com/content/8085744l00x72662/">
1552  Custom Memory Allocation for Free</ulink>.
1553  </para></listitem>
1554  <listitem><para><emphasis>Sample runtime reduction:</emphasis> 30%.
1555  </para></listitem>
1556  <listitem><para><emphasis>Recommendation:</emphasis>
1557  High scatter score N for container built at site S.
1558  Consider changing the allocation sequence or choosing a structure-conscious
1559  allocator.</para></listitem>
1560  <listitem><para><emphasis>To instrument:</emphasis> Methods of all
1561  containers using linked structures.</para></listitem>
1562  <listitem><para><emphasis>Analysis:</emphasis>
1563  First, get cache line size and page size from system.
1564  Then record the number of successive elements that are on a different cache line
1565  or page, for each traversal method such as <code>find</code>.  Give advice
1566  only if the ratio between this number and the number of total node hops
1567  is above a threshold.</para></listitem>
1568  <listitem><para><emphasis>Cost model:</emphasis>
1569  Sum(same_cache_line(this,previous))</para></listitem>
1570  <listitem><para><emphasis>Example:</emphasis>
1571<programlisting>
1572 1  set&lt;int&gt; s;
1573 2  for (int i = 0; i &lt; 10000000; ++i) {
1574 3    s.insert(i);
1575 4  }
1576 5  set&lt;int&gt; s1, s2;
1577 6  for (int i = 0; i &lt; 10000000; ++i) {
1578 7    s1.insert(i);
1579 8    s2.insert(i);
1580 9  }
1581...
1582      // Fast, better locality.
158310    for (set&lt;int&gt;::iterator it = s.begin(); it != s.end(); ++it) {
158411      sum += *it;
158512    }
1586      // Slow, elements are further apart.
158713    for (set&lt;int&gt;::iterator it = s1.begin(); it != s1.end(); ++it) {
158814      sum += *it;
158915    }
1590
1591foo.cc:5: advice: High scatter score NNN for set built here.  Consider changing
1592the allocation sequence or switching to a structure conscious allocator.
1593</programlisting>
1594</para></listitem>
1595</itemizedlist>
</sect3>

</sect2>


<sect2 id="manual.ext.profile_mode.analysis.mthread"
 xreflabel="Multithreaded Data Access">
<title>Multithreaded Data Access</title>

  <para>
  The diagnostics in this group are not meant to be implemented in the
  short term.  They require compiler support to know when container
  elements are written to; instrumentation alone can only tell us when
  elements are referenced.
  </para>

  <para><emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_MULTITHREADED</code>.
  </para>

<sect3 id="manual.ext.profile_mode.analysis.mthread.ddtest"
 xreflabel="Dependence Violations at Container Level">
<title>Data Dependence Violations at Container Level</title>
<itemizedlist>
  <listitem><para><emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_DDTEST</code>.
  </para></listitem>
  <listitem><para><emphasis>Goal:</emphasis> Detect container elements
  that are referenced from multiple threads in a parallel region or
  across parallel regions.
  </para></listitem>
  <listitem><para><emphasis>Fundamentals:</emphasis>
  Sharing data between threads requires communication and perhaps locking,
  which may be expensive.
  </para></listitem>
  <listitem><para><emphasis>Sample runtime reduction:</emphasis> ?%.
  </para></listitem>
  <listitem><para><emphasis>Recommendation:</emphasis> Change the data
  distribution or the parallel algorithm.</para></listitem>
  <listitem><para><emphasis>To instrument:</emphasis> Container access methods
  and iterators.
  </para></listitem>
  <listitem><para><emphasis>Analysis:</emphasis>
  Keep a shadow for each container.  Record iterator dereferences and
  container member accesses.  Issue advice for elements referenced by
  multiple threads.
  See paper: <ulink url="http://portal.acm.org/citation.cfm?id=207110.207148">
  The LRPD test: speculative run-time parallelization of loops with
  privatization and reduction parallelization</ulink>.
  </para></listitem>
  <listitem><para><emphasis>Cost model:</emphasis>
  Number of accesses to elements referenced from multiple threads.
  </para></listitem>
  <listitem><para><emphasis>Example:</emphasis>
<programlisting>
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

<sect3 id="manual.ext.profile_mode.analysis.mthread.false_share"
 xreflabel="False Sharing">
<title>False Sharing</title>
<itemizedlist>
  <listitem><para><emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_FALSE_SHARING</code>.
  </para></listitem>
  <listitem><para><emphasis>Goal:</emphasis> Detect elements in the
  same container which share a cache line, are written by at least one
  thread, and are accessed by different threads.
  </para></listitem>
  <listitem><para><emphasis>Fundamentals:</emphasis> Under these conditions,
  cache coherence protocols require communication to invalidate lines,
  which may be expensive.
  </para></listitem>
  <listitem><para><emphasis>Sample runtime reduction:</emphasis> 68%.
  </para></listitem>
  <listitem><para><emphasis>Recommendation:</emphasis> Reorganize the
  container or use padding to avoid false sharing.</para></listitem>
  <listitem><para><emphasis>To instrument:</emphasis> Container access methods
  and iterators.
  </para></listitem>
  <listitem><para><emphasis>Analysis:</emphasis>
  First, get the cache line size.
  For each shared container, record all the associated iterator dereferences
  and member access methods together with the thread id.  Compare the address
  lists across threads to detect references from two different threads to the
  same cache line.  Issue a warning only if the ratio to total references is
  significant.  Do the same for iterator dereference values if they are
  pointers.</para></listitem>
  <listitem><para><emphasis>Cost model:</emphasis>
  Number of accesses to the same cache line from different threads.
  </para></listitem>
  <listitem><para><emphasis>Example:</emphasis>
<programlisting>
1     vector&lt;int&gt; v(2, 0);
2 #pragma omp parallel for shared(v, SIZE) schedule(static, 1)
3     for (i = 0; i &lt; SIZE; ++i) {
4       v[i % 2] += i;
5     }

OMP_NUM_THREADS=2 ./a.out
foo.cc:1: advice: Change container structure or padding to avoid false
sharing in multithreaded access at foo.cc:4.  Detected N shared cache lines.
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

</sect2>


<sect2 id="manual.ext.profile_mode.analysis.statistics"
 xreflabel="Statistics">
<title>Statistics</title>

<para>
<emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_STATISTICS</code>.
</para>

<para>
  In some cases the cost model may not tell us anything, because the costs
  appear to offset the benefits.  Consider the choice between a vector and
  a list.  When there are both inserts and iteration, automatic advice
  may not be issued.  However, the programmer may still be able to make use
  of this information in a different way.
</para>
<para>
  This diagnostic will not issue any advice, but it will print statistics for
  each container construction site.  The statistics will contain the cost
  of each operation actually performed on the container.
</para>

</sect2>


</sect1>


<bibliography id="profile_mode.biblio">
<title>Bibliography</title>

  <biblioentry>
    <title>
      Perflint: A Context Sensitive Performance Advisor for C++ Programs
    </title>

    <author>
      <firstname>Lixia</firstname>
      <surname>Liu</surname>
    </author>
    <author>
      <firstname>Silvius</firstname>
      <surname>Rus</surname>
    </author>

    <copyright>
      <year>2009</year>
      <holder></holder>
    </copyright>

    <publisher>
      <publishername>
	Proceedings of the 2009 International Symposium on Code Generation
	and Optimization
      </publishername>
    </publisher>
  </biblioentry>
</bibliography>


</chapter>