profile_mode.xml revision 1.1
<?xml version='1.0'?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
 "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"
[ ]>

<chapter id="manual.ext.profile_mode" xreflabel="Profile Mode">
<?dbhtml filename="profile_mode.html"?>

<chapterinfo>
  <keywordset>
    <keyword>C++</keyword>
    <keyword>library</keyword>
    <keyword>profile</keyword>
  </keywordset>
</chapterinfo>

<title>Profile Mode</title>


<sect1 id="manual.ext.profile_mode.intro" xreflabel="Intro">
  <title>Intro</title>
  <para>
  <emphasis>Goal: </emphasis>Give performance improvement advice based on
  recognition of suboptimal usage patterns of the standard library.
  </para>

  <para>
  <emphasis>Method: </emphasis>Wrap the standard library code.  Insert
  calls to an instrumentation library to record the internal state of
  various components at interesting entry/exit points to/from the standard
  library.  Process the trace, recognize suboptimal patterns, give advice.
  For details, see the
  <ulink url="http://dx.doi.org/10.1109/CGO.2009.36">paper presented at
  CGO 2009</ulink>.
  </para>
  <para>
  <emphasis>Strengths: </emphasis>
<itemizedlist>
  <listitem><para>
  Unintrusive solution.  The application code does not require any
  modification.
  </para></listitem>
  <listitem><para>The advice is call context sensitive, and thus capable of
  identifying precisely the interesting dynamic performance behavior.
  </para></listitem>
  <listitem><para>
  The overhead model is pay-per-view.  When you turn off a diagnostic class
  at compile time, its overhead disappears.
  </para></listitem>
</itemizedlist>
  </para>
  <para>
  <emphasis>Drawbacks: </emphasis>
<itemizedlist>
  <listitem><para>
  You must recompile the application code with custom options.
  </para></listitem>
  <listitem><para>You must run the application on representative input.
  The advice is input dependent.
  </para></listitem>
  <listitem><para>
  The execution time will increase, in some cases by factors.
  </para></listitem>
</itemizedlist>
  </para>


<sect2 id="manual.ext.profile_mode.using" xreflabel="Using">
  <title>Using the Profile Mode</title>

  <para>
  This is the anticipated common workflow for program <code>foo.cc</code>:
<programlisting>
$ cat foo.cc
#include <vector>
int main() {
  std::vector<int> v;
  for (int k = 0; k < 1024; ++k) v.insert(v.begin(), k);
}

$ g++ -D_GLIBCXX_PROFILE foo.cc
$ ./a.out
$ cat libstdcxx-profile.txt
vector-to-list: improvement = 5: call stack = 0x804842c ...
    : advice = change std::vector to std::list
vector-size: improvement = 3: call stack = 0x804842c ...
    : advice = change initial container size from 0 to 1024
</programlisting>
  </para>

  <para>
  Anatomy of a warning:
  <itemizedlist>
  <listitem>
  <para>
  Warning id.  This is a short descriptive string for the class
  that this warning belongs to.  E.g., "vector-to-list".
  </para>
  </listitem>
  <listitem>
  <para>
  Estimated improvement.  This is an approximation of the benefit expected
  from implementing the change suggested by the warning.  It is given on
  a log10 scale.  Negative values mean that the alternative would actually
  do worse than the current choice.
  In the example above, the 5 comes from the fact that the overhead of
  inserting at the beginning of a vector vs. a list is around 1024 * 1024 / 2,
  which is around 10^5.  The improvement from setting the initial size to
  1024 is in the range of 10^3, since the overhead of dynamic resizing is
  linear in this case.
  </para>
  </listitem>
  <listitem>
  <para>
  Call stack.  Currently, the addresses are printed without
  symbol name or code location attribution.
  Users are expected to postprocess the output using, for instance, addr2line.
  </para>
  </listitem>
  <listitem>
  <para>
  The warning message.  For some warnings, this is static text, e.g.,
  "change vector to list".  For other warnings, such as the one above,
  the message contains numeric advice, e.g., the suggested initial size
  of the vector.
  </para>
  </listitem>
  </itemizedlist>
  </para>

  <para>Three files are generated.  <code>libstdcxx-profile.txt</code>
  contains human readable advice.  <code>libstdcxx-profile.raw</code>
  contains implementation-specific data about each diagnostic.
  Its format is not documented, but it is sufficient to generate
  all the advice given in <code>libstdcxx-profile.txt</code>.  The advantage
  of keeping this raw format is that traces from multiple executions can
  be aggregated simply by concatenating the raw traces.  We intend to
  offer an external utility program that can issue advice from a trace.
  <code>libstdcxx-profile.conf.out</code> lists the actual diagnostic
  parameters used.  To alter parameters, edit this file and rename it to
  <code>libstdcxx-profile.conf</code>.
  </para>

  <para>Advice is given regardless of whether the transformation is valid.
  For instance, we advise changing a map to an unordered_map even if the
  application semantics require that data be ordered.
  We believe such warnings can help users understand the performance
  behavior of their application better, which can lead to changes
  at a higher abstraction level.
  </para>

</sect2>

<sect2 id="manual.ext.profile_mode.tuning" xreflabel="Tuning">
  <title>Tuning the Profile Mode</title>

  <para>Compile time switches and environment variables (see also file
  profiler.h).  Unless specified otherwise, they can be set at compile time
  using -D_<name> or by setting variable <name>
  in the environment where the program is run, before starting execution.
  <itemizedlist>
  <listitem><para>
  <code>_GLIBCXX_PROFILE_NO_<diagnostic></code>:
  disable specific diagnostics.
  See section Diagnostics for possible values.
  (Environment variables not supported.)
  </para></listitem>
  <listitem><para>
  <code>_GLIBCXX_PROFILE_TRACE_PATH_ROOT</code>: set an alternative root
  path for the output files.
  </para></listitem>
  <listitem><para>
  <code>_GLIBCXX_PROFILE_MAX_WARN_COUNT</code>: set it to the maximum
  number of warnings desired.  The default value is 10.</para></listitem>
  <listitem><para>
  <code>_GLIBCXX_PROFILE_MAX_STACK_DEPTH</code>: if set to 0,
  the advice will
  be collected and reported for the program as a whole, and not for each
  call context.
  This could also be used in continuous regression tests, where you
  just need to know whether there is a regression or not.
  The default value is 32.
  </para></listitem>
  <listitem><para>
  <code>_GLIBCXX_PROFILE_MEM_PER_DIAGNOSTIC</code>:
  set a limit on how much memory to use for the accounting tables for each
  diagnostic type.  When this limit is reached, new events are ignored
  until the memory usage decreases under the limit.  Generally, this means
  that newly created containers will not be instrumented until some
  live containers are deleted.  The default is 128 MB.
  </para></listitem>
  <listitem><para>
  <code>_GLIBCXX_PROFILE_NO_THREADS</code>:
  Make the library not use threads.  If thread local storage (TLS) is not
  available, you will get a preprocessor error asking you to set
  -D_GLIBCXX_PROFILE_NO_THREADS if your program is single-threaded.
  Multithreaded execution without TLS is not supported.
  (Environment variable not supported.)
  </para></listitem>
  <listitem><para>
  <code>_GLIBCXX_HAVE_EXECINFO_H</code>:
  This name should be defined automatically at library configuration time.
  If your library was configured without <code>execinfo.h</code>, but
  you have it in your include path, you can define it explicitly.  Without
  it, advice is collected for the program as a whole, and not for each
  call context.
  (Environment variable not supported.)
  </para></listitem>
  </itemizedlist>
  </para>

</sect2>

</sect1>


<sect1 id="manual.ext.profile_mode.design" xreflabel="Design">
  <title>Design</title>

<para>
</para>
<table frame='all'>
<title>Profile Code Location</title>
<tgroup cols='2' align='left' colsep='1' rowsep='1'>
<colspec colname='c1'></colspec>
<colspec colname='c2'></colspec>

<thead>
  <row>
    <entry>Code Location</entry>
    <entry>Use</entry>
  </row>
</thead>
<tbody>
  <row>
    <entry><code>libstdc++-v3/include/std/*</code></entry>
    <entry>Preprocessor code to redirect to profile extension headers.</entry>
  </row>
  <row>
    <entry><code>libstdc++-v3/include/profile/*</code></entry>
    <entry>Profile extension public headers (map, vector, ...).</entry>
  </row>
  <row>
    <entry><code>libstdc++-v3/include/profile/impl/*</code></entry>
    <entry>Profile extension internals.  Implementation files are
    only included from <code>impl/profiler.h</code>, which is the only
    file included from the public headers.</entry>
  </row>
</tbody>
</tgroup>
</table>

<para>
</para>

<sect2 id="manual.ext.profile_mode.design.wrapper"
 xreflabel="Wrapper">
<title>Wrapper Model</title>
  <para>
  In order to get our instrumented library version included instead of the
  release one,
  we use the same wrapper model as the debug mode.
  We subclass entities from the release version.  Wherever
  <code>_GLIBCXX_PROFILE</code> is defined, the release namespace is
  <code>std::__norm</code>, whereas the profile namespace is
  <code>std::__profile</code>.
  Using plain <code>std</code> translates
  into <code>std::__profile</code>.
  </para>
  <para>
  Whenever possible, we try to wrap at the public interface level, e.g.,
  in <code>unordered_set</code> rather than in <code>hashtable</code>,
  in order not to depend on the implementation.
  </para>
  <para>
  Mixing object files built with and without the profile mode must
  not affect the program execution.  However, there are no guarantees of
  the accuracy of diagnostics when using even a single object not built with
  <code>-D_GLIBCXX_PROFILE</code>.
  Currently, mixing the profile mode with the debug and parallel extensions is
  not allowed.  Mixing them at compile time will result in preprocessor errors.
  Mixing them at link time is undefined.
  </para>
</sect2>


<sect2 id="manual.ext.profile_mode.design.instrumentation"
 xreflabel="Instrumentation">
<title>Instrumentation</title>
  <para>
  Instead of instrumenting every public entry and exit point,
  we chose to add instrumentation on demand, as needed
  by individual diagnostics.
  The main reason is that some diagnostics require us to extract bits of
  internal state that are particular only to that diagnostic.
  We plan to formalize this later, after we learn more about the requirements
  of several diagnostics.
  </para>
  <para>
  All the instrumentation points can be switched on and off using
  <code>-D_GLIBCXX_PROFILE_<diagnostic></code> and
  <code>-D_GLIBCXX_PROFILE_NO_<diagnostic></code> options.
  With all the instrumentation calls off, there should be negligible
  overhead over the release version.  This property is needed to support
  diagnostics based on timing of internal operations.  For such diagnostics,
  we anticipate turning most of the instrumentation off in order to prevent
  profiling overhead from polluting time measurements, and thus diagnostics.
  </para>
  <para>
  All the instrumentation on/off compile time switches live in
  <code>include/profile/profiler.h</code>.
  </para>
</sect2>


<sect2 id="manual.ext.profile_mode.design.rtlib"
 xreflabel="Run Time Behavior">
<title>Run Time Behavior</title>
  <para>
  For practical reasons, the instrumentation library processes the trace
  partially
  rather than dumping it to disk in raw form.  Each event is processed when
  it occurs.  A cost is usually attached to it, and it is aggregated into
  the database of a specific diagnostic class.  The cost model
  is based largely on the standard performance guarantees, but in some
  cases we use knowledge about GCC's standard library implementation.
  </para>
  <para>
  Information is indexed by (1) call stack and (2) instance id or address,
  to be able to understand and summarize precise creation-use-destruction
  dynamic chains.  Although the analysis is sensitive to dynamic instances,
  the reports are only sensitive to call context.  Whenever a dynamic instance
  is destroyed, we accumulate its effect into the corresponding entry for the
  call stack of its constructor location.
  </para>

  <para>
  For details, see the
  <ulink url="http://dx.doi.org/10.1109/CGO.2009.36">paper presented at
  CGO 2009</ulink>.
  </para>
</sect2>


<sect2 id="manual.ext.profile_mode.design.analysis"
 xreflabel="Analysis and Diagnostics">
<title>Analysis and Diagnostics</title>
  <para>
  Final analysis takes place offline, and it is based entirely on the
  generated trace and debugging info in the application binary.
  See section Diagnostics for a list of analysis types that we plan to support.
  </para>
  <para>
  The input to the analysis is a table indexed by profile type and call stack.
  The data type for each entry depends on the profile type.
  </para>
</sect2>


<sect2 id="manual.ext.profile_mode.design.cost-model"
 xreflabel="Cost Model">
<title>Cost Model</title>
  <para>
  While it is likely that cost models will become more complex as we get into
  more sophisticated analysis, we will try to follow a simple set of rules
  at the beginning.
  </para>
<itemizedlist>
  <listitem><para><emphasis>Relative benefit estimation:</emphasis>
  The idea is to estimate or measure the cost of all operations
  in the original scenario versus the scenario we advise switching to.
  For instance, when advising to change a vector to a list, an occurrence
  of the <code>insert</code> method will generally count as a benefit.
  Its magnitude depends on (1) the number of elements that get shifted
  and (2) whether it triggers a reallocation.
  </para></listitem>
  <listitem><para><emphasis>Synthetic measurements:</emphasis>
  We will measure the relative difference between similar operations on
  different containers.  We plan to write a battery of small tests that
  compare the times of the executions of similar methods on different
  containers.  The idea is to run these tests on the target machine.
  If this training phase is very quick, we may decide to perform it at
  library initialization time.  The results can be cached on disk and reused
  across runs.
  </para></listitem>
  <listitem><para><emphasis>Timers:</emphasis>
  We plan to use timers for operations of larger granularity, such as sort.
  For instance, we can switch between different sort methods on the fly
  and report the one that performs best for each call context.
  </para></listitem>
  <listitem><para><emphasis>Show stoppers:</emphasis>
  We may decide that the presence of an operation nullifies the advice.
  For instance, when considering switching from <code>set</code> to
  <code>unordered_set</code>, if we detect use of operator <code>++</code>,
  we will simply not issue the advice, since this could signal that the use
  case requires a sorted container.</para></listitem>
</itemizedlist>

</sect2>


<sect2 id="manual.ext.profile_mode.design.reports"
 xreflabel="Reports">
<title>Reports</title>
  <para>
There are two types of reports.  First, if we recognize a pattern for which
we have a substitute that is likely to give better performance, we print
the advice and the estimated performance gain.  The advice is usually
associated with a code position and possibly a call stack.
  </para>
  <para>
Second, we report performance characteristics for which we do not have
a clear solution for improvement.  For instance, we can point the user to
the top 10 <code>multimap</code> locations
which have the worst data locality in actual traversals.
Although this does not offer a solution,
it helps the user focus on the key problems and ignore the uninteresting ones.
  </para>
</sect2>


<sect2 id="manual.ext.profile_mode.design.testing"
 xreflabel="Testing">
<title>Testing</title>
  <para>
  First, we want to make sure we preserve the behavior of the release mode.
  You can just type <code>"make check-profile"</code>, which
  builds and runs the whole test suite in profile mode.
  </para>
  <para>
  Second, we want to test the correctness of each diagnostic.
  We created a <code>profile</code> directory in the test suite.
  Each diagnostic must come with at least two tests, one for false positives
  and one for false negatives.
  </para>
</sect2>

</sect1>

<sect1 id="manual.ext.profile_mode.api"
 xreflabel="API">
<title>Extensions for Custom Containers</title>

  <para>
  Many large projects use their own data structures instead of the ones in the
  standard library.  If these data structures are similar in functionality
  to the standard library, they can be instrumented with the same hooks
  that are used to instrument the standard library.
  The instrumentation API is exposed in file
  <code>profiler.h</code> (look for "Instrumentation hooks").
  </para>

</sect1>


<sect1 id="manual.ext.profile_mode.cost_model"
 xreflabel="Cost Model">
<title>Empirical Cost Model</title>

  <para>
  Currently, the cost model uses formulas with predefined relative weights
  for alternative containers or container implementations.  For instance,
  iterating through a vector is X times faster than iterating through a list.
  </para>
  <para>
  (Under development.)
  We are working on customizing this to a particular machine by providing
  an automated way to compute the actual relative weights for operations
  on the given machine.
  </para>
  <para>
  (Under development.)
  We plan to provide a performance parameter database format that can be
  filled in either by hand or by an automated training mechanism.
  The analysis module will then use this database instead of the built-in
  generic parameters.
  </para>

</sect1>


<sect1 id="manual.ext.profile_mode.implementation"
 xreflabel="Implementation">
<title>Implementation Issues</title>


<sect2 id="manual.ext.profile_mode.implementation.stack"
 xreflabel="Stack Traces">
<title>Stack Traces</title>
  <para>
  Accurate stack traces are needed during profiling since we group events by
  call context and dynamic instance.  Without accurate traces, diagnostics
  may be hard to interpret.
  For instance, when giving advice to the user
  it is imperative to reference application code, not library code.
  </para>
  <para>
  Currently we are using the libc <code>backtrace</code> routine to get
  stack traces.
  <code>_GLIBCXX_PROFILE_STACK_DEPTH</code> can be set
  to 0 if you are willing to give up call context information, or to a small
  positive value to reduce run time overhead.
  </para>
</sect2>


<sect2 id="manual.ext.profile_mode.implementation.symbols"
 xreflabel="Symbolization">
<title>Symbolization of Instruction Addresses</title>
  <para>
  The profiling and analysis phases use only instruction addresses.
  An external utility such as addr2line is needed to postprocess the result.
  We do not plan to add symbolization support in the profile extension.
  This would require access to symbol tables, debug information tables,
  external programs or libraries and other system-dependent information.
  </para>
</sect2>


<sect2 id="manual.ext.profile_mode.implementation.concurrency"
 xreflabel="Concurrency">
<title>Concurrency</title>
  <para>
  Our current model is simplistic, but precise.
  We cannot afford to approximate because some of our diagnostics require
  precise matching of operations to container instance and call context.
  During profiling, we keep a single information table per diagnostic.
  There is a single lock per information table.
  </para>
</sect2>


<sect2 id="manual.ext.profile_mode.implementation.stdlib-in-proflib"
 xreflabel="Using the Standard Library in the Runtime Library">
<title>Using the Standard Library in the Instrumentation Implementation</title>
  <para>
  As much as we would like to avoid uses of libstdc++ within our
  instrumentation library, containers such as unordered_map are very
  appealing.  We plan to use them as long as they are named properly
  to avoid ambiguity.
  </para>
</sect2>


<sect2 id="manual.ext.profile_mode.implementation.malloc-hooks"
 xreflabel="Malloc Hooks">
<title>Malloc Hooks</title>
  <para>
  User applications/libraries can provide malloc hooks.
  When the implementation of the malloc hooks uses libstdc++, there can
  be an infinite cycle between the profile mode instrumentation and the
  malloc hook code.
  </para>
  <para>
  We protect against reentrance to the profile mode instrumentation code,
  which should avoid this problem in most cases.
  The protection mechanism is thread safe and exception safe.
  This mechanism does not prevent reentrance to the malloc hook itself,
  which could still result in deadlock, if, for instance, the malloc hook
  uses non-recursive locks.
  XXX: A definitive solution to this problem would be for the profile extension
  to use a custom allocator internally, and perhaps not to use libstdc++.
  </para>
</sect2>


<sect2 id="manual.ext.profile_mode.implementation.construction-destruction"
 xreflabel="Construction and Destruction of Global Objects">
<title>Construction and Destruction of Global Objects</title>
  <para>
  The profiling library state is initialized at the first call to a profiling
  method.  This allows us to record the construction of all global objects.
  However, we cannot do the same at destruction time.  The trace is written
  by a function registered by <code>atexit</code>, and thus invoked by
  <code>exit</code>.
  </para>
</sect2>

</sect1>


<sect1 id="manual.ext.profile_mode.developer"
 xreflabel="Developer Information">
<title>Developer Information</title>

<sect2 id="manual.ext.profile_mode.developer.bigpic"
 xreflabel="Big Picture">
<title>Big Picture</title>

  <para>The profile mode headers are included with
  <code>-D_GLIBCXX_PROFILE</code> through preprocessor directives in
  <code>include/std/*</code>.
  </para>

  <para>Instrumented implementations are provided in
  <code>include/profile/*</code>.  All instrumentation hooks are macros
  defined in <code>include/profile/profiler.h</code>.
  </para>

  <para>The entire implementation of the instrumentation hooks is in
  <code>include/profile/impl/*</code>.  Although all the code gets included,
  and is thus publicly visible, only a small number of functions are called
  from outside this directory.  All calls to hook implementations must be
  done through macros defined in <code>profiler.h</code>.  The macro
  must ensure (1) that the call is guarded against reentrance and
  (2) that the call can be turned off at compile time using a
  <code>-D_GLIBCXX_PROFILE_...</code> compiler option.
  </para>

</sect2>

<sect2 id="manual.ext.profile_mode.developer.howto"
 xreflabel="How To Add A Diagnostic">
<title>How To Add A Diagnostic</title>

  <para>Let's say the diagnostic name is "magic".
  </para>

  <para>If you need to instrument a header not already under
  <code>include/profile/*</code>, first edit the corresponding header
  under <code>include/std/</code> and add a preprocessor directive such
  as the one in <code>include/std/vector</code>:
<programlisting>
#ifdef _GLIBCXX_PROFILE
# include <profile/vector>
#endif
</programlisting>
  </para>

  <para>If the file you need to instrument is not yet under
  <code>include/profile/</code>, make a copy of the one in
  <code>include/debug</code>, or of the main implementation.
  You'll need to include the main implementation and inherit the classes
  you want to instrument.  Then define the methods you want to instrument,
  define the instrumentation hooks and add calls to them.
  Look at <code>include/profile/vector</code> for an example.
  </para>

  <para>Add macros for the instrumentation hooks in
  <code>include/profile/impl/profiler.h</code>.
  Hook names must start with <code>__profcxx_</code>.
  Make sure they expand
  to no code with <code>-D_GLIBCXX_PROFILE_NO_MAGIC</code>.
  Make sure all calls to any method in namespace <code>__gnu_profile</code>
  are protected against reentrance using macro
  <code>_GLIBCXX_PROFILE_REENTRANCE_GUARD</code>.
  All names of methods in namespace <code>__gnu_profile</code> called from
  <code>profiler.h</code> must start with <code>__trace_magic_</code>.
  </para>

  <para>Add the implementation of the diagnostic.
  <itemizedlist>
  <listitem><para>
  Create a new file <code>include/profile/impl/profiler_magic.h</code>.
  </para></listitem>
  <listitem><para>
  Define class <code>__magic_info: public __object_info_base</code>.
  This is the representation of a line in the object table.
  The <code>__merge</code> method is used to aggregate information
  across all dynamic instances created at the same call context.
  The <code>__magnitude</code> method must return the estimation of the
  benefit as a number of small operations, e.g., number of words copied.
  The <code>__write</code> method is used to produce the raw trace.
  The <code>__advice</code> method is used to produce the advice string.
  </para></listitem>
  <listitem><para>
  Define class <code>__magic_stack_info: public __magic_info</code>.
  This defines the content of a line in the stack table.
  </para></listitem>
  <listitem><para>
  Define class <code>__trace_magic: public __trace_base<__magic_info,
  __magic_stack_info></code>.
  It defines the content of the trace associated with this diagnostic.
  </para></listitem>
  </itemizedlist>
  </para>

  <para>Add initialization and reporting calls in
  <code>include/profile/impl/profiler_trace.h</code>.  Use
  <code>__trace_vector_to_list</code> as an example.
  </para>

  <para>Add documentation in file <code>doc/xml/manual/profile_mode.xml</code>.
  </para>
</sect2>
</sect1>

<sect1 id="manual.ext.profile_mode.diagnostics">
<title>Diagnostics</title>

  <para>
  The table below presents all the diagnostics we intend to implement.
  Each diagnostic has a corresponding compile time switch
  <code>-D_GLIBCXX_PROFILE_<diagnostic></code>.
  Groups of related diagnostics can be turned on with a single switch.
  For instance, <code>-D_GLIBCXX_PROFILE_LOCALITY</code> is equivalent to
  <code>-D_GLIBCXX_PROFILE_SOFTWARE_PREFETCH
  -D_GLIBCXX_PROFILE_RBTREE_LOCALITY</code>.
  </para>

  <para>
  The benefit, cost, expected frequency and accuracy of each diagnostic
  were each given a grade from 1 to 10, where 10 is highest.
  A high benefit means that, if the diagnostic is accurate, the expected
  performance improvement is high.
  A high cost means that turning this diagnostic on leads to high slowdown.
  A high frequency means that we expect this to occur relatively often.
  A high accuracy means that the diagnostic is unlikely to be wrong.
  These grades are not perfect.  They are just meant to guide users with
  specific needs or time budgets.
  </para>

<table frame='all'>
<title>Profile Diagnostics</title>
<tgroup cols='7' align='left' colsep='1' rowsep='1'>
<colspec colname='c1'></colspec>
<colspec colname='c2'></colspec>
<colspec colname='c3'></colspec>
<colspec colname='c4'></colspec>
<colspec colname='c5'></colspec>
<colspec colname='c6'></colspec>
<colspec colname='c7'></colspec>

<thead>
  <row>
    <entry>Group</entry>
    <entry>Flag</entry>
    <entry>Benefit</entry>
    <entry>Cost</entry>
    <entry>Freq.</entry>
    <entry>Accuracy</entry>
    <entry>Implemented</entry>
  </row>
</thead>
<tbody>
  <row>
    <entry><ulink url="#manual.ext.profile_mode.analysis.containers">
    CONTAINERS</ulink></entry>
    <entry><ulink url="#manual.ext.profile_mode.analysis.hashtable_too_small">
    HASHTABLE_TOO_SMALL</ulink></entry>
    <entry>10</entry>
    <entry>1</entry>
    <entry></entry>
    <entry>10</entry>
    <entry>yes</entry>
  </row>
  <row>
    <entry></entry>
    <entry><ulink url="#manual.ext.profile_mode.analysis.hashtable_too_large">
    HASHTABLE_TOO_LARGE</ulink></entry>
    <entry>5</entry>
    <entry>1</entry>
    <entry></entry>
    <entry>10</entry>
    <entry>yes</entry>
  </row>
  <row>
    <entry></entry>
    <entry><ulink url="#manual.ext.profile_mode.analysis.inefficient_hash">
    INEFFICIENT_HASH</ulink></entry>
    <entry>7</entry>
    <entry>3</entry>
    <entry></entry>
    <entry>10</entry>
    <entry>yes</entry>
  </row>
  <row>
    <entry></entry>
    <entry><ulink url="#manual.ext.profile_mode.analysis.vector_too_small">
    VECTOR_TOO_SMALL</ulink></entry>
    <entry>8</entry>
    <entry>1</entry>
    <entry></entry>
    <entry>10</entry>
    <entry>yes</entry>
  </row>
  <row>
    <entry></entry>
    <entry><ulink url="#manual.ext.profile_mode.analysis.vector_too_large">
    VECTOR_TOO_LARGE</ulink></entry>
    <entry>5</entry>
    <entry>1</entry>
    <entry></entry>
    <entry>10</entry>
    <entry>yes</entry>
  </row>
  <row>
    <entry></entry>
    <entry><ulink url="#manual.ext.profile_mode.analysis.vector_to_hashtable">
    VECTOR_TO_HASHTABLE</ulink></entry>
    <entry>7</entry>
    <entry>7</entry>
    <entry></entry>
    <entry>10</entry>
    <entry>no</entry>
  </row>
  <row>
    <entry></entry>
    <entry><ulink url="#manual.ext.profile_mode.analysis.hashtable_to_vector">
    HASHTABLE_TO_VECTOR</ulink></entry>
    <entry>7</entry>
    <entry>7</entry>
    <entry></entry>
    <entry>10</entry>
    <entry>no</entry>
  </row>
  <row>
    <entry></entry>
    <entry><ulink url="#manual.ext.profile_mode.analysis.vector_to_list">
    VECTOR_TO_LIST</ulink></entry>
    <entry>8</entry>
    <entry>5</entry>
    <entry></entry>
    <entry>10</entry>
    <entry>yes</entry>
  </row>
  <row>
    <entry></entry>
    <entry><ulink url="#manual.ext.profile_mode.analysis.list_to_vector">
    LIST_TO_VECTOR</ulink></entry>
    <entry>10</entry>
    <entry>5</entry>
    <entry></entry>
    <entry>10</entry>
    <entry>no</entry>
  </row>
  <row>
    <entry></entry>
    <entry><ulink url="#manual.ext.profile_mode.analysis.assoc_ord_to_unord">
    ORDERED_TO_UNORDERED</ulink></entry>
    <entry>10</entry>
    <entry>5</entry>
    <entry></entry>
    <entry>10</entry>
    <entry>only map/unordered_map</entry>
  </row>
  <row>
    <entry><ulink url="#manual.ext.profile_mode.analysis.algorithms">
    ALGORITHMS</ulink></entry>
    <entry><ulink url="#manual.ext.profile_mode.analysis.algorithms.sort">
    SORT</ulink></entry>
    <entry>7</entry>
    <entry>8</entry>
    <entry></entry>
    <entry>7</entry>
    <entry>no</entry>
  </row>
  <row>
    <entry><ulink url="#manual.ext.profile_mode.analysis.locality">
    LOCALITY</ulink></entry>
    <entry><ulink url="#manual.ext.profile_mode.analysis.locality.sw_prefetch">
    SOFTWARE_PREFETCH</ulink></entry>
    <entry>8</entry>
    <entry>8</entry>
    <entry></entry>
    <entry>5</entry>
    <entry>no</entry>
  </row>
  <row>
    <entry></entry>
    <entry><ulink url="#manual.ext.profile_mode.analysis.locality.linked">
    RBTREE_LOCALITY</ulink></entry>
    <entry>4</entry>
    <entry>8</entry>
    <entry></entry>
    <entry>5</entry>
    <entry>no</entry>
  </row>
  <row>
    <entry></entry>
    <entry><ulink url="#manual.ext.profile_mode.analysis.mthread.false_share">
    FALSE_SHARING</ulink></entry>
    <entry>8</entry>
    <entry>10</entry>
    <entry></entry>
    <entry>10</entry>
    <entry>no</entry>
  </row>
</tbody>
</tgroup>
</table>

<sect2 id="manual.ext.profile_mode.analysis.template"
       xreflabel="Template">
<title>Diagnostic Template</title>
<itemizedlist>
  <listitem><para><emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_&lt;diagnostic&gt;</code>.
  </para></listitem>
  <listitem><para><emphasis>Goal:</emphasis> What problem will it diagnose?
  </para></listitem>
  <listitem><para><emphasis>Fundamentals:</emphasis>
  What is the fundamental reason why this is a problem?</para></listitem>
  <listitem><para><emphasis>Sample runtime reduction:</emphasis>
  Percentage reduction in execution time. When the reduction is more than
  a constant factor, describe the reduction rate formula.
  </para></listitem>
  <listitem><para><emphasis>Recommendation:</emphasis>
  What would the advice look like?</para></listitem>
  <listitem><para><emphasis>To instrument:</emphasis>
  What libstdc++ components need to be instrumented?</para></listitem>
  <listitem><para><emphasis>Analysis:</emphasis>
  How do we decide when to issue the advice?</para></listitem>
  <listitem><para><emphasis>Cost model:</emphasis>
  How do we measure benefits? Math goes here.</para></listitem>
  <listitem><para><emphasis>Example:</emphasis>
<programlisting>
program code
...
advice sample
</programlisting>
</para></listitem>
</itemizedlist>
</sect2>


<sect2 id="manual.ext.profile_mode.analysis.containers"
       xreflabel="Containers">
<title>Containers</title>

<para>
<emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_CONTAINERS</code>.
</para>

<sect3 id="manual.ext.profile_mode.analysis.hashtable_too_small"
       xreflabel="Hashtable Too Small">
<title>Hashtable Too Small</title>
<itemizedlist>
  <listitem><para><emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_HASHTABLE_TOO_SMALL</code>.
  </para></listitem>
  <listitem><para><emphasis>Goal:</emphasis> Detect hashtables with many
  rehash operations, small construction size and large destruction size.
  </para></listitem>
  <listitem><para><emphasis>Fundamentals:</emphasis> Rehash is very expensive:
  it must read the contents, follow chains within each bucket, re-evaluate the
  hash function and place elements at new locations in a different
  order.</para></listitem>
  <listitem><para><emphasis>Sample runtime reduction:</emphasis> 36%.
  Code similar to example below.
  </para></listitem>
  <listitem><para><emphasis>Recommendation:</emphasis>
  Set initial size to N at construction site S.
  </para></listitem>
  <listitem><para><emphasis>To instrument:</emphasis>
  <code>unordered_set, unordered_map</code> constructor, destructor, rehash.
  </para></listitem>
  <listitem><para><emphasis>Analysis:</emphasis>
  For each dynamic instance of <code>unordered_[multi]set|map</code>,
  record initial size and call context of the constructor.
  Record size increase, if any, after each relevant operation such as insert.
  Record the estimated rehash cost.</para></listitem>
  <listitem><para><emphasis>Cost model:</emphasis>
  Number of individual rehash operations * cost per rehash.</para></listitem>
  <listitem><para><emphasis>Example:</emphasis>
<programlisting>
1 unordered_set&lt;int&gt; us;
2 for (int k = 0; k &lt; 1000000; ++k) {
3   us.insert(k);
4 }

foo.cc:1: advice: Changing initial unordered_set size from 10 to 1000000 saves 1025530 rehash operations.
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>


<sect3 id="manual.ext.profile_mode.analysis.hashtable_too_large"
       xreflabel="Hashtable Too Large">
<title>Hashtable Too Large</title>
<itemizedlist>
  <listitem><para><emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_HASHTABLE_TOO_LARGE</code>.
  </para></listitem>
  <listitem><para><emphasis>Goal:</emphasis> Detect hashtables which are
  never filled up because fewer elements than reserved are ever
  inserted.
  </para></listitem>
  <listitem><para><emphasis>Fundamentals:</emphasis> Save memory, which
  is good in itself and may also improve memory reference performance through
  fewer cache and TLB misses.</para></listitem>
  <listitem><para><emphasis>Sample runtime reduction:</emphasis> unknown.
  </para></listitem>
  <listitem><para><emphasis>Recommendation:</emphasis>
  Set initial size to N at construction site S.
  </para></listitem>
  <listitem><para><emphasis>To instrument:</emphasis>
  <code>unordered_set, unordered_map</code> constructor, destructor, rehash.
  </para></listitem>
  <listitem><para><emphasis>Analysis:</emphasis>
  For each dynamic instance of <code>unordered_[multi]set|map</code>,
  record initial size and call context of the constructor, and correlate it
  with its size at destruction time.
  </para></listitem>
  <listitem><para><emphasis>Cost model:</emphasis>
  Number of iteration operations + memory saved.</para></listitem>
  <listitem><para><emphasis>Example:</emphasis>
<programlisting>
1 vector&lt;unordered_set&lt;int&gt;&gt; v(100000, unordered_set&lt;int&gt;(100));
2 for (int k = 0; k &lt; 100000; ++k) {
3   for (int j = 0; j &lt; 10; ++j) {
4     v[k].insert(k + j);
5   }
6 }

foo.cc:1: advice: Changing initial unordered_set size from 100 to 10 saves N
bytes of memory and M iteration steps.
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

<sect3 id="manual.ext.profile_mode.analysis.inefficient_hash"
       xreflabel="Inefficient Hash">
<title>Inefficient Hash</title>
<itemizedlist>
  <listitem><para><emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_INEFFICIENT_HASH</code>.
  </para></listitem>
  <listitem><para><emphasis>Goal:</emphasis> Detect hashtables with polarized
  distribution.
  </para></listitem>
  <listitem><para><emphasis>Fundamentals:</emphasis> A non-uniform
  distribution may lead to long chains, thus possibly increasing complexity
  by a factor up to the number of elements.
  </para></listitem>
  <listitem><para><emphasis>Sample runtime reduction:</emphasis> factor up
  to container size.
  </para></listitem>
  <listitem><para><emphasis>Recommendation:</emphasis> Change hash function
  for container built at site S. Distribution score = N. Access score = S.
  Longest chain = C, in bucket B.
  </para></listitem>
  <listitem><para><emphasis>To instrument:</emphasis>
  <code>unordered_set, unordered_map</code> constructor, destructor, [],
  insert, iterator.
  </para></listitem>
  <listitem><para><emphasis>Analysis:</emphasis>
  Count the exact number of link traversals.
  </para></listitem>
  <listitem><para><emphasis>Cost model:</emphasis>
  Total number of links traversed.</para></listitem>
  <listitem><para><emphasis>Example:</emphasis>
<programlisting>
class dumb_hash {
 public:
  size_t operator() (int i) const { return 0; }
};
...
  unordered_set&lt;int, dumb_hash&gt; hs;
  ...
  for (int i = 0; i &lt; COUNT; ++i) {
    hs.find(i);
  }
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

<sect3 id="manual.ext.profile_mode.analysis.vector_too_small"
       xreflabel="Vector Too Small">
<title>Vector Too Small</title>
<itemizedlist>
  <listitem><para><emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_VECTOR_TOO_SMALL</code>.
  </para></listitem>
  <listitem><para><emphasis>Goal:</emphasis> Detect vectors with many
  resize operations, small construction size and large destruction size.
  </para></listitem>
  <listitem><para><emphasis>Fundamentals:</emphasis> Resizing can be expensive.
  Copying large amounts of data takes time. Resizing many small vectors may
  have allocation overhead and affect locality.</para></listitem>
  <listitem><para><emphasis>Sample runtime reduction:</emphasis> %.
  </para></listitem>
  <listitem><para><emphasis>Recommendation:</emphasis>
  Set initial size to N at construction site S.</para></listitem>
  <listitem><para><emphasis>To instrument:</emphasis> <code>vector</code>.
  </para></listitem>
  <listitem><para><emphasis>Analysis:</emphasis>
  For each dynamic instance of <code>vector</code>,
  record initial size and call context of the constructor.
  Record size increase, if any, after each relevant operation such as
  <code>push_back</code>. Record the estimated resize cost.
  </para></listitem>
  <listitem><para><emphasis>Cost model:</emphasis>
  Total number of words copied * time to copy a word.</para></listitem>
  <listitem><para><emphasis>Example:</emphasis>
<programlisting>
1 vector&lt;int&gt; v;
2 for (int k = 0; k &lt; 1000000; ++k) {
3   v.push_back(k);
4 }

foo.cc:1: advice: Changing initial vector size from 10 to 1000000 saves
copying 4000000 bytes and 20 memory allocations and deallocations.
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

<sect3 id="manual.ext.profile_mode.analysis.vector_too_large"
       xreflabel="Vector Too Large">
<title>Vector Too Large</title>
<itemizedlist>
  <listitem><para><emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_VECTOR_TOO_LARGE</code>.
  </para></listitem>
  <listitem><para><emphasis>Goal:</emphasis> Detect vectors which are
  never filled up because fewer elements than reserved are ever
  inserted.
  </para></listitem>
  <listitem><para><emphasis>Fundamentals:</emphasis> Save memory, which
  is good in itself and may also improve memory reference performance through
  fewer cache and TLB misses.</para></listitem>
  <listitem><para><emphasis>Sample runtime reduction:</emphasis> %.
  </para></listitem>
  <listitem><para><emphasis>Recommendation:</emphasis>
  Set initial size to N at construction site S.</para></listitem>
  <listitem><para><emphasis>To instrument:</emphasis> <code>vector</code>.
  </para></listitem>
  <listitem><para><emphasis>Analysis:</emphasis>
  For each dynamic instance of <code>vector</code>,
  record initial size and call context of the constructor, and correlate it
  with its size at destruction time.</para></listitem>
  <listitem><para><emphasis>Cost model:</emphasis>
  Total amount of memory saved.</para></listitem>
  <listitem><para><emphasis>Example:</emphasis>
<programlisting>
1 vector&lt;vector&lt;int&gt;&gt; v(100000, vector&lt;int&gt;(100));
2 for (int k = 0; k &lt; 100000; ++k) {
3   for (int j = 0; j &lt; 10; ++j) {
4     v[k].push_back(k + j);
5   }
6 }

foo.cc:1: advice: Changing initial vector size from 100 to 10 saves N
bytes of memory and may reduce the number of cache and TLB misses.
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

<sect3 id="manual.ext.profile_mode.analysis.vector_to_hashtable"
       xreflabel="Vector to Hashtable">
<title>Vector to Hashtable</title>
<itemizedlist>
  <listitem><para><emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_VECTOR_TO_HASHTABLE</code>.
  </para></listitem>
  <listitem><para><emphasis>Goal:</emphasis> Detect uses of
  <code>vector</code> that can be substituted with <code>unordered_set</code>
  to reduce execution time.
  </para></listitem>
  <listitem><para><emphasis>Fundamentals:</emphasis>
  Linear search in a vector is very expensive, whereas searching in a hashtable
  is very quick.</para></listitem>
  <listitem><para><emphasis>Sample runtime reduction:</emphasis> factor up
  to container size.
  </para></listitem>
  <listitem><para><emphasis>Recommendation:</emphasis> Replace
  <code>vector</code> with <code>unordered_set</code> at site S.
  </para></listitem>
  <listitem><para><emphasis>To instrument:</emphasis> <code>vector</code>
  operations and access methods.</para></listitem>
  <listitem><para><emphasis>Analysis:</emphasis>
  For each dynamic instance of <code>vector</code>,
  record call context of the constructor. Issue the advice only if the
  only methods called on this <code>vector</code> are <code>push_back</code>,
  <code>insert</code> and <code>find</code>.
  </para></listitem>
  <listitem><para><emphasis>Cost model:</emphasis>
  (Cost(vector::push_back) + cost(vector::insert) + cost(vector::find)) -
  (cost(unordered_set::insert) + cost(unordered_set::find)).
  </para></listitem>
  <listitem><para><emphasis>Example:</emphasis>
<programlisting>
1 vector&lt;int&gt; v;
...
2 for (int i = 0; i &lt; 1000; ++i) {
3   find(v.begin(), v.end(), i);
4 }

foo.cc:1: advice: Changing "vector" to "unordered_set" will save about 500,000
comparisons.
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

<sect3 id="manual.ext.profile_mode.analysis.hashtable_to_vector"
       xreflabel="Hashtable to Vector">
<title>Hashtable to Vector</title>
<itemizedlist>
  <listitem><para><emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_HASHTABLE_TO_VECTOR</code>.
  </para></listitem>
  <listitem><para><emphasis>Goal:</emphasis> Detect uses of
  <code>unordered_set</code> that can be substituted with <code>vector</code>
  to reduce execution time.
  </para></listitem>
  <listitem><para><emphasis>Fundamentals:</emphasis>
  A hashtable iterator is slower than a vector iterator.</para></listitem>
  <listitem><para><emphasis>Sample runtime reduction:</emphasis> 95%.
  </para></listitem>
  <listitem><para><emphasis>Recommendation:</emphasis> Replace
  <code>unordered_set</code> with <code>vector</code> at site S.
  </para></listitem>
  <listitem><para><emphasis>To instrument:</emphasis> <code>unordered_set</code>
  operations and access methods.</para></listitem>
  <listitem><para><emphasis>Analysis:</emphasis>
  For each dynamic instance of <code>unordered_set</code>,
  record call context of the constructor. Issue the advice only if the
  number of <code>find</code>, <code>insert</code> and <code>[]</code>
  operations on this <code>unordered_set</code> is small relative to the
  number of elements, and methods <code>begin</code> or <code>end</code>
  are invoked (suggesting iteration).</para></listitem>
  <listitem><para><emphasis>Cost model:</emphasis>
  Number of indirections saved during iteration.</para></listitem>
  <listitem><para><emphasis>Example:</emphasis>
<programlisting>
1 unordered_set&lt;int&gt; us;
...
2 int s = 0;
3 for (unordered_set&lt;int&gt;::iterator it = us.begin(); it != us.end(); ++it) {
4   s += *it;
5 }

foo.cc:1: advice: Changing "unordered_set" to "vector" will save about N
indirections and may achieve better data locality.
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

<sect3 id="manual.ext.profile_mode.analysis.vector_to_list"
       xreflabel="Vector to List">
<title>Vector to List</title>
<itemizedlist>
  <listitem><para><emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_VECTOR_TO_LIST</code>.
  </para></listitem>
  <listitem><para><emphasis>Goal:</emphasis> Detect cases where
  <code>vector</code> could be substituted with <code>list</code> for
  better performance.
  </para></listitem>
  <listitem><para><emphasis>Fundamentals:</emphasis>
  Inserting in the middle of a vector is expensive compared to inserting in a
  list.
  </para></listitem>
  <listitem><para><emphasis>Sample runtime reduction:</emphasis> factor up to
  container size.
  </para></listitem>
  <listitem><para><emphasis>Recommendation:</emphasis> Replace vector with list
  at site S.</para></listitem>
  <listitem><para><emphasis>To instrument:</emphasis> <code>vector</code>
  operations and access methods.</para></listitem>
  <listitem><para><emphasis>Analysis:</emphasis>
  For each dynamic instance of <code>vector</code>,
  record the call context of the constructor. Record the overhead of each
  <code>insert</code> operation based on current size and insert position.
  Report instances with high insertion overhead.
  </para></listitem>
  <listitem><para><emphasis>Cost model:</emphasis>
  (Sum(cost(vector::method)) - Sum(cost(list::method)), for
  method in [push_back, insert, erase])
  + (Cost(iterate vector) - Cost(iterate list))</para></listitem>
  <listitem><para><emphasis>Example:</emphasis>
<programlisting>
1 vector&lt;int&gt; v;
2 for (int i = 0; i &lt; 10000; ++i) {
3   v.insert(v.begin(), i);
4 }

foo.cc:1: advice: Changing "vector" to "list" will save about 5,000,000
operations.
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

<sect3 id="manual.ext.profile_mode.analysis.list_to_vector"
       xreflabel="List to Vector">
<title>List to Vector</title>
<itemizedlist>
  <listitem><para><emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_LIST_TO_VECTOR</code>.
  </para></listitem>
  <listitem><para><emphasis>Goal:</emphasis> Detect cases where
  <code>list</code> could be substituted with <code>vector</code> for
  better performance.
  </para></listitem>
  <listitem><para><emphasis>Fundamentals:</emphasis>
  Iterating through a vector is faster than through a list.
  </para></listitem>
  <listitem><para><emphasis>Sample runtime reduction:</emphasis> 64%.
  </para></listitem>
  <listitem><para><emphasis>Recommendation:</emphasis> Replace list with vector
  at site S.</para></listitem>
  <listitem><para><emphasis>To instrument:</emphasis> <code>list</code>
  operations and access methods.</para></listitem>
  <listitem><para><emphasis>Analysis:</emphasis>
  Issue the advice if there are no <code>insert</code> operations.
  </para></listitem>
  <listitem><para><emphasis>Cost model:</emphasis>
  (Sum(cost(vector::method)) - Sum(cost(list::method)), for
  method in [push_back, insert, erase])
  + (Cost(iterate vector) - Cost(iterate list))</para></listitem>
  <listitem><para><emphasis>Example:</emphasis>
<programlisting>
1 list&lt;int&gt; l;
...
2 int sum = 0;
3 for (list&lt;int&gt;::iterator it = l.begin(); it != l.end(); ++it) {
4   sum += *it;
5 }

foo.cc:1: advice: Changing "list" to "vector" will save about 1000000 indirect
memory references.
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

<sect3 id="manual.ext.profile_mode.analysis.list_to_slist"
       xreflabel="List to Forward List">
<title>List to Forward List (Slist)</title>
<itemizedlist>
  <listitem><para><emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_LIST_TO_SLIST</code>.
  </para></listitem>
  <listitem><para><emphasis>Goal:</emphasis> Detect cases where
  <code>list</code> could be substituted with <code>forward_list</code> for
  better performance.
  </para></listitem>
  <listitem><para><emphasis>Fundamentals:</emphasis>
  The memory footprint of a forward_list is smaller than that of a list.
  This has beneficial effects on the memory subsystem, e.g., fewer cache misses.
  </para></listitem>
  <listitem><para><emphasis>Sample runtime reduction:</emphasis> 40%.
  Note that the reduction is only noticeable if the size of the list node
  is in fact larger than that of the forward_list node.
  For memory allocators
  with size classes, you will only notice an effect when the two node sizes
  belong to different allocator size classes.
  </para></listitem>
  <listitem><para><emphasis>Recommendation:</emphasis> Replace list with
  forward_list at site S.</para></listitem>
  <listitem><para><emphasis>To instrument:</emphasis> <code>list</code>
  operations and iteration methods.</para></listitem>
  <listitem><para><emphasis>Analysis:</emphasis>
  Issue the advice if there are no backward traversals
  or insertions before a given node.
  </para></listitem>
  <listitem><para><emphasis>Cost model:</emphasis>
  Always true.</para></listitem>
  <listitem><para><emphasis>Example:</emphasis>
<programlisting>
1 list&lt;int&gt; l;
...
2 int sum = 0;
3 for (list&lt;int&gt;::iterator it = l.begin(); it != l.end(); ++it) {
4   sum += *it;
5 }

foo.cc:1: advice: Change "list" to "forward_list".
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

<sect3 id="manual.ext.profile_mode.analysis.assoc_ord_to_unord"
       xreflabel="Ordered to Unordered Associative Container">
<title>Ordered to Unordered Associative Container</title>
<itemizedlist>
  <listitem><para><emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_ORDERED_TO_UNORDERED</code>.
  </para></listitem>
  <listitem><para><emphasis>Goal:</emphasis> Detect cases where ordered
  associative containers can be replaced with unordered ones.
  </para></listitem>
  <listitem><para><emphasis>Fundamentals:</emphasis>
  Insert and search are quicker in a hashtable than in
  a red-black tree.</para></listitem>
  <listitem><para><emphasis>Sample runtime reduction:</emphasis> 52%.
  </para></listitem>
  <listitem><para><emphasis>Recommendation:</emphasis>
  Replace set with unordered_set at site S.</para></listitem>
  <listitem><para><emphasis>To instrument:</emphasis>
  <code>set</code>, <code>multiset</code>, <code>map</code>,
  <code>multimap</code> methods.</para></listitem>
  <listitem><para><emphasis>Analysis:</emphasis>
  Issue the advice only if we are not using operator <code>++</code> on any
  iterator on a particular <code>[multi]set|map</code>.
  </para></listitem>
  <listitem><para><emphasis>Cost model:</emphasis>
  (Sum(cost(hashtable::method)) - Sum(cost(rbtree::method)), for
  method in [insert, erase, find])
  + (Cost(iterate hashtable) - Cost(iterate rbtree))</para></listitem>
  <listitem><para><emphasis>Example:</emphasis>
<programlisting>
1 set&lt;int&gt; s;
2 for (int i = 0; i &lt; 100000; ++i) {
3   s.insert(i);
4 }
5 int sum = 0;
6 for (int i = 0; i &lt; 100000; ++i) {
7   sum += *s.find(i);
8 }
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

</sect2>



<sect2 id="manual.ext.profile_mode.analysis.algorithms"
       xreflabel="Algorithms">
<title>Algorithms</title>

  <para><emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_ALGORITHMS</code>.
  </para>

<sect3 id="manual.ext.profile_mode.analysis.algorithms.sort"
       xreflabel="Sorting">
<title>Sort Algorithm Performance</title>
<itemizedlist>
  <listitem><para><emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_SORT</code>.
  </para></listitem>
  <listitem><para><emphasis>Goal:</emphasis> Give a measure of sort algorithm
  performance based on actual input. For instance, advise Radix Sort over
  Quick Sort for a particular call context.
  </para></listitem>
  <listitem><para><emphasis>Fundamentals:</emphasis>
  See papers:
  <ulink url="http://portal.acm.org/citation.cfm?doid=1065944.1065981">
  A framework for adaptive algorithm selection in STAPL</ulink> and
  <ulink url="http://ieeexplore.ieee.org/search/wrapper.jsp?arnumber=4228227">
  Optimizing Sorting with Machine Learning Algorithms</ulink>.
  </para></listitem>
  <listitem><para><emphasis>Sample runtime reduction:</emphasis> 60%.
  </para></listitem>
  <listitem><para><emphasis>Recommendation:</emphasis> Change sort algorithm
  at site S from X Sort to Y Sort.</para></listitem>
  <listitem><para><emphasis>To instrument:</emphasis> <code>sort</code>
  algorithm.</para></listitem>
  <listitem><para><emphasis>Analysis:</emphasis>
  Issue the advice if the cost model tells us that another sort algorithm
  would do better on this input. Requires us to know what algorithm we
  are using in our sort implementation in release mode.</para></listitem>
  <listitem><para><emphasis>Cost model:</emphasis>
  Runtime(algo) for algo in [radix, quick, merge, ...]</para></listitem>
  <listitem><para><emphasis>Example:</emphasis>
<programlisting>
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

</sect2>


<sect2 id="manual.ext.profile_mode.analysis.locality"
       xreflabel="Data Locality">
<title>Data Locality</title>

  <para><emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_LOCALITY</code>.
  </para>

<sect3 id="manual.ext.profile_mode.analysis.locality.sw_prefetch"
       xreflabel="Need Software Prefetch">
<title>Need Software Prefetch</title>
<itemizedlist>
  <listitem><para><emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_SOFTWARE_PREFETCH</code>.
  </para></listitem>
  <listitem><para><emphasis>Goal:</emphasis> Discover sequences of indirect
  memory accesses that are not regular and thus cannot be predicted by
  hardware prefetchers.
  </para></listitem>
  <listitem><para><emphasis>Fundamentals:</emphasis>
  Indirect references are hard to predict and are very expensive when they
  miss in caches.</para></listitem>
  <listitem><para><emphasis>Sample runtime reduction:</emphasis> 25%.
  </para></listitem>
  <listitem><para><emphasis>Recommendation:</emphasis> Insert prefetch
  instruction.</para></listitem>
  <listitem><para><emphasis>To instrument:</emphasis> Vector iterator and
  access operator [].
  </para></listitem>
  <listitem><para><emphasis>Analysis:</emphasis>
  First, get the cache line size and page size from the system.
  Then record iterator dereference sequences for which the value is a pointer.
  For each sequence within a container, issue a warning if successive pointer
  addresses are not within cache lines and do not form a linear pattern
  (otherwise they may be prefetched by hardware).
  If they also step across page boundaries, make the warning stronger.
  </para>
  <para>The same analysis applies to containers other than vector.
  However, we cannot give the same advice for linked structures, such as list,
  as there is no random access to the n-th element. The user may still be
  able to benefit from this information, for instance by employing frays
  (user-level lightweight threads) to hide the latency of chasing pointers.
  </para>
  <para>
  This analysis is a little oversimplified. A better cost model could be
  created by understanding the capability of the hardware prefetcher.
  This model could be trained automatically by running a set of synthetic
  cases.
  </para>
  </listitem>
  <listitem><para><emphasis>Cost model:</emphasis>
  Total distance between pointer values of successive elements in vectors
  of pointers.</para></listitem>
  <listitem><para><emphasis>Example:</emphasis>
<programlisting>
1 int zero = 0;
2 vector&lt;int*&gt; v(10000000, &amp;zero);
3 for (int k = 0; k &lt; 10000000; ++k) {
4   v[random() % 10000000] = new int(k);
5 }
6 for (int j = 0; j &lt; 10000000; ++j) {
7   count += (*v[j] == 0 ? 0 : 1);
8 }

foo.cc:7: advice: Insert prefetch instruction.
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

<sect3 id="manual.ext.profile_mode.analysis.locality.linked"
       xreflabel="Linked Structure Locality">
<title>Linked Structure Locality</title>
<itemizedlist>
  <listitem><para><emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_RBTREE_LOCALITY</code>.
  </para></listitem>
  <listitem><para><emphasis>Goal:</emphasis> Give a measure of the locality of
  objects stored in linked structures (lists, red-black trees and hashtables)
  with respect to their actual traversal patterns.
  </para></listitem>
  <listitem><para><emphasis>Fundamentals:</emphasis> Allocation can be tuned
  to a specific traversal pattern, to result in better data locality.
  See paper:
  <ulink url="http://www.springerlink.com/content/8085744l00x72662/">
  Custom Memory Allocation for Free</ulink>.
  </para></listitem>
  <listitem><para><emphasis>Sample runtime reduction:</emphasis> 30%.
  </para></listitem>
  <listitem><para><emphasis>Recommendation:</emphasis>
  High scatter score N for container built at site S.
  Consider changing the allocation sequence or choosing a structure-conscious
  allocator.</para></listitem>
  <listitem><para><emphasis>To instrument:</emphasis> Methods of all
  containers using linked structures.</para></listitem>
  <listitem><para><emphasis>Analysis:</emphasis>
  First, get the cache line size and page size from the system.
  Then record the number of successive elements that are on a different line
  or page, for each traversal method such as <code>find</code>. Give advice
  only if the ratio between this number and the number of total node hops
  is above a threshold.</para></listitem>
  <listitem><para><emphasis>Cost model:</emphasis>
  Sum(same_cache_line(this,previous))</para></listitem>
  <listitem><para><emphasis>Example:</emphasis>
<programlisting>
 1 set&lt;int&gt; s;
 2 for (int i = 0; i &lt; 10000000; ++i) {
 3   s.insert(i);
 4 }
 5 set&lt;int&gt; s1, s2;
 6 for (int i = 0; i &lt; 10000000; ++i) {
 7   s1.insert(i);
 8   s2.insert(i);
 9 }
...
   // Fast, better locality.
10 for (set&lt;int&gt;::iterator it = s.begin(); it != s.end(); ++it) {
11   sum += *it;
12 }
   // Slow, elements are further apart.
13 for (set&lt;int&gt;::iterator it = s1.begin(); it != s1.end(); ++it) {
14   sum += *it;
15 }

foo.cc:5: advice: High scatter score NNN for set built here. Consider changing
the allocation sequence or switching to a structure-conscious allocator.
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

</sect2>


<sect2 id="manual.ext.profile_mode.analysis.mthread"
       xreflabel="Multithreaded Data Access">
<title>Multithreaded Data Access</title>

  <para>
  The diagnostics in this group are not meant to be implemented in the
  short term. They require compiler support to know when container elements
  are written to. Instrumentation can only tell us when elements are referenced.
  </para>

  <para><emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_MULTITHREADED</code>.
  </para>

<sect3 id="manual.ext.profile_mode.analysis.mthread.ddtest"
       xreflabel="Dependence Violations at Container Level">
<title>Data Dependence Violations at Container Level</title>
<itemizedlist>
  <listitem><para><emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_DDTEST</code>.
  </para></listitem>
  <listitem><para><emphasis>Goal:</emphasis> Detect container elements
  that are referenced from multiple threads in the parallel region or
  across parallel regions.
  </para></listitem>
  <listitem><para><emphasis>Fundamentals:</emphasis>
  Sharing data between threads requires communication and perhaps locking,
  which may be expensive.
  </para></listitem>
  <listitem><para><emphasis>Sample runtime reduction:</emphasis> ?%.
  </para></listitem>
  <listitem><para><emphasis>Recommendation:</emphasis> Change data
  distribution or parallel algorithm.</para></listitem>
  <listitem><para><emphasis>To instrument:</emphasis> Container access methods
  and iterators.
  </para></listitem>
  <listitem><para><emphasis>Analysis:</emphasis>
  Keep a shadow for each container. Record iterator dereferences and
  container member accesses. Issue advice for elements referenced by
  multiple threads.
  See paper: <ulink url="http://portal.acm.org/citation.cfm?id=207110.207148">
  The LRPD test: speculative run-time parallelization of loops with
  privatization and reduction parallelization</ulink>.
  </para></listitem>
  <listitem><para><emphasis>Cost model:</emphasis>
  Number of accesses to elements referenced from multiple threads.
  </para></listitem>
  <listitem><para><emphasis>Example:</emphasis>
<programlisting>
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

<sect3 id="manual.ext.profile_mode.analysis.mthread.false_share"
       xreflabel="False Sharing">
<title>False Sharing</title>
<itemizedlist>
  <listitem><para><emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_FALSE_SHARING</code>.
  </para></listitem>
  <listitem><para><emphasis>Goal:</emphasis> Detect elements in the
  same container which share a cache line, are written by at least one
  thread, and accessed by different threads.
  </para></listitem>
  <listitem><para><emphasis>Fundamentals:</emphasis> Under these conditions,
  cache coherence protocols require communication to invalidate lines, which
  may be expensive.
  </para></listitem>
  <listitem><para><emphasis>Sample runtime reduction:</emphasis> 68%.
  </para></listitem>
  <listitem><para><emphasis>Recommendation:</emphasis> Reorganize the
  container or use padding to avoid false sharing.</para></listitem>
  <listitem><para><emphasis>To instrument:</emphasis> Container access methods
  and iterators.
  </para></listitem>
  <listitem><para><emphasis>Analysis:</emphasis>
  First, get the cache line size.
  For each shared container, record all the associated iterator dereferences
  and member access methods together with the thread id.  Compare the address
  lists across threads to detect references in two different threads to the
  same cache line.  Issue a warning only if the ratio to total references is
  significant.
  Do the same for iterator dereference values if they are
  pointers.</para></listitem>
  <listitem><para><emphasis>Cost model:</emphasis>
  Number of accesses to the same cache line from different threads.
  </para></listitem>
  <listitem><para><emphasis>Example:</emphasis>
<programlisting>
1  vector<int> v(2, 0);
2  #pragma omp parallel for shared(v, SIZE) schedule(static, 1)
3  for (i = 0; i < SIZE; ++i) {
4    v[i % 2] += i;
5  }

OMP_NUM_THREADS=2 ./a.out
foo.cc:1: advice: Change container structure or padding to avoid false
sharing in multithreaded access at foo.cc:4.  Detected N shared cache lines.
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

</sect2>


<sect2 id="manual.ext.profile_mode.analysis.statistics"
       xreflabel="Statistics">
<title>Statistics</title>

<para>
<emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_STATISTICS</code>.
</para>

<para>
  In some cases the cost model may not tell us anything because the costs
  appear to offset the benefits.  Consider the choice between a vector and
  a list.  When there are both inserts and iteration, automatic advice
  may not be issued.  However, the programmer may still be able to make use
  of this information in a different way.
</para>
<para>
  This diagnostic does not issue any advice, but it prints statistics for
  each container construction site.  The statistics contain the cost
  of each operation actually performed on the container.
</para>

</sect2>


</sect1>


<bibliography id="profile_mode.biblio">
<title>Bibliography</title>

  <biblioentry>
    <title>
      Perflint: A Context Sensitive Performance Advisor for C++ Programs
    </title>

    <author>
      <firstname>Lixia</firstname>
      <surname>Liu</surname>
    </author>
    <author>
      <firstname>Silvius</firstname>
      <surname>Rus</surname>
    </author>

    <copyright>
      <year>2009</year>
      <holder></holder>
    </copyright>

    <publisher>
      <publishername>
        Proceedings of the 2009 International Symposium on Code Generation
        and Optimization
      </publishername>
    </publisher>
  </biblioentry>
</bibliography>


</chapter>