<chapter xmlns="http://docbook.org/ns/docbook" version="5.0"
	 xml:id="manual.ext.allocator.mt" xreflabel="mt allocator">
<?dbhtml filename="mt_allocator.html"?>

<info><title>The mt_allocator</title>
  <keywordset>
    <keyword>ISO C++</keyword>
    <keyword>allocator</keyword>
  </keywordset>
</info>

<para>
</para>

<section xml:id="allocator.mt.intro"><info><title>Intro</title></info>

<para>
  The mt allocator [hereinafter referred to simply as "the allocator"]
  is a fixed size (power of two) allocator that was initially
  developed specifically to suit the needs of multi-threaded
  applications [hereinafter referred to as an MT application]. Over
  time the allocator has evolved and been improved in many ways; in
  particular, it now also does a good job in single-threaded
  applications [hereinafter referred to as an ST application]. (Note:
  In this document, single-threaded applications also include
  applications that are compiled with gcc without
  thread support enabled. This is accomplished using #ifdef checks on
  __GTHREADS.) This allocator is tunable, very flexible, and capable
  of high performance.
</para>

<para>
  The aim of this document is to describe - from an application point of
  view - the "inner workings" of the allocator.
</para>

</section>


<section xml:id="allocator.mt.design_issues"><info><title>Design Issues</title></info>
<?dbhtml filename="mt_allocator_design.html"?>


<section xml:id="allocator.mt.overview"><info><title>Overview</title></info>

<para> There are three general components to the allocator: a datum
describing the characteristics of the memory pool, a policy class
containing this pool that links instantiation types to common or
individual pools, and a class inheriting from the policy class that is
the actual allocator.
</para>

<para>The datum describing pool characteristics is
</para>
<programlisting>
  template<bool _Thread>
    class __pool
</programlisting>
<para> This class is parametrized on thread support, and is explicitly
specialized for both multiple threads (with <code>bool==true</code>)
and single threads (via <code>bool==false</code>). It is possible to
use a custom pool datum instead of the default class that is provided.
</para>

<para> There are two distinct policy classes, each of which can be used
with either type of underlying pool datum.
</para>

<programlisting>
  template<bool _Thread>
    struct __common_pool_policy

  template<typename _Tp, bool _Thread>
    struct __per_type_pool_policy
</programlisting>

<para> The first policy, <code>__common_pool_policy</code>, implements a
common pool. This means that allocators that are instantiated with
different types, say <code>char</code> and <code>long</code>, will both
use the same pool. This is the default policy.
</para>

<para> The second policy, <code>__per_type_pool_policy</code>, implements
a separate pool for each instantiating type. Thus, <code>char</code>
and <code>long</code> will use separate pools. This allows per-type
tuning, for instance.
</para>

<para> Putting this all together, the actual allocator class is
</para>
<programlisting>
  template<typename _Tp, typename _Poolp = __default_policy>
    class __mt_alloc : public __mt_alloc_base<_Tp>,  _Poolp
</programlisting>
<para> This class has the interface required for standard library allocator
classes, namely member functions <code>allocate</code> and
<code>deallocate</code>, plus others.
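As a minimal sketch of what that interface allows, the allocator can
be handed to a standard container in place of the default one. The
sketch assumes a GCC toolchain whose libstdc++ ships
<code>ext/mt_allocator.h</code> and relies on the default policy; the
function name <code>fill_with_mt_alloc</code> is hypothetical and only
serves the illustration.

```cpp
#include <ext/mt_allocator.h>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: fills a vector whose storage comes from
// __mt_alloc with the default policy and reports the resulting size.
std::size_t fill_with_mt_alloc(std::size_t n)
{
  typedef __gnu_cxx::__mt_alloc<int> allocator_type;

  // The container calls allocate() and deallocate() on our behalf.
  std::vector<int, allocator_type> v;
  for (std::size_t i = 0; i < n; ++i)
    v.push_back(static_cast<int>(i));
  return v.size();
}
```

Calling <code>fill_with_mt_alloc(100)</code> pushes 100 elements and
returns 100; the growth of the vector exercises both member functions.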
</para>

</section>
</section>

<section xml:id="allocator.mt.impl"><info><title>Implementation</title></info>
<?dbhtml filename="mt_allocator_impl.html"?>



<section xml:id="allocator.mt.tune"><info><title>Tunable Parameters</title></info>


<para>Certain allocation parameters can be modified, or tuned. There
exists a nested <code>struct __pool_base::_Tune</code> that contains all
these parameters, which include settings for
</para>
   <itemizedlist>
     <listitem><para>Alignment</para></listitem>
     <listitem><para>Maximum bytes before calling <code>::operator new</code> directly</para></listitem>
     <listitem><para>Minimum bytes</para></listitem>
     <listitem><para>Size of underlying global allocations</para></listitem>
     <listitem><para>Maximum number of supported threads</para></listitem>
     <listitem><para>Migration of deallocations to the global free list</para></listitem>
     <listitem><para>Shunt for global <code>new</code> and <code>delete</code></para></listitem>
   </itemizedlist>
<para>Adjusting parameters for a given instance of an allocator can only
happen before any allocations take place, when the allocator itself is
initialized.
For instance:
</para>
<programlisting>
#include <ext/mt_allocator.h>

struct pod
{
  int i;
  int j;
};

int main()
{
  typedef pod value_type;
  typedef __gnu_cxx::__mt_alloc<value_type> allocator_type;
  typedef __gnu_cxx::__pool_base::_Tune tune_type;

  // Constructor arguments follow the order of the parameter list above.
  tune_type t_default;
  tune_type t_opt(16, 5120, 32, 5120, 20, 10, false);
  tune_type t_single(16, 5120, 32, 5120, 1, 10, false);

  // Read and change the options through an allocator object, before
  // the first allocation takes place.
  allocator_type a;
  tune_type t;
  t = a._M_get_options();
  a._M_set_options(t_opt);
  t = a._M_get_options();

  allocator_type::pointer p1 = a.allocate(128);
  allocator_type::pointer p2 = a.allocate(5128);

  a.deallocate(p1, 128);
  a.deallocate(p2, 5128);

  return 0;
}
</programlisting>

</section>

<section xml:id="allocator.mt.init"><info><title>Initialization</title></info>


<para>
The static variables (pointers to freelists, tuning parameters etc)
are initialized as above, or are set to the global defaults.
</para>

<para>
The very first allocate() call will always call the
_S_initialize_once() function. In order to make sure that this
function is called exactly once we make use of a __gthread_once call
in MT applications and check a static bool (_S_init) in ST
applications.
</para>

<para>
The _S_initialize() function:
- If the GLIBCXX_FORCE_NEW environment variable is set, it sets the bool
  _S_force_new to true and then returns. This will cause subsequent calls to
  allocate() to return memory directly from a new() call, and deallocate will
  only do a delete() call.
</para>

<para>
- If the GLIBCXX_FORCE_NEW environment variable is not set, both ST and MT
  applications will:
  - Calculate the number of bins needed. A bin is a specific power of two size
    of bytes.
    For example, by default the allocator will deal with requests of up to
    128 bytes (or whatever the value of _S_max_bytes is when _S_init() is
    called). This means that there will be bins of the following sizes
    (in bytes): 1, 2, 4, 8, 16, 32, 64, 128.

  - Create the _S_binmap array. All requests are rounded up to the next
    "large enough" bin. For example, a request for 29 bytes will cause a block from
    the "32 byte bin" to be returned to the application. The purpose of
    _S_binmap is to speed up the process of finding out which bin to use.
    For example, the value of _S_binmap[ 29 ] is initialized to 5 (bin 5 = 32 bytes).
</para>
<para>
  - Create the _S_bin array. This array consists of bin_records. There will be
    as many bin_records in this array as the number of bins that we calculated
    earlier. For example, if _S_max_bytes = 128 there will be 8 entries.
    Each bin_record is then initialized:
    - bin_record->first = An array of pointers to block_records. There will be
      as many block_record pointers as the maximum number of threads
      (in an ST application there is only 1 thread, in an MT application there
      are _S_max_threads).
      This holds the pointer to the first free block for each thread in this
      bin. For example, if we would like to know where the first free block of size 32
      for thread number 3 is, we would look this up by: _S_bin[ 5 ].first[ 3 ]

    The block_record pointers created above are now initialized to
    their initial values. I.e., _S_bin[ n ].first[ n ] = NULL;
</para>

<para>
- Additionally an MT application will:
  - Create a list of free thread ids. The pointer to the first entry
    is stored in _S_thread_freelist_first. The reason for this approach is
    that the __gthread_self() call will not return a value that corresponds to
    the maximum number of threads allowed but rather a process id number or
    something else. So what we do is that we create a list of thread_records.
    This list is _S_max_threads long and each entry holds a size_t thread_id
    which is initialized to 1, 2, 3, 4, 5 and so on up to _S_max_threads.
    Each time a thread calls allocate() or deallocate() we call
    _S_get_thread_id() which looks at the value of _S_thread_key which is a
    thread local storage pointer. If this is NULL we know that this is a newly
    created thread and we pop the first entry from this list and save the
    pointer to this record in the _S_thread_key variable. The next time
    we will get the pointer to the thread_record back and we use the
    thread_record->thread_id as identification. For example, the first thread that
    calls allocate will get the first record in this list and thus be thread
    number 1 and will then find the pointer to its first free 32 byte block
    in _S_bin[ 5 ].first[ 1 ]
    When we create the _S_thread_key we also define a destructor
    (_S_thread_key_destr) which means that when the thread dies, this
    thread_record is returned to the front of this list and the thread id
    can then be reused if a new thread is created.
    This list is protected by a mutex (_S_thread_freelist_mutex) which is only
    locked when records are removed from or added to the list.
</para>
<para>
  - Initialize the free and used counters of each bin_record:
    - bin_record->free = An array of size_t. This keeps track of the number
      of blocks on a specific thread's freelist in each bin. For example, if a thread
      has 12 32-byte blocks on its freelist and allocates one of these, this
      counter would be decreased to 11.

    - bin_record->used = An array of size_t. This keeps track of the number
      of blocks currently in use of this size by this thread. For example, if a thread
      has made 678 requests (and no deallocations...) of 32-byte blocks this
      counter will read 678.

    The arrays created above are now initialized with their initial values.
    I.e.
    _S_bin[ n ].free[ n ] = 0;
</para>
<para>
  - Initialize the mutex of each bin_record: The bin_record->mutex
    is used to protect the global freelist. This concept of a global
    freelist is explained in more detail in the section "Multiple
    Thread Example", but basically this mutex is locked whenever a
    block of memory is retrieved from or returned to the global freelist
    for this specific bin. This only occurs when a number of blocks
    are grabbed from the global list to a thread specific list or when
    a thread decides to return some blocks to the global freelist.
</para>

</section>

<section xml:id="allocator.mt.deallocation"><info><title>Deallocation Notes</title></info>


<para> This allocator does not explicitly
release memory back to the OS, but keeps its own freelists instead.
Because of this, memory debugging programs like
valgrind or purify may report leaks: sorry about this
inconvenience. Operating systems will reclaim allocated memory at
program termination anyway. If sidestepping this kind of noise is
desired, there are three options: use an allocator, like
<code>new_allocator</code>, that releases memory while debugging, use
GLIBCXX_FORCE_NEW to bypass the allocator's internal pools, or use a
custom pool datum that releases resources on destruction.
</para>

<para>
  On systems with the function <code>__cxa_atexit</code>, the
allocator can be forced to free all memory allocated before program
termination with the member function
<code>__pool_type::_M_destroy</code>. However, because this member
function relies on the precise and exactly-conforming ordering of
static destructors, including those of a static local
<code>__pool</code> object, it should not be used, ever, on systems
that don't have the necessary underlying support.
In addition, in
practice, forcing deallocation can be tricky, as it requires the
<code>__pool</code> object to be fully constructed before the object
that uses it is fully constructed. For most (but not all) STL
containers, this works, as an instance of the allocator is constructed
as part of a container's constructor. However, this assumption is
implementation-specific, and subject to change. For an example of a
pool that frees memory, see the following
    <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://gcc.gnu.org/viewcvs/gcc/trunk/libstdc++-v3/testsuite/ext/mt_allocator/deallocate_local-6.cc?view=markup">
    example.</link>
</para>

</section>

</section>

<section xml:id="allocator.mt.example_single"><info><title>Single Thread Example</title></info>
<?dbhtml filename="mt_allocator_ex_single.html"?>


<para>
Let's start by describing how the data on a freelist is laid out in memory.
These are the first two blocks in the freelist for thread id 3 in bin 3 (8 bytes):
</para>
<programlisting>
+----------------+
| next* ---------|--+  (_S_bin[ 3 ].first[ 3 ] points here)
|                |  |
|                |  |
|                |  |
+----------------+  |
| thread_id = 3  |  |
|                |  |
|                |  |
|                |  |
+----------------+  |
| DATA           |  |  (A pointer to here is what is returned to
|                |  |   the application when needed)
|                |  |
|                |  |
|                |  |
|                |  |
|                |  |
|                |  |
+----------------+  |
+----------------+  |
| next*          |<-+  (If next == NULL it's the last one on the list)
|                |
|                |
|                |
+----------------+
| thread_id = 3  |
|                |
|                |
|                |
+----------------+
| DATA           |
|                |
|                |
|                |
|                |
|                |
|                |
|                |
+----------------+
</programlisting>

<para>
With this in mind we simplify things a bit for a while and say that there is
only one thread (an ST application).
In this case all operations are made to
what is referred to as the global pool - thread id 0 (no thread may be
assigned this id, since thread ids span from 1 to _S_max_threads in an MT application).
</para>
<para>
When the application requests memory (calling allocate()) we first look at the
requested size and if this is > _S_max_bytes we call new() directly and return.
</para>
<para>
If the requested size is within limits we start by finding out from which
bin we should serve this request by looking in _S_binmap.
</para>
<para>
A quick look at _S_bin[ bin ].first[ 0 ] tells us if there are any blocks of
this size on the freelist (0). If this is not NULL - fine, just remove the
block that _S_bin[ bin ].first[ 0 ] points to from the list,
update _S_bin[ bin ].first[ 0 ] and return a pointer to that block's data.
</para>
<para>
If the freelist is empty (the pointer is NULL) we must get memory from the
system and build a freelist within this memory. All requests for new memory
are made in chunks of _S_chunk_size. Knowing the size of a block_record and
the bytes that this bin stores we then calculate how many blocks we can create
within this chunk, build the list, remove the first block, update the pointer
(_S_bin[ bin ].first[ 0 ]) and return a pointer to that block's data.
</para>

<para>
Deallocation is equally simple: the pointer is cast back to a block_record
pointer, we look up which bin to use based on the size, add the block to the front
of the global freelist and update the pointer as needed
(_S_bin[ bin ].first[ 0 ]).
</para>

<para>
The decision to add deallocated blocks to the front of the freelist was made
after a set of performance measurements that showed that this is roughly 10%
faster than maintaining a set of "last pointers" as well.
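The single-threaded path described in this section can be sketched
as follows. This is a simplified illustration under stated
assumptions, not the library's code: <code>single_bin</code>,
<code>header_size</code> and the member names are hypothetical, the
block header holds only the next pointer (the thread id field is left
out), and, like the allocator itself, the sketch never returns chunks
to the system.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical block header: just the link to the next free block.
struct block_record
{
  block_record* next;
};

// Assumption: pad the header to the strictest fundamental alignment so
// that the data area following it is suitably aligned.
const std::size_t header_size = alignof(std::max_align_t);
static_assert(sizeof(block_record) <= header_size, "header must fit");

class single_bin
{
public:
  single_bin(std::size_t block_size, std::size_t chunk_size)
  : first_(nullptr),
    stride_(header_size
            + (block_size + header_size - 1) / header_size * header_size),
    chunk_size_(chunk_size)
  { assert(chunk_size_ >= stride_); }

  // Pop the first free block; build a new freelist if none is left.
  void* allocate()
  {
    if (first_ == nullptr)
      refill();
    block_record* b = first_;
    first_ = b->next;
    return reinterpret_cast<char*>(b) + header_size;
  }

  // Step back to the header and push the block onto the front.
  void deallocate(void* p)
  {
    block_record* b = reinterpret_cast<block_record*>(
        static_cast<char*>(p) - header_size);
    b->next = first_;
    first_ = b;
  }

private:
  // Carve one chunk into as many blocks as fit and chain them up.
  void refill()
  {
    char* chunk = static_cast<char*>(::operator new(chunk_size_));
    const std::size_t count = chunk_size_ / stride_;
    block_record* next = nullptr;
    for (std::size_t i = count; i-- > 0; )
      {
        block_record* b = new (chunk + i * stride_) block_record;
        b->next = next;
        next = b;
      }
    first_ = next;
  }

  block_record* first_;      // head of the freelist
  std::size_t   stride_;     // header plus padded data size
  std::size_t   chunk_size_; // bytes requested from the system at once
};
```

Because blocks are pushed onto the front, a block that has just been
deallocated is the next one handed out again.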
</para>

</section>

<section xml:id="allocator.mt.example_multi"><info><title>Multiple Thread Example</title></info>
<?dbhtml filename="mt_allocator_ex_multi.html"?>


<para>
In the ST example we never used the thread_id variable present in each block.
Let's start by explaining the purpose of this in an MT application.
</para>

<para>
The concept of "ownership" was introduced since many MT applications
allocate and deallocate memory to shared containers from different
threads (such as a cache shared amongst all threads). This introduces
a problem if the allocator only returns memory to the current thread's
freelist (for example, there might be one thread doing all the allocation and
thus obtaining ever more memory from the system and another thread
that is getting a longer and longer freelist - this will in the end
consume all available memory).
</para>

<para>
Each time a block is moved from the global list (where ownership is
irrelevant) to a thread's freelist (or when a new freelist is built
from a chunk directly onto a thread's freelist or when a deallocation
occurs on a block which was not allocated by the same thread id as the
one doing the deallocation) the thread id is set to the current one.
</para>

<para>
What's the use? Well, when a deallocation occurs we can now look at
the thread id and find out if it was allocated by another thread id
and decrease the used counter of that thread instead, thus keeping the
free and used counters correct. And keeping the free and used counters
correct is very important since the relationship between these two
variables decides if memory should be returned to the global pool or
not when a deallocation occurs.
</para>

<para>
When the application requests memory (calling allocate()) we first
look at the requested size and if this is > _S_max_bytes we call new()
directly and return.
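The size check, together with the bin lookup described in the
Initialization section, can be sketched as follows. The sketch
assumes bins that start at 1 byte and double in size, as in the
default example given there; <code>build_binmap</code> and
<code>lookup_bin</code> are hypothetical names, not library functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for _S_binmap: entry n holds the index of the
// smallest power-of-two bin that can hold n bytes.
std::vector<std::size_t> build_binmap(std::size_t max_bytes)
{
  std::vector<std::size_t> binmap(max_bytes + 1);
  std::size_t bin = 0;       // current bin index
  std::size_t bin_size = 1;  // bytes served by the current bin
  for (std::size_t n = 0; n <= max_bytes; ++n)
    {
      if (n > bin_size)
        {
          // The request no longer fits: move on to the next bin.
          bin_size <<= 1;
          ++bin;
        }
      binmap[n] = bin;
    }
  return binmap;
}

// Returns true and sets bin when the pool serves the request; returns
// false when the request exceeds the maximum and should go to new().
bool lookup_bin(const std::vector<std::size_t>& binmap, std::size_t bytes,
                std::size_t& bin)
{
  if (bytes >= binmap.size())
    return false;
  bin = binmap[bytes];
  return true;
}
```

With a maximum of 128 bytes, a request for 29 bytes maps to bin 5
(32 bytes) and a request for 129 bytes is rejected by the size check.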
</para>

<para>
If the requested size is within limits we start by finding out from which
bin we should serve this request by looking in _S_binmap.
</para>

<para>
A call to _S_get_thread_id() returns the thread id for the calling thread
(and if no value has been set in _S_thread_key, a new id is assigned and
returned).
</para>

<para>
A quick look at _S_bin[ bin ].first[ thread_id ] tells us if there are
any blocks of this size on the current thread's freelist. If this is
not NULL - fine, just remove the block that _S_bin[ bin ].first[
thread_id ] points to from the list, update _S_bin[ bin ].first[
thread_id ], update the free and used counters and return a pointer to
that block's data.
</para>

<para>
If the freelist is empty (the pointer is NULL) we start by looking at
the global freelist (0). If there are blocks available on the global
freelist we lock this bin's mutex and move up to block_count (the
number of blocks of this bin's size that will fit into a _S_chunk_size)
or until end of list - whichever comes first - to the current thread's
freelist and at the same time change the thread_id ownership and
update the counters and pointers. When the bin's mutex has been
unlocked, we remove the block that _S_bin[ bin ].first[ thread_id ]
points to from the list, update _S_bin[ bin ].first[ thread_id ],
update the free and used counters, and return a pointer to that block's
data.
</para>

<para>
The reason that the number of blocks moved to the current thread's
freelist is limited to block_count is to minimize the chance that a
subsequent deallocate() call will return the excess blocks to the
global freelist (based on the _S_freelist_headroom calculation, see
below).
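The bounded move from the global freelist can be sketched as follows.
This is an illustration under assumptions rather than the library's
code: <code>bin_sketch</code> and <code>refill_from_global</code> are
hypothetical names, only one thread's freelist is modelled, and the
free and used counters are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>

struct block_record
{
  block_record* next;
  std::size_t   thread_id;  // owner of the block
};

// Hypothetical, reduced view of one bin.
struct bin_sketch
{
  block_record* global_first = nullptr;  // global freelist (thread id 0)
  block_record* thread_first = nullptr;  // freelist of one thread
  std::mutex    mutex;                   // protects the global freelist
};

// Moves at most block_count blocks from the global freelist to the
// thread's freelist, taking ownership of each one. Returns the number
// of blocks actually moved.
std::size_t refill_from_global(bin_sketch& bin, std::size_t thread_id,
                               std::size_t block_count)
{
  std::lock_guard<std::mutex> lock(bin.mutex);
  std::size_t moved = 0;
  while (bin.global_first != nullptr && moved < block_count)
    {
      block_record* b = bin.global_first;
      bin.global_first = b->next;   // unlink from the global list
      b->thread_id = thread_id;     // change ownership
      b->next = bin.thread_first;   // push onto the thread's list
      bin.thread_first = b;
      ++moved;
    }
  return moved;
}
```

The mutex is held only for the duration of the move; once the blocks
are on the thread's own freelist they can be handed out without any
locking.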
</para>

<para>
However if there isn't any memory on the global pool we need to get
memory from the system - this is done in exactly the same way as in a
single threaded application with one major difference: the list built
in the newly allocated memory (of _S_chunk_size size) is added to the
current thread's freelist instead of to the global one.
</para>

<para>
The basic process of a deallocation call is simple: always add the
block to the front of the current thread's freelist and update the
counters and pointers (as described earlier with the specific check of
ownership that causes the used counter of the thread that originally
allocated the block to be decreased instead of the current thread's
counter).
</para>

<para>
This is where the free and used counters come into play. Each time a
deallocate() call is made, the length of the current thread's
freelist is compared to the amount of memory in use by this thread.
</para>

<para>
Let's go back to the example of an application that has one thread
that does all the allocations and one that deallocates. Both these
threads use, say, 516 32-byte blocks that were allocated during thread
creation, for example. Their used counters will both say 516 at this
point. The allocation thread now grabs 1000 32-byte blocks and puts
them in a shared container. The used counter for this thread is now
1516.
</para>

<para>
The deallocation thread now deallocates 500 of these blocks. For each
deallocation made the used counter of the allocating thread is
decreased and the freelist of the deallocation thread gets longer and
longer. But the calculation made in deallocate() will limit the length
of the freelist in the deallocation thread to _S_freelist_headroom %
of its used counter.
In this case, when the freelist (given that the
_S_freelist_headroom is at its default value of 10%) exceeds 52 blocks
(about 516/10), blocks will be returned to the global pool where the
allocating thread may pick them up and reuse them.
</para>

<para>
In order to reduce lock contention (since this requires this bin's
mutex to be locked) this operation is also made in chunks of blocks
(just like when chunks of blocks are moved from the global freelist to
a thread's freelist, as mentioned above). The "formula" used can probably
be improved to further reduce the risk of blocks being "bounced back
and forth" between freelists.
</para>

</section>

</chapter>