mt_allocator.xml revision 1.6
1<chapter xmlns="http://docbook.org/ns/docbook" version="5.0" 
2	 xml:id="manual.ext.allocator.mt" xreflabel="mt allocator">
3<?dbhtml filename="mt_allocator.html"?>
4
5<info><title>The mt_allocator</title>
6  <keywordset>
7    <keyword>ISO C++</keyword>
8    <keyword>allocator</keyword>
9  </keywordset>
10</info>
11
12
13
14<para>
15</para>
16
17<section xml:id="allocator.mt.intro"><info><title>Intro</title></info>
18
19
20<para>
21  The mt allocator [hereinafter referred to simply as "the allocator"]
22  is a fixed size (power of two) allocator that was initially
23  developed specifically to suit the needs of multi threaded
24  applications [hereinafter referred to as an MT application]. Over
25  time the allocator has evolved and been improved in many ways, in
26  particular it now also does a good job in single threaded
27  applications [hereinafter referred to as a ST application]. (Note:
28  In this document, when referring to single threaded applications
29  this also includes applications that are compiled with gcc without
30  thread support enabled. This is accomplished using ifdef's on
31  __GTHREADS). This allocator is tunable, very flexible, and capable
32  of high-performance.
33</para>
34
35<para>
36  The aim of this document is to describe - from an application point of
37  view - the "inner workings" of the allocator.
38</para>
39
40</section>
41
42
43<section xml:id="allocator.mt.design_issues"><info><title>Design Issues</title></info>
44<?dbhtml filename="mt_allocator_design.html"?>
45
46
47<section xml:id="allocator.mt.overview"><info><title>Overview</title></info>
48
49
50
51<para> There are three general components to the allocator: a datum
52describing the characteristics of the memory pool, a policy class
53containing this pool that links instantiation types to common or
54individual pools, and a class inheriting from the policy class that is
55the actual allocator.
56</para>
57
58<para>The datum describing pools characteristics is
59</para>
60<programlisting>
61  template&lt;bool _Thread&gt;
62    class __pool
63</programlisting>
64<para> This class is parametrized on thread support, and is explicitly
65specialized for both multiple threads (with <code>bool==true</code>)
66and single threads (via <code>bool==false</code>.) It is possible to
67use a custom pool datum instead of the default class that is provided.
68</para>
69
70<para> There are two distinct policy classes, each of which can be used
71with either type of underlying pool datum.
72</para>
73
74<programlisting>
75  template&lt;bool _Thread&gt;
76    struct __common_pool_policy
77
78  template&lt;typename _Tp, bool _Thread&gt;
79    struct __per_type_pool_policy
80</programlisting>
81
82<para> The first policy, <code>__common_pool_policy</code>, implements a
83common pool. This means that allocators that are instantiated with
84different types, say <code>char</code> and <code>long</code> will both
85use the same pool. This is the default policy.
86</para>
87
88<para> The second policy, <code>__per_type_pool_policy</code>, implements
89a separate pool for each instantiating type. Thus, <code>char</code>
90and <code>long</code> will use separate pools. This allows per-type
91tuning, for instance.
92</para>
93
94<para> Putting this all together, the actual allocator class is
95</para>
96<programlisting>
97  template&lt;typename _Tp, typename _Poolp = __default_policy&gt;
98    class __mt_alloc : public __mt_alloc_base&lt;_Tp&gt;,  _Poolp
99</programlisting>
100<para> This class has the interface required for standard library allocator
101classes, namely member functions <code>allocate</code> and
102<code>deallocate</code>, plus others.
103</para>
104
105</section>
106</section>
107
108<section xml:id="allocator.mt.impl"><info><title>Implementation</title></info>
109<?dbhtml filename="mt_allocator_impl.html"?>
110
111
112
113<section xml:id="allocator.mt.tune"><info><title>Tunable Parameters</title></info>
114
115
116<para>Certain allocation parameters can be modified, or tuned. There
117exists a nested <code>struct __pool_base::_Tune</code> that contains all
118these parameters, which include settings for
119</para>
120   <itemizedlist>
121     <listitem><para>Alignment</para></listitem>
122     <listitem><para>Maximum bytes before calling <code>::operator new</code> directly</para></listitem>
123     <listitem><para>Minimum bytes</para></listitem>
124     <listitem><para>Size of underlying global allocations</para></listitem>
125     <listitem><para>Maximum number of supported threads</para></listitem>
126     <listitem><para>Migration of deallocations to the global free list</para></listitem>
127     <listitem><para>Shunt for global <code>new</code> and <code>delete</code></para></listitem>
128   </itemizedlist>
129<para>Adjusting parameters for a given instance of an allocator can only
130happen before any allocations take place, when the allocator itself is
131initialized. For instance:
132</para>
133<programlisting>
134#include &lt;ext/mt_allocator.h&gt;
135
136struct pod
137{
138  int i;
139  int j;
140};
141
142int main()
143{
144  typedef pod value_type;
145  typedef __gnu_cxx::__mt_alloc&lt;value_type&gt; allocator_type;
146  typedef __gnu_cxx::__pool_base::_Tune tune_type;
147
148  tune_type t_default;
149  tune_type t_opt(16, 5120, 32, 5120, 20, 10, false);
150  tune_type t_single(16, 5120, 32, 5120, 1, 10, false);
151
152  tune_type t;
153  t = allocator_type::_M_get_options();
154  allocator_type::_M_set_options(t_opt);
155  t = allocator_type::_M_get_options();
156
157  allocator_type a;
158  allocator_type::pointer p1 = a.allocate(128);
159  allocator_type::pointer p2 = a.allocate(5128);
160
161  a.deallocate(p1, 128);
162  a.deallocate(p2, 5128);
163
164  return 0;
165}
166</programlisting>
167
168</section>
169
170<section xml:id="allocator.mt.init"><info><title>Initialization</title></info>
171
172
173<para>
174The static variables (pointers to freelists, tuning parameters etc)
175are initialized as above, or are set to the global defaults.
176</para>
177
178<para>
179The very first allocate() call will always call the
180_S_initialize_once() function.  In order to make sure that this
181function is called exactly once we make use of a __gthread_once call
182in MT applications and check a static bool (_S_init) in ST
183applications.
184</para>
185
186<para>
187The _S_initialize() function:
188- If the GLIBCXX_FORCE_NEW environment variable is set, it sets the bool
189  _S_force_new to true and then returns. This will cause subsequent calls to
190  allocate() to return memory directly from a new() call, and deallocate will
191  only do a delete() call.
192</para>
193
194<para>
195- If the GLIBCXX_FORCE_NEW environment variable is not set, both ST and MT
196  applications will:
197  - Calculate the number of bins needed. A bin is a specific power of two size
198    of bytes. I.e., by default the allocator will deal with requests of up to
199    128 bytes (or whatever the value of _S_max_bytes is when _S_init() is
200    called). This means that there will be bins of the following sizes
201    (in bytes): 1, 2, 4, 8, 16, 32, 64, 128.
202
203  - Create the _S_binmap array. All requests are rounded up to the next
204    "large enough" bin. I.e., a request for 29 bytes will cause a block from
205    the "32 byte bin" to be returned to the application. The purpose of
206    _S_binmap is to speed up the process of finding out which bin to use.
207    I.e., the value of _S_binmap[ 29 ] is initialized to 5 (bin 5 = 32 bytes).
208</para>
209<para>
210  - Create the _S_bin array. This array consists of bin_records. There will be
211    as many bin_records in this array as the number of bins that we calculated
212    earlier. I.e., if _S_max_bytes = 128 there will be 8 entries.
213    Each bin_record is then initialized:
214    - bin_record-&gt;first = An array of pointers to block_records. There will be
215      as many block_records pointers as there are maximum number of threads
216      (in a ST application there is only 1 thread, in a MT application there
217      are _S_max_threads).
218      This holds the pointer to the first free block for each thread in this
219      bin. I.e., if we would like to know where the first free block of size 32
220      for thread number 3 is we would look this up by: _S_bin[ 5 ].first[ 3 ]
221
222    The above created block_record pointers members are now initialized to
223    their initial values. I.e. _S_bin[ n ].first[ n ] = NULL;
224</para>
225
226<para>
227- Additionally a MT application will:
228  - Create a list of free thread id's. The pointer to the first entry
229    is stored in _S_thread_freelist_first. The reason for this approach is
230    that the __gthread_self() call will not return a value that corresponds to
231    the maximum number of threads allowed but rather a process id number or
232    something else. So what we do is that we create a list of thread_records.
233    This list is _S_max_threads long and each entry holds a size_t thread_id
234    which is initialized to 1, 2, 3, 4, 5 and so on up to _S_max_threads.
235    Each time a thread calls allocate() or deallocate() we call
236    _S_get_thread_id() which looks at the value of _S_thread_key which is a
237    thread local storage pointer. If this is NULL we know that this is a newly
238    created thread and we pop the first entry from this list and saves the
239    pointer to this record in the _S_thread_key variable. The next time
240    we will get the pointer to the thread_record back and we use the
241    thread_record-&gt;thread_id as identification. I.e., the first thread that
242    calls allocate will get the first record in this list and thus be thread
243    number 1 and will then find the pointer to its first free 32 byte block
244    in _S_bin[ 5 ].first[ 1 ]
245    When we create the _S_thread_key we also define a destructor
246    (_S_thread_key_destr) which means that when the thread dies, this
247    thread_record is returned to the front of this list and the thread id
248    can then be reused if a new thread is created.
249    This list is protected by a mutex (_S_thread_freelist_mutex) which is only
250    locked when records are removed or added to the list.
251</para>
252<para>
253  - Initialize the free and used counters of each bin_record:
254    - bin_record-&gt;free = An array of size_t. This keeps track of the number
255      of blocks on a specific thread's freelist in each bin. I.e., if a thread
256      has 12 32-byte blocks on it's freelists and allocates one of these, this
257      counter would be decreased to 11.
258
259    - bin_record-&gt;used = An array of size_t. This keeps track of the number
260      of blocks currently in use of this size by this thread. I.e., if a thread
261      has made 678 requests (and no deallocations...) of 32-byte blocks this
262      counter will read 678.
263
264    The above created arrays are now initialized with their initial values.
265    I.e. _S_bin[ n ].free[ n ] = 0;
266</para>
267<para>
268  - Initialize the mutex of each bin_record: The bin_record-&gt;mutex
269    is used to protect the global freelist. This concept of a global
270    freelist is explained in more detail in the section "A multi
271    threaded example", but basically this mutex is locked whenever a
272    block of memory is retrieved or returned to the global freelist
273    for this specific bin. This only occurs when a number of blocks
274    are grabbed from the global list to a thread specific list or when
275    a thread decides to return some blocks to the global freelist.
276</para>
277
278</section>
279
280<section xml:id="allocator.mt.deallocation"><info><title>Deallocation Notes</title></info>
281
282
283<para> Notes about deallocation. This allocator does not explicitly
284release memory. Because of this, memory debugging programs like
285valgrind or purify may notice leaks: sorry about this
286inconvenience. Operating systems will reclaim allocated memory at
287program termination anyway. If sidestepping this kind of noise is
288desired, there are three options: use an allocator, like
289<code>new_allocator</code> that releases memory while debugging, use
290GLIBCXX_FORCE_NEW to bypass the allocator's internal pools, or use a
291custom pool datum that releases resources on destruction.
292</para>
293
294<para>
295  On systems with the function <code>__cxa_atexit</code>, the
296allocator can be forced to free all memory allocated before program
297termination with the member function
298<code>__pool_type::_M_destroy</code>. However, because this member
299function relies on the precise and exactly-conforming ordering of
300static destructors, including those of a static local
301<code>__pool</code> object, it should not be used, ever, on systems
302that don't have the necessary underlying support. In addition, in
303practice, forcing deallocation can be tricky, as it requires the
304<code>__pool</code> object to be fully-constructed before the object
305that uses it is fully constructed. For most (but not all) STL
306containers, this works, as an instance of the allocator is constructed
307as part of a container's constructor. However, this assumption is
308implementation-specific, and subject to change. For an example of a
309pool that frees memory, see the following
310    <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://gcc.gnu.org/viewcvs/gcc/trunk/libstdc++-v3/testsuite/ext/mt_allocator/deallocate_local-6.cc?view=markup">
311    example.</link>
312</para>
313
314</section>
315
316</section>
317
318<section xml:id="allocator.mt.example_single"><info><title>Single Thread Example</title></info>
319<?dbhtml filename="mt_allocator_ex_single.html"?>
320
321
322<para>
323Let's start by describing how the data on a freelist is laid out in memory.
324This is the first two blocks in freelist for thread id 3 in bin 3 (8 bytes):
325</para>
326<programlisting>
327+----------------+
328| next* ---------|--+  (_S_bin[ 3 ].first[ 3 ] points here)
329|                |  |
330|                |  |
331|                |  |
332+----------------+  |
333| thread_id = 3  |  |
334|                |  |
335|                |  |
336|                |  |
337+----------------+  |
338| DATA           |  |  (A pointer to here is what is returned to the
339|                |  |   the application when needed)
340|                |  |
341|                |  |
342|                |  |
343|                |  |
344|                |  |
345|                |  |
346+----------------+  |
347+----------------+  |
348| next*          |&lt;-+  (If next == NULL it's the last one on the list)
349|                |
350|                |
351|                |
352+----------------+
353| thread_id = 3  |
354|                |
355|                |
356|                |
357+----------------+
358| DATA           |
359|                |
360|                |
361|                |
362|                |
363|                |
364|                |
365|                |
366+----------------+
367</programlisting>
368
369<para>
370With this in mind we simplify things a bit for a while and say that there is
371only one thread (a ST application). In this case all operations are made to
372what is referred to as the global pool - thread id 0 (No thread may be
373assigned this id since they span from 1 to _S_max_threads in a MT application).
374</para>
375<para>
376When the application requests memory (calling allocate()) we first look at the
377requested size and if this is &gt; _S_max_bytes we call new() directly and return.
378</para>
379<para>
380If the requested size is within limits we start by finding out from which
381bin we should serve this request by looking in _S_binmap.
382</para>
383<para>
384A quick look at _S_bin[ bin ].first[ 0 ] tells us if there are any blocks of
385this size on the freelist (0). If this is not NULL - fine, just remove the
386block that _S_bin[ bin ].first[ 0 ] points to from the list,
387update _S_bin[ bin ].first[ 0 ] and return a pointer to that blocks data.
388</para>
389<para>
390If the freelist is empty (the pointer is NULL) we must get memory from the
391system and build us a freelist within this memory. All requests for new memory
392is made in chunks of _S_chunk_size. Knowing the size of a block_record and
393the bytes that this bin stores we then calculate how many blocks we can create
394within this chunk, build the list, remove the first block, update the pointer
395(_S_bin[ bin ].first[ 0 ]) and return a pointer to that blocks data.
396</para>
397
398<para>
399Deallocation is equally simple; the pointer is casted back to a block_record
400pointer, lookup which bin to use based on the size, add the block to the front
401of the global freelist and update the pointer as needed
402(_S_bin[ bin ].first[ 0 ]).
403</para>
404
405<para>
406The decision to add deallocated blocks to the front of the freelist was made
407after a set of performance measurements that showed that this is roughly 10%
408faster than maintaining a set of "last pointers" as well.
409</para>
410
411</section>
412
413<section xml:id="allocator.mt.example_multi"><info><title>Multiple Thread Example</title></info>
414<?dbhtml filename="mt_allocator_ex_multi.html"?>
415
416
417<para>
418In the ST example we never used the thread_id variable present in each block.
419Let's start by explaining the purpose of this in a MT application.
420</para>
421
422<para>
423The concept of "ownership" was introduced since many MT applications
424allocate and deallocate memory to shared containers from different
425threads (such as a cache shared amongst all threads). This introduces
426a problem if the allocator only returns memory to the current threads
427freelist (I.e., there might be one thread doing all the allocation and
428thus obtaining ever more memory from the system and another thread
429that is getting a longer and longer freelist - this will in the end
430consume all available memory).
431</para>
432
433<para>
434Each time a block is moved from the global list (where ownership is
435irrelevant), to a threads freelist (or when a new freelist is built
436from a chunk directly onto a threads freelist or when a deallocation
437occurs on a block which was not allocated by the same thread id as the
438one doing the deallocation) the thread id is set to the current one.
439</para>
440
441<para>
442What's the use? Well, when a deallocation occurs we can now look at
443the thread id and find out if it was allocated by another thread id
444and decrease the used counter of that thread instead, thus keeping the
445free and used counters correct. And keeping the free and used counters
446corrects is very important since the relationship between these two
447variables decides if memory should be returned to the global pool or
448not when a deallocation occurs.
449</para>
450
451<para>
452When the application requests memory (calling allocate()) we first
453look at the requested size and if this is &gt;_S_max_bytes we call new()
454directly and return.
455</para>
456
457<para>
458If the requested size is within limits we start by finding out from which
459bin we should serve this request by looking in _S_binmap.
460</para>
461
462<para>
463A call to _S_get_thread_id() returns the thread id for the calling thread
464(and if no value has been set in _S_thread_key, a new id is assigned and
465returned).
466</para>
467
468<para>
469A quick look at _S_bin[ bin ].first[ thread_id ] tells us if there are
470any blocks of this size on the current threads freelist. If this is
471not NULL - fine, just remove the block that _S_bin[ bin ].first[
472thread_id ] points to from the list, update _S_bin[ bin ].first[
473thread_id ], update the free and used counters and return a pointer to
474that blocks data.
475</para>
476
477<para>
478If the freelist is empty (the pointer is NULL) we start by looking at
479the global freelist (0). If there are blocks available on the global
480freelist we lock this bins mutex and move up to block_count (the
481number of blocks of this bins size that will fit into a _S_chunk_size)
482or until end of list - whatever comes first - to the current threads
483freelist and at the same time change the thread_id ownership and
484update the counters and pointers. When the bins mutex has been
485unlocked, we remove the block that _S_bin[ bin ].first[ thread_id ]
486points to from the list, update _S_bin[ bin ].first[ thread_id ],
487update the free and used counters, and return a pointer to that blocks
488data.
489</para>
490
491<para>
492The reason that the number of blocks moved to the current threads
493freelist is limited to block_count is to minimize the chance that a
494subsequent deallocate() call will return the excess blocks to the
495global freelist (based on the _S_freelist_headroom calculation, see
496below).
497</para>
498
499<para>
500However if there isn't any memory on the global pool we need to get
501memory from the system - this is done in exactly the same way as in a
502single threaded application with one major difference; the list built
503in the newly allocated memory (of _S_chunk_size size) is added to the
504current threads freelist instead of to the global.
505</para>
506
507<para>
508The basic process of a deallocation call is simple: always add the
509block to the front of the current threads freelist and update the
510counters and pointers (as described earlier with the specific check of
511ownership that causes the used counter of the thread that originally
512allocated the block to be decreased instead of the current threads
513counter).
514</para>
515
516<para>
517And here comes the free and used counters to service. Each time a
518deallocation() call is made, the length of the current threads
519freelist is compared to the amount memory in use by this thread.
520</para>
521
522<para>
523Let's go back to the example of an application that has one thread
524that does all the allocations and one that deallocates. Both these
525threads use say 516 32-byte blocks that was allocated during thread
526creation for example.  Their used counters will both say 516 at this
527point. The allocation thread now grabs 1000 32-byte blocks and puts
528them in a shared container. The used counter for this thread is now
5291516.
530</para>
531
532<para>
533The deallocation thread now deallocates 500 of these blocks. For each
534deallocation made the used counter of the allocating thread is
535decreased and the freelist of the deallocation thread gets longer and
536longer. But the calculation made in deallocate() will limit the length
537of the freelist in the deallocation thread to _S_freelist_headroom %
538of it's used counter.  In this case, when the freelist (given that the
539_S_freelist_headroom is at it's default value of 10%) exceeds 52
540(516/10) blocks will be returned to the global pool where the
541allocating thread may pick them up and reuse them.
542</para>
543
544<para>
545In order to reduce lock contention (since this requires this bins
546mutex to be locked) this operation is also made in chunks of blocks
547(just like when chunks of blocks are moved from the global freelist to
548a threads freelist mentioned above). The "formula" used can probably
549be improved to further reduce the risk of blocks being "bounced back
550and forth" between freelists.
551</para>
552
553</section>
554
555</chapter>
556