Deleted Added
full compact
implementation.ms (18568) implementation.ms (18684)
1.\"
2.\" ----------------------------------------------------------------------------
3.\" "THE BEER-WARE LICENSE" (Revision 42):
4.\" <phk@login.dknet.dk> wrote this file. As long as you retain this notice you
5.\" can do whatever you want with this stuff. If we meet some day, and you think
6.\" this stuff is worth it, you can buy me a beer in return. Poul-Henning Kamp
7.\" ----------------------------------------------------------------------------
8.\"
1.\"
2.\" ----------------------------------------------------------------------------
3.\" "THE BEER-WARE LICENSE" (Revision 42):
4.\" <phk@login.dknet.dk> wrote this file. As long as you retain this notice you
5.\" can do whatever you want with this stuff. If we meet some day, and you think
6.\" this stuff is worth it, you can buy me a beer in return. Poul-Henning Kamp
7.\" ----------------------------------------------------------------------------
8.\"
9.\" $Id: implementation.ms,v 1.1 1996/04/13 08:30:13 phk Exp $
9.\" $Id: implementation.ms,v 1.2 1996/09/29 18:36:11 phk Exp $
10.\"
11.ds RH Implementation
12.NH
13Implementation
14.PP
15A new malloc(3) implementation was written to meet the goals,
16and to the extent possible to address the shortcomings listed previously.
17.PP
18The source is 1218 lines of C code, and can be found in FreeBSD 2.2
19(and probably later versions as well) as src/lib/libc/stdlib/malloc.c.
20.PP
21The main data structure is the
22.I page-directory
23which contains a
24.B void*
25for each page we have control over.
26The value can be one of:
27.IP
28.B MALLOC_NOT_MINE
29Another part of the code may call brk(2) to get a piece of the cake.
30Consequently we cannot rely on the memory we get from the kernel to
31be one consecutive piece of memory and therefore we need a way to
32mark such pages as "untouchable".
33.IP
34.B MALLOC_FREE
35This is a free page.
36.IP
37.B MALLOC_FIRST
38This is the first page in a (multi-)page allocation.
39.IP
40.B MALLOC_FOLLOW
41This is a subsequent page in a multi-page allocation.
42.IP
43.B
44struct pginfo*
45.R
46A pointer to a structure describing a partitioned page.
47.PP
48In addition there exist a linked list of small data structures that
49describe the free space as runs of free pages.
50.PP
51Notice that these structures are not part of the free pages themselves,
52but rather allocated with malloc so that the free pages themselves
53are never referenced while they are free.
54.PP
55When a request for storage comes in, it will be treated as a ``page''
56allocation if it is bigger than half a page.
57The freelist will be searched and the first run of free pages that
58can satisfy the request is used. The first page gets set to
59.B MALLOC_FIRST
60status, if more than that one page is needed the rest of them gets
61.B MALLOC_FOLLOW
62status in the page-directory.
63.PP
64If there were no pages on the free-list, brk(2) will be called, and
65the pages will get added to the page-directory with status
66.B MALLOC_FREE
67and the search restarts.
68.PP
69Freeing a number of pages is done by changing their state in the
70page directory to MALLOC_FREE, and then traverse the free-pages list to
71find the right place for this run of pages, possibly collapsing
72with the two neighbouring runs into one run and, if it is possible,
73release some memory back to the kernel by calling brk(2).
74.PP
75If the request is less than or equal to half of a page, its size will be
76rounded up to the nearest power of two before being processed
77and if the request is less than some minimum size, it is rounded up to
78that size.
79.PP
80These sub-page allocations are served from pages which are split up
81into some number of equal size chunks.
82For each of these pages a
83.B
84struct pginfo
85.R
86describes the size of the chunks on this page, how many there are,
87how many are free and so on.
88The description consist of a bitmap of used chunks, and various counters
89and numbers used to keep track of the stuff in the page.
90.PP
91For each size of sub-page allocation, the pginfo structures for the
92pages that have free chunks in them form a list.
93The head of these lists are stored in predetermined slots at
94the beginning of the page directory to make access fast.
95.PP
96To allocate a chunk of some size, the head of the list for the
97corresponding size is examined, and a free chunk found, the number
98of free chunks on that page is decreased by one and if zero the
99pginfo structure is unlinked from the list.
100.PP
101To free a chunk, the page is derived from the pointer, the page table
102for that page contains a pointer to the pginfo structure, where the
103free bit is set for the chunk, the number of free chunks increased by
104one, and if equal to one, the pginfo structure is linked into the
105proper place on the list for this size of chunks.
106If the count increases to match the number of chunks on the page, the
107pginfo structure is unlinked from the list and free(3)'ed and the
108actual page itself is free(3)'ed too.
109.PP
110To be 100% correct performance-wise these lists should be ordered
111according to the recent number of accesses to that page. This
112information is not available and it would essentially mean a reordering
113of the list on every memory reference to keep it up-to-date.
114Instead they are ordered according to the address of the pages.
115Interestingly enough, in practice this comes out to almost the same
116thing performance wise.
117.PP
118It's not that surprising after all, it's the difference between
119following the crowd or actively directing where it can go, in both
120ways you can end up in the middle of it all.
121.PP
10.\"
11.ds RH Implementation
12.NH
13Implementation
14.PP
15A new malloc(3) implementation was written to meet the goals,
16and to the extent possible to address the shortcomings listed previously.
17.PP
18The source is 1218 lines of C code, and can be found in FreeBSD 2.2
19(and probably later versions as well) as src/lib/libc/stdlib/malloc.c.
20.PP
21The main data structure is the
22.I page-directory
23which contains a
24.B void*
25for each page we have control over.
26The value can be one of:
27.IP
28.B MALLOC_NOT_MINE
29Another part of the code may call brk(2) to get a piece of the cake.
30Consequently we cannot rely on the memory we get from the kernel to
31be one consecutive piece of memory and therefore we need a way to
32mark such pages as "untouchable".
33.IP
34.B MALLOC_FREE
35This is a free page.
36.IP
37.B MALLOC_FIRST
38This is the first page in a (multi-)page allocation.
39.IP
40.B MALLOC_FOLLOW
41This is a subsequent page in a multi-page allocation.
42.IP
43.B
44struct pginfo*
45.R
46A pointer to a structure describing a partitioned page.
47.PP
48In addition there exist a linked list of small data structures that
49describe the free space as runs of free pages.
50.PP
51Notice that these structures are not part of the free pages themselves,
52but rather allocated with malloc so that the free pages themselves
53are never referenced while they are free.
54.PP
55When a request for storage comes in, it will be treated as a ``page''
56allocation if it is bigger than half a page.
57The freelist will be searched and the first run of free pages that
58can satisfy the request is used. The first page gets set to
59.B MALLOC_FIRST
60status, if more than that one page is needed the rest of them gets
61.B MALLOC_FOLLOW
62status in the page-directory.
63.PP
64If there were no pages on the free-list, brk(2) will be called, and
65the pages will get added to the page-directory with status
66.B MALLOC_FREE
67and the search restarts.
68.PP
69Freeing a number of pages is done by changing their state in the
70page directory to MALLOC_FREE, and then traverse the free-pages list to
71find the right place for this run of pages, possibly collapsing
72with the two neighbouring runs into one run and, if it is possible,
73release some memory back to the kernel by calling brk(2).
74.PP
75If the request is less than or equal to half of a page, its size will be
76rounded up to the nearest power of two before being processed
77and if the request is less than some minimum size, it is rounded up to
78that size.
79.PP
80These sub-page allocations are served from pages which are split up
81into some number of equal size chunks.
82For each of these pages a
83.B
84struct pginfo
85.R
86describes the size of the chunks on this page, how many there are,
87how many are free and so on.
88The description consist of a bitmap of used chunks, and various counters
89and numbers used to keep track of the stuff in the page.
90.PP
91For each size of sub-page allocation, the pginfo structures for the
92pages that have free chunks in them form a list.
93The head of these lists are stored in predetermined slots at
94the beginning of the page directory to make access fast.
95.PP
96To allocate a chunk of some size, the head of the list for the
97corresponding size is examined, and a free chunk found, the number
98of free chunks on that page is decreased by one and if zero the
99pginfo structure is unlinked from the list.
100.PP
101To free a chunk, the page is derived from the pointer, the page table
102for that page contains a pointer to the pginfo structure, where the
103free bit is set for the chunk, the number of free chunks increased by
104one, and if equal to one, the pginfo structure is linked into the
105proper place on the list for this size of chunks.
106If the count increases to match the number of chunks on the page, the
107pginfo structure is unlinked from the list and free(3)'ed and the
108actual page itself is free(3)'ed too.
109.PP
110To be 100% correct performance-wise these lists should be ordered
111according to the recent number of accesses to that page. This
112information is not available and it would essentially mean a reordering
113of the list on every memory reference to keep it up-to-date.
114Instead they are ordered according to the address of the pages.
115Interestingly enough, in practice this comes out to almost the same
116thing performance wise.
117.PP
118It's not that surprising after all, it's the difference between
119following the crowd or actively directing where it can go, in both
120ways you can end up in the middle of it all.
121.PP
122The sideffect of this compromise is that it also uses less storage,
122The side effect of this compromise is that it also uses less storage,
123and the list never has to be reordered, all the ordering happens when
124pages are added or deleted.
125.PP
126It is an interesting twist to the implementation that the
127.B
128struct pginfo
129.R
130Is allocated with malloc.
131That is, "as with malloc" to be painfully correct.
132The code knows the special case where the first (couple) of allocations on
133the page is actually the pginfo structure and deals with it accordingly.
134This avoids some silly "chicken and egg" issues.
135.ds RH Bells and whistles.
136.NH
137Bells and whistles.
138.PP
139brk(2) is actually not a very fast system call when you ask for storage.
140This is mainly because of the need by the kernel to zero the pages before
141handing them over, so therefore this implementation does not release
142back heap-pages, until there is a large chunk to release back to the kernel.
143Chances are pretty good that we will need it again pretty soon anyway.
144Since these pages are not accessed at all, they will soon be paged out
145and don't affect anything but swap-space usage.
146.PP
147The page directory is actually kept in a mmap(2)'ed piece of
148anonymous memory. This avoids some rather silly cases that
149we would otherwise have to be handled when the page directory
150has to be extended.
151.PP
152One particular nice feature is that all pointers passed to free(3)
153and realloc(3) can be checked conclusively for validity:
154First the pointer is masked to find the page. The page directory
155is then examined, it must contain either MALLOC_FIRST, in which
156case the pointer must point exactly at the page, or it can contain
157a struct pginfo*, in which case the pointer must point to a one of
158the chunks described by that structure.
159Warnings will be printed on stderr and nothing will be done with
160the pointer in case it is found to be invalid.
161.PP
162An environment variable
163.B MALLOC_OPTIONS
164allows the user some control over the behaviour of malloc.
165Some of the more interesting options are:
166.IP
167.B Abort
168If malloc fails to allocate storage, core-dump the process with
169a message rather than expect it handle this correctly.
170It's amazing how few programs actually handle this condition correctly,
171and consequently the havoc they can create is the more creative or
172destructive.
173.IP
174.B Realloc
175Always do a free and malloc when realloc(3) is called. The default
176is to leave things alone if the size of the allocation is still in
177the same size-class.
178For programs doing garbage collect using realloc(3) this make the
179heap collapse faster. Since the malloc will reallocate from the
180lowest available address.
181.IP
182.B Junk
183will explicitly fill the allocated area with a particular value
184to try to detect if programs rely on it being zero.
185.IP
186.B Zero
187will explicitly zero out the allocated chunk of memory, while any
188space after the allocation in the chunk will be filled with the
189junk value to try to catch out of the chunk references.
190.ds RH The road not taken.
191.NH
192The road yet not taken.
193.PP
194A couple of avenues were explored that could be interesting in some
195set of circumstances.
196.PP
197Using mmap(2) instead of brk(2) was actually slower, since brk(2)
198knows a lot of the things that mmap has to find out first.
199.PP
200A system call where we could tell the kernel that "we don't
201need the contents of this page anymore" would allow us to
202return the pages on the free list to the kernel and to instruct
203the kernel that it doesn't need to page it out nor in.
204It would save some page-out events, and the page-in would be replaced
205by a zero-fill page.
206This is, according to the VM goods in the FreeBSD camp, "easy",
207and it will probably be attempted at some point in the future.
208.PP
209In general there is little room for further improvement of the
210time-overhead of the malloc, further improvements will have to
211be in the area of improving paging behaviour.
212.PP
213It is still under consideration to add a feature such that
214if realloc is called with two zero arguments, the internal
215allocations will be reallocated to perform a garbage collect.
216This could be used in certain types of programs to collapse
217the memory use, but so far it doesn't seem to be worth the effort.
218.PP
219Malloc/Free can be a significant point of contention in multi-threaded
220programs. Low-grain locking of the data-structures inside the
221implementation should be implemented to avoid excessive spin-waiting.
222
123and the list never has to be reordered, all the ordering happens when
124pages are added or deleted.
125.PP
126It is an interesting twist to the implementation that the
127.B
128struct pginfo
129.R
130Is allocated with malloc.
131That is, "as with malloc" to be painfully correct.
132The code knows the special case where the first (couple) of allocations on
133the page is actually the pginfo structure and deals with it accordingly.
134This avoids some silly "chicken and egg" issues.
135.ds RH Bells and whistles.
136.NH
137Bells and whistles.
138.PP
139brk(2) is actually not a very fast system call when you ask for storage.
140This is mainly because of the need by the kernel to zero the pages before
141handing them over, so therefore this implementation does not release
142back heap-pages, until there is a large chunk to release back to the kernel.
143Chances are pretty good that we will need it again pretty soon anyway.
144Since these pages are not accessed at all, they will soon be paged out
145and don't affect anything but swap-space usage.
146.PP
147The page directory is actually kept in a mmap(2)'ed piece of
148anonymous memory. This avoids some rather silly cases that
149we would otherwise have to be handled when the page directory
150has to be extended.
151.PP
152One particular nice feature is that all pointers passed to free(3)
153and realloc(3) can be checked conclusively for validity:
154First the pointer is masked to find the page. The page directory
155is then examined, it must contain either MALLOC_FIRST, in which
156case the pointer must point exactly at the page, or it can contain
157a struct pginfo*, in which case the pointer must point to a one of
158the chunks described by that structure.
159Warnings will be printed on stderr and nothing will be done with
160the pointer in case it is found to be invalid.
161.PP
162An environment variable
163.B MALLOC_OPTIONS
164allows the user some control over the behaviour of malloc.
165Some of the more interesting options are:
166.IP
167.B Abort
168If malloc fails to allocate storage, core-dump the process with
169a message rather than expect it handle this correctly.
170It's amazing how few programs actually handle this condition correctly,
171and consequently the havoc they can create is the more creative or
172destructive.
173.IP
174.B Realloc
175Always do a free and malloc when realloc(3) is called. The default
176is to leave things alone if the size of the allocation is still in
177the same size-class.
178For programs doing garbage collect using realloc(3) this make the
179heap collapse faster. Since the malloc will reallocate from the
180lowest available address.
181.IP
182.B Junk
183will explicitly fill the allocated area with a particular value
184to try to detect if programs rely on it being zero.
185.IP
186.B Zero
187will explicitly zero out the allocated chunk of memory, while any
188space after the allocation in the chunk will be filled with the
189junk value to try to catch out of the chunk references.
190.ds RH The road not taken.
191.NH
192The road yet not taken.
193.PP
194A couple of avenues were explored that could be interesting in some
195set of circumstances.
196.PP
197Using mmap(2) instead of brk(2) was actually slower, since brk(2)
198knows a lot of the things that mmap has to find out first.
199.PP
200A system call where we could tell the kernel that "we don't
201need the contents of this page anymore" would allow us to
202return the pages on the free list to the kernel and to instruct
203the kernel that it doesn't need to page it out nor in.
204It would save some page-out events, and the page-in would be replaced
205by a zero-fill page.
206This is, according to the VM goods in the FreeBSD camp, "easy",
207and it will probably be attempted at some point in the future.
208.PP
209In general there is little room for further improvement of the
210time-overhead of the malloc, further improvements will have to
211be in the area of improving paging behaviour.
212.PP
213It is still under consideration to add a feature such that
214if realloc is called with two zero arguments, the internal
215allocations will be reallocated to perform a garbage collect.
216This could be used in certain types of programs to collapse
217the memory use, but so far it doesn't seem to be worth the effort.
218.PP
219Malloc/Free can be a significant point of contention in multi-threaded
220programs. Low-grain locking of the data-structures inside the
221implementation should be implemented to avoid excessive spin-waiting.
222