History log of /freebsd-current/sys/vm/uma_core.c
Revision Date Author Comments
# d25ed650 26-May-2024 Bojan Novković <bnovkov@FreeBSD.org>

uma: Fix improper uses of UMA_MD_SMALL_ALLOC

UMA_MD_SMALL_ALLOC was recently replaced by UMA_USE_DMAP, but
da76d349b6b1 missed some improper uses of the old symbol.
This change makes sure that UMA_USE_DMAP is used properly in
code that selects uma_small_alloc.

Fixes: da76d349b6b1
Reported by: eduardo, rlibby
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D45368


# 0a44b8a5 03-May-2024 Bojan Novković <bnovkov@FreeBSD.org>

vm: Simplify startup page dumping conditional

This commit introduces the MINIDUMP_STARTUP_PAGE_TRACKING symbol and
uses it to simplify several instances of a complex preprocessor conditional
for adding pages allocated when bootstrapping the kernel to minidumps.

Reviewed by: markj, mhorne
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D45085


# da76d349 03-May-2024 Bojan Novković <bnovkov@FreeBSD.org>

uma: Deduplicate uma_small_alloc

This commit refactors the UMA small alloc code and
removes most UMA machine-dependent code.
The existing machine-dependent uma_small_alloc code is almost identical
across all architectures, except for powerpc where using the direct
map addresses involved extra steps in some cases.

The MI/MD split was replaced by a default uma_small_alloc
implementation that can be overridden by architecture-specific code by
defining the UMA_MD_SMALL_ALLOC symbol. Furthermore, UMA_USE_DMAP was
introduced to replace most UMA_MD_SMALL_ALLOC uses.

Reviewed by: markj, kib
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D45084


# a03c2393 09-Nov-2023 Alexander Motin <mav@FreeBSD.org>

uma: Improve memory modified after free panic messages

- Pass zone pointer to trash_ctor() and report zone name in the panic
message. It may be difficult to figure out the zone just by the item size.
- Do not pass user arguments to internal trash calls; pass the zone.
- Report malloc type name in the same unified panic message.
- Report corruption offset from the beginning of the items instead of
the full pointer. It makes panic message shorter and more readable.


# 87090f5e 13-Oct-2023 Olivier Certner <olce.freebsd@certner.fr>

uma: New check_align_mask(): Validate alignments (INVARIANTS)

New function check_align_mask() asserts (under INVARIANTS) that the mask
fits in a (signed) integer (see the comment) and that the corresponding
alignment is a power of two.

Use check_align_mask() in uma_set_align_mask() and also in uma_zcreate()
to replace the KASSERT() there (that was checking only for a power of
2).

Reviewed by: kib, markj
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D42263


# 3d8f548b 13-Oct-2023 Olivier Certner <olce.freebsd@certner.fr>

uma: Make the cache alignment mask unsigned

In uma_set_align_mask(), ensure that the passed value doesn't have its
highest bit set, which would lead to problems since keg/zone alignment
is internally stored as signed integers. Such big values do not make
sense anyway and indicate some programming error. A future commit will
introduce checks for this case and other ones.

Reviewed by: kib, markj
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D42262


# e557eafe 13-Oct-2023 Olivier Certner <olce.freebsd@certner.fr>

uma: UMA_ALIGN_CACHE: Resolve the proper value at use point

Having a special value of -1 that is resolved internally to
'uma_align_cache' provides no significant advantages and prevents
changing that variable to an unsigned type, which is natural for an
alignment mask. So suppress it and replace its use with a call to
uma_get_align_mask(). The small overhead of the added function call is
irrelevant since UMA_ALIGN_CACHE is only used when creating new zones,
which is not performance critical.

Reviewed by: markj, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D42259


# dc8f7692 13-Oct-2023 Olivier Certner <olce.freebsd@certner.fr>

uma: Hide 'uma_align_cache'; Create/rename accessors

Create the uma_get_cache_align_mask() accessor and put it in a separate
private header so as to minimize namespace pollution in header/source
files that need only this function and not the whole 'uma.h' header.

Make sure the accessors have '_mask' as a suffix, so that callers are
aware that the real alignment is the power of two that is the mask plus
one. Rename the stem to something more explicit. Rename
uma_set_cache_align_mask()'s single parameter to 'mask'.

Hide 'uma_align_cache' to ensure that it cannot be set in any other way
than by a call to uma_set_cache_align_mask(), which will perform sanity
checks in a further commit. While here, rename it to
'uma_cache_align_mask'.

This is also in preparation for some further changes, such as improving
the sanity checks, eliminating internal resolving of UMA_ALIGN_CACHE and
changing the type of the 'uma_cache_align_mask' variable.

Reviewed by: markj, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D42258


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 4d846d26 10-May-2023 Warner Losh <imp@FreeBSD.org>

spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD

The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.

Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix


# 2dba2288 19-Oct-2022 Mark Johnston <markj@FreeBSD.org>

uma: Never pass cache zones to memguard

Items allocated from cache zones cannot usefully be protected by
memguard.

PR: 267151
Reported and tested by: pho
MFC after: 1 week


# f49fd63a 22-Sep-2022 John Baldwin <jhb@FreeBSD.org>

kmem_malloc/free: Use void * instead of vm_offset_t for kernel pointers.

Reviewed by: kib, markj
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D36549


# b9fd884a 12-Aug-2022 Colin Percival <cperciva@FreeBSD.org>

sys/vm: Add TSLOG to some functions

The functions pbuf_init, kva_alloc, and keg_alloc_slab are significant
contributors to the kernel boot time when FreeBSD boots inside the
Firecracker VMM. Instrument them so they show up on flamecharts.


# c84c5e00 18-Jul-2022 Mitchell Horne <mhorne@FreeBSD.org>

ddb: annotate some commands with DB_CMD_MEMSAFE

This is not completely exhaustive, but covers a large majority of
commands in the tree.

Reviewed by: markj
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D35583


# 31508912 13-Jul-2022 Mark Johnston <markj@FreeBSD.org>

uma: Apply a missed piece of review feedback from D35738

Fixes: 93cd28ea82bb ("uma: Use a taskqueue to execute uma_timeout()")


# 93cd28ea 11-Jul-2022 Mark Johnston <markj@FreeBSD.org>

uma: Use a taskqueue to execute uma_timeout()

uma_timeout() has several responsibilities; it visits every UMA zone and,
as of recently, drains underutilized caches, so it is rather expensive
(>1ms in some cases). Currently it is executed by softclock threads
and so will preempt most other CPU activity. None of this work requires
a high scheduling priority, though, so defer it to a taskqueue so as to
avoid stalling higher-priority work.

Reviewed by: rlibby, alc, mav, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35738


# a932a5a6 20-Jun-2022 Mark Johnston <markj@FreeBSD.org>

uma: Mark zeroed slabs as initialized for KMSAN

Otherwise zone initializers can produce false positives, e.g., when
lock_init() attempts to detect double initialization.

Sponsored by: The FreeBSD Foundation


# a7e1a585 08-Apr-2022 John Baldwin <jhb@FreeBSD.org>

uma_zfree_smr: uz_flags is only used if NUMA is defined.


# d53927b0 30-Mar-2022 Mark Johnston <markj@FreeBSD.org>

uma: Don't allow a limit to be set in a warm zone

The limit accounting in UMA does not tolerate this.

MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# 54361f90 30-Mar-2022 Mark Johnston <markj@FreeBSD.org>

uma: Use the correct type for a return value

zone_alloc_bucket() returns a pointer, not a bool.

MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# 490b09f2 07-Mar-2022 Eric van Gyzen <vangyzen@FreeBSD.org>

uma_zalloc_domain: call uma_zalloc_debug in multi-domain path

It was only called in the non-NUMA and single-domain paths.
Some of its assertions were duplicated in uma_zalloc_domain,
but some things were missed, especially memguard.

Reviewed by: markj, rstone
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D34472


# a8cbb835 04-Mar-2022 Eric van Gyzen <vangyzen@FreeBSD.org>

uma_zalloc: assert M_NOWAIT ^ M_WAITOK

The uma_zalloc functions expect exactly one of [M_NOWAIT, M_WAITOK].
If neither or both are passed, print an error and a stack dump.
Only do this ten times, to prevent livelock. In the future, after
this exposes enough bad callers, this will be changed to a KASSERT().
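
A rough sketch of the described check (illustrative only; names such as
uma_flag_reports are hypothetical, not the commit's actual diff):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/malloc.h>
    #include <sys/kdb.h>
    #include <machine/atomic.h>

    static int uma_flag_reports;        /* hypothetical rate limiter */

    static void
    uma_zalloc_check_flags(int flags)
    {
            int w;

            w = flags & (M_NOWAIT | M_WAITOK);
            /* Exactly one of M_NOWAIT and M_WAITOK must be set. */
            if (__predict_false(w == 0 || w == (M_NOWAIT | M_WAITOK))) {
                    /* Report at most ten offenders to prevent livelock. */
                    if (atomic_fetchadd_int(&uma_flag_reports, 1) < 10) {
                            printf("uma_zalloc: pass exactly one of "
                                "M_NOWAIT and M_WAITOK\n");
                            kdb_backtrace();
                    }
            }
    }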

Reviewed by: rstone, markj
MFC after: 1 month
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D34452


# 389a3fa6 15-Feb-2022 Mark Johnston <markj@FreeBSD.org>

uma: Add UMA_ZONE_UNMANAGED

Allow a zone to opt out of cache size management. In particular,
uma_reclaim() and uma_reclaim_domain() will not reclaim any memory from
the zone, nor will uma_timeout() purge cached items if the zone is idle.
This effectively means that the zone consumer has control over when
items are reclaimed from the cache. In particular, uma_zone_reclaim()
will still reclaim cached items from an unmanaged zone.
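
A hedged usage sketch (the zone name and item type are illustrative):

    #include <vm/uma.h>

    static uma_zone_t myobj_zone;       /* hypothetical consumer zone */

    /* Opt out of automatic trimming by uma_reclaim()/uma_timeout(). */
    myobj_zone = uma_zcreate("myobj", sizeof(struct myobj), NULL, NULL,
        NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_UNMANAGED);

    /* Later, at a moment the consumer chooses, release cached items. */
    uma_zone_reclaim(myobj_zone, UMA_RECLAIM_DRAIN);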

Reviewed by: hselasky, kib
MFC after: 3 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34142


# a04ce833 14-Jan-2022 Mark Johnston <markj@FreeBSD.org>

uma: Avoid polling for an invalid SMR sequence number

Buckets in an SMR-enabled zone can legitimately be tagged with
SMR_SEQ_INVALID. This effectively means that the zone destructor (if
any) was invoked on all items in the bucket, and the contained memory is
safe to reuse. If the first bucket in the full bucket list was tagged
this way, UMA would unnecessarily poll per-CPU state before attempting
to fetch a full bucket from the list.

MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# c25a30e2 05-Jan-2022 Konstantin Belousov <kib@FreeBSD.org>

Dump page tracking no longer needed on mips

Reviewed by: imp
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D33763


# 841e0a87 30-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

uma: with KTR trace allocs/frees from SMR zones


# 28782f73 30-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

uma: with KTR report item being freed in uma_zfree_arg()


# 2cb67bd7 05-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

uma: remove unused *item argument from cache_free()

Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D33272


# 7585c5db 01-Nov-2021 Mark Johnston <markj@FreeBSD.org>

uma: Fix handling of reserves in zone_import()

Kegs with no items reserved have uk_reserve = 0. So the check
keg->uk_reserve >= dom->ud_free_items will be true once all slabs are
depleted. Then, rather than go and allocate a fresh slab, we return to
the cache layer.

The intent was to do this only when the keg actually has a reserve, so
modify the check to verify this first. Another approach would be to
make uk_reserve signed and set it to -1 until uma_zone_reserve() is
called, but this requires a few casts elsewhere.
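
The corrected condition, roughly (field names as in uma_int.h; logic
paraphrased from the description above, not the actual diff):

    /*
     * Stop importing once a configured reserve would be dipped into;
     * kegs with uk_reserve == 0 keep allocating fresh slabs as before.
     */
    if (keg->uk_reserve > 0 && dom->ud_free_items <= keg->uk_reserve)
            break;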

Fixes: 1b2dcc8c54a8 ("uma: Avoid depleting keg reserves when filling a bucket")
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32516


# fab343a7 01-Nov-2021 Mark Johnston <markj@FreeBSD.org>

uma: Improve M_USE_RESERVE handling in keg_fetch_slab()

M_USE_RESERVE is used in a couple of places in the VM to avoid unbounded
recursion when the direct map is not available, as is the case on 32-bit
platforms or when certain kernel sanitizers (KASAN and KMSAN) are
enabled. For example, to allocate KVA, the kernel might allocate a
kernel map entry, which might require a new slab, which requires KVA.

For these zones, we use uma_prealloc() to populate a reserve of items,
and then in certain serialized contexts M_USE_RESERVE can be used to
guarantee a successful allocation. uma_prealloc() allocates the
requested number of items, distributing them evenly among NUMA domains.
Thus, in a first-touch zone, to satisfy an M_USE_RESERVE allocation we
might have to check the slab lists of other domains than the current one
to provide the semantics expected by consumers.

So, try harder to find an item if M_USE_RESERVE is specified and the keg
doesn't have anything for current (first-touch) domain. Specifically,
fall back to a round-robin slab allocation. This change fixes boot-time
panics on NUMA systems with KASAN or KMSAN enabled.[1]

Alternately we could have uma_prealloc() allocate the requested number
of items for each domain, but for some existing consumers this would be
quite wasteful. In general I think keg_fetch_slab() should try harder
to find free slabs in other domains before trying to allocate fresh
ones, but let's limit this to M_USE_RESERVE for now.

Also fix a separate problem that I noticed: in a non-round-robin slab
allocation with M_WAITOK, rather than sleeping after a failed slab
allocation we simply try again. Call vm_wait_domain() before retrying.

Reported by: mjg, tuexen [1]
Reviewed by: alc
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32515


# a9d6f1fe 19-Oct-2021 Mark Johnston <markj@FreeBSD.org>

Remove some remaining references to VM_ALLOC_NOOBJ

Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32037


# 84c39222 19-Oct-2021 Mark Johnston <markj@FreeBSD.org>

Convert consumers to vm_page_alloc_noobj_contig()

Remove now-unneeded page zeroing. No functional change intended.

Reviewed by: alc, hselasky, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32006


# a4667e09 19-Oct-2021 Mark Johnston <markj@FreeBSD.org>

Convert vm_page_alloc() callers to use vm_page_alloc_noobj().

Remove page zeroing code from consumers and stop specifying
VM_ALLOC_NOOBJ. In a few places, also convert an allocation loop to
simply use VM_ALLOC_WAITOK.

Similarly, convert vm_page_alloc_domain() callers.

Note that callers are now responsible for assigning the pindex.

Reviewed by: alc, hselasky, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31986


# d6e77cda 16-Sep-2021 Mark Johnston <markj@FreeBSD.org>

uma: Show the count of free slabs in each per-domain keg's sysctl tree

This is useful for measuring the number of pages that could be freed
from a NOFREE zone under memory pressure.

MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# 10094910 10-Aug-2021 Mark Johnston <markj@FreeBSD.org>

uma: Add KMSAN hooks

For now, just hook the allocation path: upon allocation, items are
marked as initialized (absent M_ZERO). Some zones are exempted from
this when it would otherwise raise false positives.

Use kmsan_orig() to update the origin map for UMA and malloc(9)
allocations. This allows KMSAN to print the return address when an
uninitialized UMA item is implicated in a report. For example:
panic: MSan: Uninitialized UMA memory from m_getm2+0x7fe

Sponsored by: The FreeBSD Foundation


# b0dfc486 09-Jul-2021 Mark Johnston <markj@FreeBSD.org>

uma: Fix a few problems with KASAN integration

- Ensure that all items returned by UMA are aligned to
KASAN_SHADOW_SCALE (8). This was true in practice since smaller
alignments are not used by any consumers, but we should enforce it
anyway.
- Use a non-zero code for marking redzones that appear naturally in
items that are not a multiple of the scale factor in size. Currently
we do not modify keg layouts to force the creation of redzones.
- Use a non-zero code for marking freed per-CPU items, otherwise
accesses of freed per-CPU items are not detected by the runtime.

Sponsored by: The FreeBSD Foundation


# 9a7c2de3 05-May-2021 Mark Johnston <markj@FreeBSD.org>

realloc: Fix KASAN(9) shadow map updates

When copying from the old buffer to the new buffer, we don't know the
requested size of the old allocation, but only the size of the
allocation provided by UMA. This value is "alloc". Because the copy
may access bytes in the old allocation's red zone, we must mark the full
allocation valid in the shadow map. Do so using the correct size.

Reported by: kp
Tested by: kp
Sponsored by: The FreeBSD Foundation


# 2760658b 02-May-2021 Alexander Motin <mav@FreeBSD.org>

Improve UMA cache reclamation.

When estimating the working set size, measure only allocation batches, not
free batches. Allocation and free patterns can be very different. For
example, on a vm_lowmem event ZFS can free a few gigabytes of memory to UMA
in one call, but that does not mean it will request the same amount back
that fast too; in fact it won't.

Update the working set size on every reclamation call, shrinking caches
faster under pressure. The lack of this caused repeating vm_lowmem events,
squeezing more and more memory out of real consumers only to leave it stuck
in UMA caches. I saw ZFS drop its ARC size in half before the previous
algorithm's periodic WSS update decided to reclaim the UMA caches.

Introduce voluntary reclamation of UMA caches not used for a long time. For
each zdom, track a long-term minimal cache size watermark, freeing some
unused items every UMA_TIMEOUT after the first 15 minutes without cache
misses. Freed memory can be put to better use by other consumers. For
example, ZFS won't grow its ARC unless it sees free memory, since it does
not know the memory is not really used. And even if the memory is not
really needed, periodically freeing it during inactivity periods should
reduce its fragmentation.

Reviewed by: markj, jeff (previous version)
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D29790


# aabe13f1 13-Apr-2021 Mark Johnston <markj@FreeBSD.org>

uma: Introduce per-domain reclamation functions

Make it possible to reclaim items from a specific NUMA domain.

- Add uma_zone_reclaim_domain() and uma_reclaim_domain().
- Permit parallel reclamations. Use a counter instead of a flag to
synchronize with zone_dtor().
- Use the zone lock to protect cache_shrink() now that parallel reclaims
can happen.
- Add a sysctl that can be used to trigger reclamation from a specific
domain.

Currently the new KPIs are unused, so there should be no functional
change.
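
A hedged sketch of the new KPIs as described above:

    /* Reclaim cached items belonging only to NUMA domain 'domain'. */
    uma_reclaim_domain(UMA_RECLAIM_DRAIN, domain);
    uma_zone_reclaim_domain(zone, UMA_RECLAIM_DRAIN, domain);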

Reviewed by: mav
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29685


# 54f421f9 09-Apr-2021 Mark Johnston <markj@FreeBSD.org>

uma: Split bucket_cache_drain() to permit per-domain reclamation

Note that the per-domain variant does not shrink the target bucket size.

No functional change intended.

MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# 09c8cb71 13-Apr-2021 Mark Johnston <markj@FreeBSD.org>

uma: Add KASAN state transitions

- Add a UMA_ZONE_NOKASAN flag to indicate that items from a particular
zone should not be sanitized. This is applied implicitly for NOFREE
and cache zones.
- Add KASAN call backs which get invoked:
1) when a slab is imported into a keg
2) when an item is allocated from a zone
3) when an item is freed to a zone
4) when a slab is freed back to the VM

In state transitions 1 and 3, memory is poisoned so that accesses will
trigger a panic. In state transitions 2 and 4, memory is marked
valid; a sketch of transitions 2 and 3 follows after this list.
- Disable trashing if KASAN is enabled. It just adds extra CPU overhead
to catch problems that are detected by KASAN.
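
A sketch of transitions 2 and 3 (kasan_mark() is from sys/asan.h; the
redzone code values and the variable names 'item', 'size' and 'rsize',
the item pointer, usable size and redzoned size, are assumptions):

    /* 2) Item allocated from a zone: make [item, item + size) valid. */
    kasan_mark(item, size, rsize, KASAN_GENERIC_REDZONE);

    /* 3) Item freed to a zone: poison the whole item. */
    kasan_mark(item, 0, rsize, KASAN_UMA_FREED);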

MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29456


# b8f7267d 10-Mar-2021 Kristof Provost <kp@FreeBSD.org>

uma: allow uma_zfree_pcpu(..., NULL)

We already allow free(NULL) and uma_zfree(..., NULL). Make
uma_zfree_pcpu(..., NULL) work as well.
This also means that counter_u64_free(NULL) will work.

These make cleanup code simpler.
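
For instance, a teardown path no longer needs NULL guards (the softc
fields and zone name here are hypothetical):

    /* Safe even if these were never allocated. */
    counter_u64_free(sc->rx_errors);
    uma_zfree_pcpu(my_pcpu_zone, sc->scratch);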

MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D29189


# 537f92cd 22-Feb-2021 Mark Johnston <markj@FreeBSD.org>

uma: Update the comment above startup_alloc() to reflect reality

The scheme used for early slab allocations changed in commit a81c400e75.

Reported by: alc
Reviewed by: alc
MFC after: 1 week


# 663de81f 03-Jan-2021 Mark Johnston <markj@FreeBSD.org>

uma: Avoid unmapping direct-mapped slabs

startup_alloc() uses pmap_map() to map slabs used for bootstrapping the
VM. pmap_map() may ignore the hint address and simply return a range
from the direct map. In this case we must not unmap the range in
startup_free().

UMA uses bootstart and bootmem to track the range of KVA into which
slabs are mapped if the direct map is not used. Unmap a startup slab
only if it was mapped into that range.

Reported by: alc
Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27885


# 942951ba 31-Dec-2020 Ryan Libby <rlibby@FreeBSD.org>

uma dbg: catch more corruption with atomics

Use atomic testandset and testandclear to catch concurrent double free,
and to reduce the number of atomic operations.

Submitted by: jeff
Reviewed by: cem, kib, markj (all previous version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22703


# e574d407 06-Dec-2020 Mark Johnston <markj@FreeBSD.org>

uma: Make uma_zone_set_maxcache() work better with small limits

The old implementation chose the largest bucket zone such that if the
per-CPU caches are fully populated, the total number of items cached is
no larger than the specified limit. If no such zone existed, UMA would
not do any caching.

We can now use uz_bucket_size_max to set a precise limit on the number
of items in a zone's bucket, so the total size of per-CPU caches can be
bounded more easily. Implement a new policy in uma_zone_set_maxcache():
choose a bucket size such that up to half of the limit can be cached in
per-CPU caches, with the rest going to the full bucket cache. This
fixes a problem with the kstack_cache zone: the limit of 4 * mp_ncpus
items meant that the zone would not do any caching, defeating the whole
purpose of the zone. That's because the smallest bucket size holds up
to 2 items and we may cache up to 3 full buckets per CPU, and
2 * 3 * mp_ncpus > 4 * mp_ncpus.
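
Hedged usage sketch for the kstack_cache case described above:

    /*
     * Cap cached items at 4 * mp_ncpus; under the new policy up to
     * half may sit in per-CPU buckets, the rest in the bucket cache.
     */
    uma_zone_set_maxcache(kstack_cache, 4 * mp_ncpus);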

Reported by: mjg
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27168


# f8b6c515 06-Dec-2020 Mark Johnston <markj@FreeBSD.org>

uma: Enforce the use of uz_bucket_size_max in the free path

uz_bucket_size_max is the maximum permitted bucket size. When filling a
new bucket to satisfy uma_zalloc(), the bucket is populated with at most
uz_bucket_size_max items. The maximum number of entries in the bucket
may be larger. When freeing items, however, we will fill per-CPU
buckets up to their maximum number of entries, potentially exceeding
uz_bucket_size_max. This makes it difficult to precisely limit the
number of items that may be cached in a zone. For example, if one wants
to limit buckets to 1 entry for a particular zone, that's not possible
since the smallest bucket holds up to 2 entries.

Try to solve the problem by using uz_bucket_size_max to limit the number
of entries in a bucket. Note that the ub_entries field is initialized
upon every bucket allocation. Most zones are not affected since they do
not impose any specific limit on the maximum bucket size.

While here, remove the UMA_ZONE_MINBUCKET flag. It was unused and we
now have uma_zone_set_maxcache() to control the zone's cache size more
precisely.

Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27167


# 8a6776ca 06-Dec-2020 Mark Johnston <markj@FreeBSD.org>

uma: Use atomic load for uz_sleepers

This field is updated locklessly.

Sponsored by: The FreeBSD Foundation


# 991f23ef 30-Nov-2020 Mark Johnston <markj@FreeBSD.org>

uma: Avoid allocating buckets with the cross-domain lock held

Allocation of a bucket can trigger a cross-domain free in the bucket
zone, e.g., if the per-CPU alloc bucket is empty, we free it and get
migrated to a remote domain. This can lead to deadlocks since a bucket
zone may allocate buckets from itself or a pair of bucket zones could be
allocating from each other.

Fix the problem by dropping the cross-domain lock before allocating a
new bucket and handling refill races. Use a list of empty buckets to
ensure that we can make forward progress.
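
The shape of the fix, roughly (lock macros as used internally by
uma_core.c; simplified, not the actual diff):

    /* Never allocate a bucket while holding the cross-domain lock. */
    ZONE_CROSS_UNLOCK(zone);
    b = bucket_alloc(zone, udata, M_NOWAIT);
    ZONE_CROSS_LOCK(zone);
    /* Recheck here: another thread may have refilled in the window. */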

Reported by: imp, mjg (witness(9) warnings)
Discussed with: jeff
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27341


# 431fb8ab 18-Nov-2020 Mark Johnston <markj@FreeBSD.org>

vm_phys: Try to clean up NUMA KPIs

It can be useful for code outside the VM system to look up the NUMA domain
of a page backing a virtual or physical address, specifically when
creating NUMA-aware data structures. We have _vm_phys_domain() for
this, but the leading underscore implies that it's an internal function,
and vm_phys.h has dependencies on a number of other headers.

Rename vm_phys_domain() to vm_page_domain(), and _vm_phys_domain() to
vm_phys_domain(). Make the latter an inline function.

Add _vm_phys.h and define struct vm_phys_seg there so that it's easier
to use in other headers. Include it from vm_page.h so that
vm_page_domain() can be defined there.

Include machine/vmparam.h from _vm_phys.h since it depends directly on
some constants defined there.
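
Hedged usage sketch of the renamed KPIs:

    #include <vm/vm_page.h>

    int d1 = vm_phys_domain(pa);    /* domain of a physical address */
    int d2 = vm_page_domain(m);     /* domain of a vm_page */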

Reviewed by: alc
Reviewed by: dougm, kib (earlier versions)
Differential Revision: https://reviews.freebsd.org/D27207


# 7b516613 10-Nov-2020 Jonathan T. Looney <jtl@FreeBSD.org>

When destroying a UMA zone which has a reserve (set with
uma_zone_reserve()), messages like the following appear on the console:
"Freed UMA keg (Test zone) was not empty (0 items). Lost 528 pages of
memory."

When keg_drain_domain() is draining the zone, it tries to keep the number
of items specified in the reservation. However, when we are destroying the
UMA zone, we do not need to keep those items. Therefore, when destroying a
non-secondary and non-cache zone, we should reset the keg reservation to 0
prior to draining the zone.

Reviewed by: markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27129


# 575a4437 19-Oct-2020 Ed Maste <emaste@FreeBSD.org>

uma: fix KTR message after r366840

Reported by: bz
Sponsored by: The FreeBSD Foundation


# f09cbea3 19-Oct-2020 Mark Johnston <markj@FreeBSD.org>

uma: Respect uk_reserve in keg_drain()

When a reserve of free items is configured for a zone, the reserve must
not be reclaimed under memory pressure. Modify keg_drain() to simply
respect the reserved pool.

While here remove an always-false uk_freef == NULL check (kegs that
shouldn't be drained should set _NOFREE instead), and make sure that the
keg_drain() KTR statement does not reference an uninitialized variable.

Reviewed by: alc, rlibby
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26772


# 1b2dcc8c 19-Oct-2020 Mark Johnston <markj@FreeBSD.org>

uma: Avoid depleting keg reserves when filling a bucket

zone_import() fetches a free or partially free slab from the keg and
then uses its items to populate an array, typically filling a bucket.
If a single allocation causes the keg to drop below its minimum reserve,
the inner loop ends. However, if the bucket is still not full and
M_USE_RESERVE is specified, the outer loop will continue to fetch items
from the keg.

If M_USE_RESERVE is specified and the number of free items is below the
reserved limit, we should return only a single item. Otherwise, if the
bucket size is larger than the reserve, all of the reserved items may
end up in a single per-CPU bucket, invisible to other CPUs.

Reviewed by: rlibby
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26771


# 6f3b523c 14-Oct-2020 Konstantin Belousov <kib@FreeBSD.org>

Avoid dump_avail[] redefinition.

Move dump_avail[] extern declaration and inlines into a new header
vm/vm_dumpset.h. This fixes default gcc build for mips.

Reviewed by: alc, scottph
Tested by: kevans (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26741


# 06d8bdcbf 02-Oct-2020 Mark Johnston <markj@FreeBSD.org>

uma: Use the bucket cache for cross-domain allocations

uma_zalloc_domain() allocates from the requested domain instead of
following a first-touch policy (the default for most zones). Currently
it is only used by malloc_domainset(), and consumers free returned items
with free(9) since r363834.

Previously uma_zalloc_domain() worked by always going to the keg for an
item. As a result, the use of UMA zone caches was unbalanced: we free
items to the caches, but always allocate from the keg, skipping the
caches.

Make some effort to allocate from the UMA caches when performing a
cross-domain allocation. This avoids blowing up the caches when
something is performing many transient allocations with
malloc_domainset().

Reported and tested by: dhw, glebius
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26427


# 5afdf5c1 02-Oct-2020 Mark Johnston <markj@FreeBSD.org>

uma: Use LIFO for non-SMR bucket caches

When SMR was introduced, zone_put_bucket() was changed to always place
full buckets at the end of the queue. However, it is generally
preferable to use recently used buckets since their items are more
likely to be resident in cache. So, for buckets that have no constraint
on item reuse, use a last-in-first-out ordering as we did before.
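
A sketch of the resulting insertion policy (field names from uma_int.h;
simplified):

    /* LIFO keeps reuse cache-hot; SMR zones need FIFO so that bucket
     * sequence numbers expire in order. */
    if ((zone->uz_flags & UMA_ZONE_SMR) == 0)
            STAILQ_INSERT_HEAD(&zdom->uzd_buckets, bucket, ub_link);
    else
            STAILQ_INSERT_TAIL(&zdom->uzd_buckets, bucket, ub_link);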

Reviewed by: rlibby
Tested by: dhw, glebius
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26426


# 952c8964 02-Oct-2020 Mark Johnston <markj@FreeBSD.org>

uma: Remove newlines from panic messages

Sponsored by: The FreeBSD Foundation


# 89d2fb14 08-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

Add interruptible variant of vm_wait(9), vm_wait_intr(9).

Also add msleep flags argument to vm_wait_doms(9).

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24652


# c3aa3bf9 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

vm: clean up empty lines in .c and .h files


# 791dda87 21-Aug-2020 Andrew Gallatin <gallatin@FreeBSD.org>

uma: record allocation failures due to zone limits

The zone limit mechanism was recently reworked, and
allocation failures due to limits being exceeded
were inadvertently no longer being recorded. This
would lead to, for example, mbuf allocation failures
not being indicated in netstat -m or vmstat -z.

Reviewed by: markj
Sponsored by: Netflix


# b21b022a 18-Aug-2020 Mark Johnston <markj@FreeBSD.org>

Revert r364310.

Some of the resulting fallout in CAM does not appear straightforward to
fix, so simply revert the commit for now in the absence of a better
solution.

Discussed with: mjg
Reported by: dhw


# 1921bb7b 17-Aug-2020 Gleb Smirnoff <glebius@FreeBSD.org>

With INVARIANTS panic immediately if M_WAITOK is requested in a
non-sleepable context. Previously only _sleep() would panic.
This will catch misuse of M_WAITOK at development stage rather
than at stress load stage.
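
One way to express such a check (a sketch under INVARIANTS, not
necessarily the commit's actual mechanism):

    /* Fail loudly at the allocation site instead of inside _sleep(). */
    if (flags & M_WAITOK)
            WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL,
                "%s: M_WAITOK in non-sleepable context", __func__);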

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D26027


# af32cefd 10-Aug-2020 Mark Johnston <markj@FreeBSD.org>

Check the UMA zone's full bucket cache before short-circuiting an alloc.

The global "bucketdisable" flag indicates that we are in a low memory
situation and should avoid allocating buckets. However, in the
allocation path we were checking it before the full bucket cache and
bailing even if the cache is non-empty. Defer the check so that we have
a shot at allocating from the cache.

This came up because M_NOWAIT allocations from the buf trie node zone
must always succeed. In one scenario, all of the preallocated trie
nodes were in the bucket list, and a new slab allocation could not
succeed due to a memory shortage. The short-circuiting caused an
allocation failure which triggered a panic.

Reported by: pho
Reviewed by: cem
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25980


# 96ad26ee 04-Aug-2020 Mark Johnston <markj@FreeBSD.org>

Remove free_domain() and uma_zfree_domain().

These functions were introduced before UMA started ensuring that freed
memory gets placed in domain-local caches. They no longer serve any
purpose since UMA now provides their functionality by default. Remove
them to simplify the kernel memory allocator interfaces a bit.

Reviewed by: cem, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25937


# 8c277118 28-Jun-2020 Mark Johnston <markj@FreeBSD.org>

Fix UMA's first-touch policy on systems with empty domains.

Suppose a thread is running on a CPU in a NUMA domain with no physical
RAM. When an item is freed to a first-touch zone, it ends up in the
cross-domain bucket. When the bucket is full, it gets placed in another
domain's bucket queue. However, when allocating an item, UMA will
always go to the keg upon a per-CPU cache miss because the empty
domain's bucket queue will always be empty. This means that a non-empty
domain's bucket queues can grow very rapidly on such systems. For
example, it can easily cause mbuf allocation failures when the zone
limit is reached.

Change cache_alloc() to follow a round-robin policy when running on an
empty domain.
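
The gist as a sketch (VM_DOMAIN_EMPTY() is from vm_pagequeue.h; the
round-robin helper is hypothetical):

    /* First-touch makes no sense on a memoryless domain. */
    domain = PCPU_GET(domain);
    if (VM_DOMAIN_EMPTY(domain))
            domain = zone_domain_next_rr(zone);     /* hypothetical */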

Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25355


# c8b0a88b 20-Jun-2020 Jeff Roberson <jeff@FreeBSD.org>

Clarify some language. Favor primary where both master and primary were
used in conjunction with secondary.


# 1c58c09f 29-May-2020 Mateusz Guzik <mjg@FreeBSD.org>

uma: hide item_domain under ifdef NUMA

Fixes build warnings on mips.


# 81302f1d 28-May-2020 Mark Johnston <markj@FreeBSD.org>

Fix boot on systems where NUMA domain 0 is unpopulated.

- Add vm_phys_early_add_seg(), complementing vm_phys_early_alloc(), to
ensure that segments registered during hammer_time() are placed in the
right domain. Otherwise, since the SRAT is not parsed at that point,
we just add them to domain 0, which may be incorrect and results in a
domain with only several MB worth of memory.
- Fix uma_startup1() to try allocating memory for zones from any domain.
If domain 0 is unpopulated, the allocation will simply fail, resulting
in a page fault slightly later during boot.
- Change _vm_phys_domain() to return -1 for addresses not covered by the
affinity table, and change vm_phys_early_alloc() to handle wildcard
domains. This is necessary on amd64, where the page array is dense
and pmap_page_array_startup() may allocate page table pages for
non-existent page frames.

Reported and tested by: Rafael Kitover <rkitover@gmail.com>
Reviewed by: cem (earlier version), kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25001


# dc2b3205 14-May-2020 Mark Johnston <markj@FreeBSD.org>

Allocate UMA per-CPU counters earlier.

Otherwise anything counted before SI_SUB_VM_CONF is discarded. However,
it is useful to be able to see stats from allocations done early during
boot.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D24756


# 54007ce8 07-Mar-2020 Mark Johnston <markj@FreeBSD.org>

Clean up uma_int.h a bit.

This makes it easier to write libkvm programs that access UMA data
structures.

- Remove a couple of unused slab functions and make others local to
uma_core.c. Similarly move SLAB_BITSETS, which affects the layout of
slab structures, to uma_core.c.
- Stop defining the slab structures under _KERNEL. There's no real
reason they can't be visible to userspace like the rest of UMA's
structures are.
- Group KEG_ASSERT_COLD with other keg macros.
- Convert an assertion about MAXMEMDOM to use _Static_assert.

No functional change intended.

Discussed with: jeff
Reviewed by: rlibby
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23980


# 3823a599 06-Mar-2020 Brooks Davis <brooks@FreeBSD.org>

Remove an apparently incorrect assertion.

Without this change mips64 fails to boot.

Discussed with: markj
Sponsored by: DARPA


# 7f746c9f 01-Mar-2020 Mateusz Guzik <mjg@FreeBSD.org>

vm: add debug to uma_zone_set_smr

Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D23902


# fe835cbf 27-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

A pair of performance improvements.

Swap buckets on free as well as alloc so that alloc is always the most
cache-hot data.

When selecting a zone domain for the round-robin bucket cache use the
local domain unless there is a severe imbalance. This does not affinitize
memory, only locks and queues.

Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D23824


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT.

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718


# eaa17d42 22-Feb-2020 Ryan Libby <rlibby@FreeBSD.org>

sys/vm: quiet -Wwrite-strings

Discussed with: kib
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D23796


# 0464f16e 22-Feb-2020 Mark Johnston <markj@FreeBSD.org>

Constify uma_zcache_create() and uma_zsecond_create()'s "name" argument.

It is already internally handled as a pointer to a const string, in
particular by uma_zcreate().

Fix indentation while here.

MFC after: 1 week


# 226dd6db 21-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Add an atomic-free tick moderated lazy update variant of SMR.

This enables very cheap read sections with free-to-use latencies and memory
overhead similar to epoch. On a recent AMD platform a read section cost
1ns vs 5ns for the default SMR. On Xeon the numbers should be more like 1
ns vs 11. The memory consumption should be proportional to the product
of the free rate and 2*1/hz while normal SMR consumption is proportional
to the product of free rate and maximum read section time.

While here refactor the code to make future additions more
straightforward.

Name the overall technique Global Unbound Sequences (GUS) and adjust some
comments accordingly. This helps distinguish discussions of the general
technique (SMR) vs this specific implementation (GUS).

Discussed with: rlibby, markj


# c6fd3e23 19-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Use per-domain locks for the bucket cache.

This gives much better concurrency when there are a large number of
cores per-domain and multiple domains. Avoid taking the lock entirely
if it will not be productive. ROUNDROBIN domains will have mixed
memory in each domain and will load balance to all domains.

While here refactor the zone/domain separation and bucket limits to
simplify callers.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D23673


# ed581bf6 16-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Add a simple accessor that returns the bytes of memory consumed by a zone.


# 70260874 16-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

UMA has become more particular about zone types. Use the right allocator
calls in uma_zwait().


# 6d88d784 15-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Slightly restructure uma_zalloc* to generate better code from clang and
reduce duplication among zalloc functions.

Reviewed by: markj
Discussed with: mjg
Differential Revision: https://reviews.freebsd.org/D23672


# cefc92e1 13-Feb-2020 Mark Johnston <markj@FreeBSD.org>

Update the zone-global count of cached items in bucket_cache_reclaim().

This was missed in r351673. The count is used to enforce cache limits,
which are rarely used.

Discussed with: jeff
Sponsored by: The FreeBSD Foundation


# 543117be 13-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Fix a case where ub_seq would fail to be set if the cross bucket was
flushed due to memory pressure.

Reviewed by: markj
Differential Revision: http://reviews.freebsd.org/D23614


# 3acb6572 12-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

Store offset into zpcpu allocations in the per-cpu area.

This shorten zpcpu_get and allows more optimizations.

Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D23570


# 4ab3aee8 11-Feb-2020 Mark Johnston <markj@FreeBSD.org>

Reduce lock hold time in keg_drain().

Maintain a count of free slabs in the per-domain keg structure and use
that to clear the free slab list in constant time for most cases. This
helps minimize lock contention induced by reclamation, in preparation
for proactive trimming of excesses of free memory.

Reviewed by: jeff, rlibby
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D23532


# bae55c4a 06-Feb-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: remove UMA_ZFLAG_CACHEONLY flag

UMA_ZFLAG_CACHEONLY was essentially the same thing as UMA_ZONE_VM, but
with a more confusing name. Remove the flag, make UMA_ZONE_VM an
inherit flag, and replace all references.

Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23516


# 33e5a1ea 04-Feb-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: multipage chicken switch

Add a switch to allow disabling multipage slabs, in order to facilitate
measuring memory usage and performance effects. The tunable
vm.debug.uma_multipage_slabs defaults to 1 and can be set to 0 to
disable. The name may change soon.

Reviewed by: markj (previous version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23487


# 27ca37ac 04-Feb-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: grow slabs to enforce minimum memory efficiency

Memory efficiency can be poor with awkward item sizes (e.g. 1/2 or 1
page size + epsilon). In order to achieve a minimum memory efficiency,
select a slab size with a potentially larger number of pages if it
yields a lower portion of waste.

This may mean using page_alloc instead of uma_small_alloc, which could
be more costly.
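
For instance, assuming 4 KB pages and a hypothetical 2720-byte item: a
single-page slab holds one item and wastes 1376 of 4096 bytes (34%),
while a three-page slab holds four items and wastes only 1408 of 12288
bytes (11%).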

Discussed with: jeff, mckusick
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23239


# ec0d8280 04-Feb-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: add UMA_ZONE_CONTIG, and a default contig_alloc

For now, copy the mbuf allocator.

Reviewed by: jeff, markj (previous version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23237


# 5ba16cf3 04-Feb-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: pcpu_page_free needs to startup_free pages from startup_alloc

After r357392, it is apparent that we do have some early-boot PCPU
zones. Make it so we can safely free pages from them if they are
actually used during early boot.

Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23496


# e84130a0 04-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Use literal bucket sizes for smaller buckets rather than the rounding
system. Small bucket sizes already pack well even if they are an odd
number of words. This prevents any potential new instances of the
problem fixed in r357463 as well as making the system easier to
understand.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D23494


# dc3915c8 03-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Use STAILQ instead of TAILQ for bucket lists. We only need FIFO behavior
and this is more space efficient.

Stop queueing recently used buckets to the head of the list. If the bucket
goes to a different processor the cache coherency will be more expensive.
We already try to encourage cache-hot behavior in the per-cpu layer.

Reviewed by: rlibby
Differential Revision: https://reviews.freebsd.org/D23493


# 36cb95c7 03-Feb-2020 Mark Johnston <markj@FreeBSD.org>

Disable the smallest UMA bucket size on 32-bit platforms.

With r357314, sizeof(struct uma_bucket) grew to 16 bytes on 32-bit
platforms, so BUCKET_SIZE(4) is 0. This resulted in the creation of a
bucket zone for buckets with zero capacity. A more general fix is
planned, but for now this bandaid allows 32-bit platforms to boot again.

PR: 243837
Discussed with: jeff
Reported by: pho, Jenkins via lwhsu
Tested by: pho
Sponsored by: The FreeBSD Foundation


# f96d4157 01-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Fix a bug in r356776 where the page allocator was not properly restored to
the percpu page allocator after it had been temporarily overridden by
startup_alloc.

Reported by: pho, bdragon


# 9e47b341 30-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Fix LINT build with MEMGUARD.


# d4665eaa 30-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Implement a safe memory reclamation feature that is tightly coupled with UMA.

This is in the same family of algorithms as Epoch/QSBR/RCU/PARSEC but is
a unique algorithm. This has 3x the performance of epoch in a write heavy
workload with less than half of the read side cost. The memory overhead
is significantly lessened by limiting the free-to-use latency. A synthetic
test uses 1/20th of the memory vs Epoch. There is significant further
discussion in the comments and code review.

This code should be considered experimental. I will write a man page after
it has settled. After further validation the VM will begin using this
feature to permit lockless page lookups.
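
A hedged read-side usage sketch (API names from sys/smr.h and vm/uma.h;
the zone and lookup table are illustrative):

    zone = uma_zcreate("obj", sizeof(struct obj), NULL, NULL, NULL,
        NULL, UMA_ALIGN_PTR, UMA_ZONE_SMR);

    /* Reader: memory freed via uma_zfree_smr() is not reused until
     * all readers that could observe it have left their sections. */
    smr_enter(uma_zone_get_smr(zone));
    obj = atomic_load_ptr(&table[i]);
    if (obj != NULL)
            consume(obj);
    smr_exit(uma_zone_get_smr(zone));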

Both markj and cperciva tested on arm64 at large core counts to verify
fences on weaker ordering architectures. I will commit a stress testing
tool in a follow-up.

Reviewed by: mmacy, markj, rlibby, hselasky
Discussed with: sbahara
Differential Revision: https://reviews.freebsd.org/D22586


# 8d1c459a 22-Jan-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: fix zone domain overlaying pcpu cache with disabled cpus

UMA zone structures have two arrays at the end which are sized according
to the machine: an array of CPU count length, and an array of NUMA
domain count length. The CPU counting was wrong in the case where some
CPUs are disabled (when mp_ncpus != mp_maxid + 1), and this caused the
second array to be overlaid with the first.
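
The sizing rule at issue, as a sketch: the per-CPU array must be
dimensioned by the highest CPU ID, not the CPU count:

    /* mp_maxid + 1 slots, since CPU IDs may be sparse when some CPUs
     * are disabled (mp_ncpus < mp_maxid + 1). */
    size = sizeof(struct uma_zone) +
        (mp_maxid + 1) * sizeof(struct uma_cache);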

Reported by: olivier
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23318


# 7e240677 22-Jan-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: report leaks more accurately

Previously UMA had some false negatives in the leak report at keg
destruction time, where it only reported leaks if there were free items
in the slab layer (rather than allocated items), which notably would not
be true for single-item slabs (large items). Now, report a leak if
there are any allocated pages, and calculate and report the number of
allocated items rather than free items.

Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23275


# 530cc6a2 22-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Some architectures with DMAP still consume boot kva. Simplify the test for
claiming kva in uma_startup2() to handle this.

Reported by: bdragon


# 20526802 18-Jan-2020 Andrew Gallatin <gallatin@FreeBSD.org>

pcpu_page_alloc: guard against empty NUMA domains

Some systems, such as higher-end Threadripper, may have
NUMA domains with no physical memory. Don't allocate
from these domains.

This fixes a "panic: vm_wait in early boot" on my 2990WX desktop

Reviewed by: jeff
Sponsored by: Netflix


# a81c400e 15-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Simplify VM and UMA startup by eliminating boot pages. Instead use careful
ordering to allocate early pages in the same way boot pages were but only
as needed. After the KVA allocator has started up we allocate the KVA that
we consumed during boot. This also makes the boot pages freeable since they
have vm_page structures allocated with the rest of memory.

Parts of this patch were written and tested by markj.

Reviewed by: glebius, markj
Differential Revision: https://reviews.freebsd.org/D23102


# 9b8db4d0 13-Jan-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: split slabzone into two sizes

By allowing more items per slab, we can improve memory efficiency for
small allocs. If we were just to increase the bitmap size of the
slabzone, we would then waste slabzone memory. So, split slabzone into
two zones, one especially for 8-byte allocs (512 per slab). The
practical effect should be reduced memory usage for counter(9).

Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23149


# e63a1c2f 13-Jan-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: fixup some ktr messages

Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23148


# a314aba8 11-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vm: add missing CLTFLAG_MPSAFE annotations

This covers all vm/* files.


# 860bb7a0 09-Jan-2020 Mark Johnston <markj@FreeBSD.org>

UMA: Don't destroy zones after the system shutdown process starts.

Some kernel subsystems, notably ZFS, will destroy UMA zones from a
shutdown eventhandler. This causes the zone to be drained. For slabs
that are mapped into KVA this can be very expensive and so it needlessly
delays the shutdown process.

Add a new state to the "booted" variable, BOOT_SHUTDOWN. Once
kern_reboot() starts invoking shutdown handlers, turn uma_zdestroy()
into a no-op, provided that the zone does not have a custom finalization
routine.

PR: 242427
Reviewed by: jeff, kib, rlibby
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23066


# 4a8b575c 08-Jan-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: unify layout paths and improve efficiency

Unify the keg layout selection paths (keg_small_init, keg_large_init,
keg_cachespread_init), and slightly improve memory efficiency by:
- using the padding of the final item to store the slab header,
- not going OFFPAGE if we have a choice unless it improves efficiency.

Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23048


# 54c5ae80 08-Jan-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: reorganize flags

- Garbage collect UMA_ZONE_PAGEABLE & UMA_ZONE_STATIC.
- Move flag VTOSLAB from public to private.
- Introduce public NOTPAGE flag and make HASH private.
- Introduce public NOTOUCH flag and make OFFPAGE private.
- Update man page.

The net effect of this should be to make the contract with clients more
clear. Clients should choose constraints, UMA will figure out how to
implement them. This also breaks the confusing double meaning of
OFFPAGE.

Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23016


# 79c9f942 05-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Fix uma boot pages calculations on NUMA machines that also don't have
UMA_MD_SMALL_ALLOC. This is unusual but not impossible. Fix the alignment
of zones while here. This was already correct because uz_cpu strongly
aligned the zone structure but the specified alignment did not match
reality and involved redundant defines.

Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D23046


# bfb6b7a1 05-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

The fix in r356353 was insufficient. Not every architecture returns 0 for
EARLY_COUNTER. Only amd64 seems to.

Suggested by: markj
Reported by: lwhsu
Reviewed by: markj
PR: 243117


# 31c251a0 04-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Fix an assertion introduced in r356348. On architectures without
UMA_MD_SMALL_ALLOC vmem has a more complicated startup sequence that
violated the new assert. Resolve this by rewriting the COLD asserts to
look at the per-cpu allocation counts for evidence of api activity.

Discussed with: rlibby
Reviewed by: markj
Reported by: lwhsu


# dfe13344 04-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

UMA NUMA flag day. UMA_ZONE_NUMA was a source of confusion. Make the names
more consistent with other NUMA features as UMA_ZONE_FIRSTTOUCH and
UMA_ZONE_ROUNDROBIN. The system will now select a default depending
on kernel configuration. API users need only specify one if they want to
override the default.

Remove the UMA_XDOMAIN and UMA_FIRSTTOUCH kernel options and key only off
of NUMA. XDOMAIN is now fast enough in all cases to enable whenever NUMA
is.

Reviewed by: markj
Discussed with: rlibby
Differential Revision: https://reviews.freebsd.org/D22831


# 91d947bf 04-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Sort cross-domain frees into per-domain buckets before inserting these
onto their respective bucket lists. This is a several order of magnitude
improvement in contention on the keg lock under heavy free traffic while
requiring only an additional bucket per-domain worth of memory.

Discussed with: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22830


# 8b987a77 03-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Use per-domain keg locks. This provides both a lock and separate space
accounting for each NUMA domain. Independent keg domain locks are important
with cross-domain frees. Hashed zones are non-numa and use a single keg
lock to protect the hash table.

Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22829


# 727c6918 03-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Use a separate lock for the zone and keg. This provides concurrency
between populating buckets from the slab layer and fetching full buckets
from the zone layer. Eliminate some nonsense locking patterns where
we lock to fetch a single variable.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22828


# 4bd61e19 03-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Use atomics for the zone limit and sleeper count. This relies on the
sleepq to serialize sleepers. This patch retains the existing sleep/wakeup
paradigm to limit 'thundering herd' wakeups. It resolves a missing wakeup
in one case but otherwise should be bug for bug compatible. In particular,
there are still various races surrounding adjusting the limit via sysctl
that are now documented.

Discussed with: markj
Reviewed by: rlibby
Differential Revision: https://reviews.freebsd.org/D22827


# cc7ce83a 25-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Further reduce the cacheline footprint of fast allocations by duplicating
the zone size and flags fields in the per-cpu caches. This allows fast
allocations to proceed only touching the single per-cpu cacheline and
simplifies the common case when no ctor/dtor is specified.

Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22826


# 376b1ba3 25-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Optimize fast path allocations by storing bucket headers in the per-cpu
cache area. This allows us to check on bucket space for all per-cpu
buckets with a single cacheline access and fewer branches.

Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22825


# 3639ac42 25-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Fix a bug with NUMA domains introduced in r339686. When M_NOWAIT is
specified there was no loop termination condition in keg_fetch_slab().

Reported by: pho
Reviewed by: markj


# 815db204 13-Dec-2019 Ryan Libby <rlibby@FreeBSD.org>

uma dbg: flexible size for slab debug bitset too

Recently (r355315) the size of the struct uma_slab bitset field us_free
became dynamic instead of conservative. Now, make the debug bitset
size dynamic too. The debug bitset is INVARIANTS-only, so in fact we
don't care too much about the space savings that results from this, but
enabling minimally-sized slabs on INVARIANTS builds is still important
in order to be able to test new slab layouts effectively.

Reviewed by: jeff (previous version), markj (previous version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22759


# 325c4ced 13-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Restore the reservation of boot pages for bucket zones after r355707.

uma_startup2() sets booted = BOOT_BUCKETS after calling bucket_init(),
but before that assignment, startup_alloc() will use pages from the
reserved pool, so the bucket zones themselves are still allocated using
startup pages.

Reviewed by: rlibby
Reported by: Jenkins via lwhsu
Differential Revision: https://reviews.freebsd.org/D22797


# d82c8ffb 13-Dec-2019 Ryan Libby <rlibby@FreeBSD.org>

Revert r355706 & r355710

The quick fix didn't work. I'll sort it out tomorrow.

Revert r355710: "libmemstat: unbreak build"
Revert r355706: "uma dbg: flexible size for slab debug bitset too"


# f7af5015 13-Dec-2019 Ryan Libby <rlibby@FreeBSD.org>

uma: report slab efficiency

Reviewed by: jeff
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22766


# 3182660a 13-Dec-2019 Ryan Libby <rlibby@FreeBSD.org>

uma: delay bucket_init() until we might actually enable buckets

This helps with a bootstrapping problem in upcoming work.

We don't first enable buckets until uma_startup2(), so we can delay
bucket creation until then. The other two paths to bucket_enable() are
both later, one in the pageout daemon (SI_SUB_KTHREAD_PAGE vs SI_SUB_VM)
and one in uma_timeout() (first activated in uma_startup3()). Note that
although some bucket functions are accessible before uma_startup2()
(e.g. bucket_select() in zone_ctor()), none of them inspect ubz_zone.

Discussed with: jeff
Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22765


# 7508f15f 13-Dec-2019 Ryan Libby <rlibby@FreeBSD.org>

uma dbg: flexible size for slab debug bitset too

Recently (r355315) the size of the struct uma_slab bitset field us_free
became dynamic instead of conservative. Now, make the debug bitset
size dynamic too. The debug bitset is INVARIANTS-only, so in fact we
don't care too much about the space savings that results from this, but
enabling minimally-sized slabs on INVARIANTS builds is still important
in order to be able to test new slab layouts effectively.

Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22759


# 6d204a6a 10-Dec-2019 Ryan Libby <rlibby@FreeBSD.org>

uma: pretty print zone flags sysctl

Requested by: jeff
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22748


# 3b490537 07-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Fix two problems with r355149. The sysctl name collision code assumed that
zones would never be freed. In the case of tmpfs this was not true. While
here, test for the right bit to disable the keg-related sysctls for zones
that don't have kegs.

Reported by: pho
Reviewed by: rlibby
Differential Revision: https://reviews.freebsd.org/D22655


# 1e0701e1 07-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Use a variant slab structure for offpage zones. This saves space in
embedded slabs but also is an opportunity to tidy up code and add
accessor inlines.

Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22609


# b75c4efc 04-Dec-2019 Andrew Turner <andrew@FreeBSD.org>

Fix the signature for zone_import and zone_release

These are cast to uma_import and uma_release functions. Use the signature
for these in the zone functions.

This was found with an experimental Kernel CFI. It will complain if the
signature is different than what a function pointer expects. The
simplest way to fix these is to correct the signature.

Reviewed by: rlibby
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D22671


# 9b78b1f4 02-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Use a precise bit count for the slab free items in UMA. This significantly
shrinks embedded slab structures.

Reviewed by: markj, rlibby (prior version)
Differential Revision: https://reviews.freebsd.org/D22584


# 6d6a03d7 28-Nov-2019 Jeff Roberson <jeff@FreeBSD.org>

Handle large mallocs by going directly to kmem. Taking a detour through
UMA does not provide any additional value.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22563


# 584061b4 28-Nov-2019 Jeff Roberson <jeff@FreeBSD.org>

Garbage collect the mostly unused us_keg field. Use appropriately named
union members in vm_page.h to store the zone and slab. Remove some nearby
dead code.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22564


# 35ec24f3 27-Nov-2019 Ryan Libby <rlibby@FreeBSD.org>

uma: move sysctl vm.uma defn out from under INVARIANTS

Fix non-INVARIANTS builds after r355149.

Reported by: Michael Butler <imb@protected-networks.net>
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22588


# 20a4e154 27-Nov-2019 Jeff Roberson <jeff@FreeBSD.org>

Implement a sysctl tree for uma zones to assist in debugging and provide
more statistics than are exported via the ABI-stable vmstat interface.
Rename uz_count to uz_bucket_size because even I was confused by the
name after returning to the source years later.

Reviewed by: rlibby
Differential Revision: https://reviews.freebsd.org/D22554


# 0a81b439 27-Nov-2019 Jeff Roberson <jeff@FreeBSD.org>

Refactor uma_zfree_arg into several functions to make control flow more
clear and icache usage cleaner.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22491


# ca293436 27-Nov-2019 Ryan Libby <rlibby@FreeBSD.org>

uma: trash memory when ctor/dtor supplied too

On INVARIANTS kernels, UMA has a use-after-free detection mechanism.
This mechanism previously required that all of the ctor/dtor/uminit/fini
arguments to uma_zcreate() be NULL in order to function. Now, it only
requires that uminit and fini be NULL; now, the trash ctor and dtor will
be called in addition to any supplied ctor or dtor.

Also do a little refactoring for readability of the resulting logic.

This enables use-after-free detection for more zones, and will allow for
simplification of some callers that worked around the previous
restriction (see kern_mbuf.c).

Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20722
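
For illustration, a minimal sketch of a zone that now gets trash checking
even though it supplies its own ctor/dtor; struct foo, foo_ctor and
foo_dtor are hypothetical names:

struct foo { int f_x; };
static int foo_ctor(void *mem, int size, void *arg, int flags) { return (0); }
static void foo_dtor(void *mem, int size, void *arg) { }

/* uminit and fini are NULL, so on INVARIANTS kernels the trash ctor and
 * dtor now run in addition to foo_ctor and foo_dtor. */
zone = uma_zcreate("foo", sizeof(struct foo), foo_ctor, foo_dtor,
    NULL, NULL, UMA_ALIGN_PTR, 0);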


# beb8beef 26-Nov-2019 Jeff Roberson <jeff@FreeBSD.org>

Refactor uma_zalloc_arg(). It is a mess of gotos and code which doesn't
make sense after many partial refactors. Attempt to make a smaller cache
footprint for the fast path.

Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22470


# 003cf08b 22-Nov-2019 Mark Johnston <markj@FreeBSD.org>

Revise the page cache size policy.

In r353734 the use of the page caches was limited to systems with a
relatively large amount of RAM per CPU. This was to mitigate some
issues reported with the system not able to keep up with memory pressure
in cases where it had been able to do so prior to the addition of the
direct free pool cache. This change re-enables those caches.

The change modifies uma_zone_set_maxcache(), which was introduced
specifically for the page cache zones. Rather than using it to limit
only the full bucket cache, have it also set uz_count_max to provide an
upper bound on the per-CPU cache size that is consistent with the number
of items requested. Remove its return value since it has no use.

Enable the page cache zones unconditionally, and limit them to 0.1% of
the domain's pages. The limit can be overridden by the
vm.pgcache_zone_max tunable as before.

Change the item size parameter passed to uma_zcache_create() to the
correct size, and stop setting UMA_ZONE_MAXBUCKET. This allows the page
cache buckets to be adaptively sized, like the rest of UMA's caches.
This also causes the initial bucket size to be small, so only systems
which benefit from large caches will get them.

Reviewed by: gallatin, jeff
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22393


# 71353f7a 19-Nov-2019 Jeff Roberson <jeff@FreeBSD.org>

When we set OFFPAGE to limit fragmentation we should also set VTOSLAB
so that we avoid the hashtables. The hashtable is now only required if
a zone is created with OFFPAGE specified initially, not internally. This
flag signals to UMA that it can't touch the allocated memory and so
can't store a slab pointer in the containing page.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22453


# 08034d10 10-Nov-2019 Konstantin Belousov <kib@FreeBSD.org>

Include cache zones into zone_foreach() where appropriate.

r354367 is reverted since it is subsumed by this more complete approach.

Suggested by: markj
Reviewed by: alc, glebius, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D22242


# 432fc36d 05-Nov-2019 Konstantin Belousov <kib@FreeBSD.org>

Switch cache zones from early counters to real implementation.

The early counter mock can only be used on the BSP on amd64; when APs try
to update it, random memory corruption results.

N.B. This is a temporary patch to plug the corruption for now, while
a proper solution for handling cache zones in zone_foreach() is being
developed.

In collaboration with: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation, Mellanox Technologies


# 1de9724e 22-Oct-2019 Mark Johnston <markj@FreeBSD.org>

Avoid reloading bucket pointers in uma_vm_zone_stats().

The correctness of per-CPU cache accounting in that function is
dependent on reading per-CPU pointers exactly once. Ensure that
the compiler does not emit multiple loads of those pointers.

Reported and tested by: pho
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22081


# 0223790f 11-Oct-2019 Conrad Meyer <cem@FreeBSD.org>

Fix braino in r353429

cy@ points out that I got parameter order backwards between definition and
invocation of the helper function. He is totally correct. The earlier
version of this patch predated the XFree column so this is one I introduced,
rather than the original author.

Submitted by: cy
Reported by: cy
X-MFC-With: r353429


# 46d70077 10-Oct-2019 Conrad Meyer <cem@FreeBSD.org>

ddb: Add CSV option, sorting to 'show (malloc|uma)'

Add /i option for machine-parseable CSV output. This allows ready
copy/pasting into more sophisticated tooling outside of DDB.

Add total zone size ("Memory Use") as a new column for UMA.

For both, sort the displayed list on size (print the largest zones/types
first). This is handy for quickly diagnosing "where has my memory gone?" at
a high level.

Submitted by: Emily Pettigrew <Emily.Pettigrew AT isilon.com> (earlier version)
Sponsored by: Dell EMC Isilon


# 08cfa56e 01-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Extend uma_reclaim() to permit different reclamation targets.

The page daemon periodically invokes uma_reclaim() to reclaim cached
items from each zone when the system is under memory pressure. This
is important since the size of these caches is unbounded by default.
However it also results in bursts of high latency when allocating from
heavily used zones as threads miss in the per-CPU caches and must
access the keg in order to allocate new items.

With r340405 we maintain an estimate of each zone's usage of its
(per-NUMA domain) cache of full buckets. Start making use of this
estimate to avoid reclaiming the entire cache when under memory
pressure. In particular, introduce TRIM, DRAIN and DRAIN_CPU
verbs for uma_reclaim() and uma_zone_reclaim(). When trimming, only
items in excess of the estimate are reclaimed. Draining a zone
reclaims all of the cached full buckets (the previous behaviour of
uma_reclaim()), and may further drain the per-CPU caches in extreme
cases.

Now, when under memory pressure, the page daemon will trim zones
rather than draining them. As a result, heavily used zones do not incur
bursts of bucket cache misses following reclamation, but large, unused
caches will be reclaimed as before.

Reviewed by: jeff
Tested by: pho (an earlier version)
MFC after: 2 months
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D16667
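
For illustration, a sketch of how a caller might use the new verbs,
assuming the uma.h spellings UMA_RECLAIM_TRIM, UMA_RECLAIM_DRAIN and
UMA_RECLAIM_DRAIN_CPU:

/* Trim all zones down to their working-set estimates. */
uma_reclaim(UMA_RECLAIM_TRIM);

/* Drain one zone's cache of full buckets entirely. */
uma_zone_reclaim(zone, UMA_RECLAIM_DRAIN);

/* Additionally flush the per-CPU caches; the most aggressive verb. */
uma_zone_reclaim(zone, UMA_RECLAIM_DRAIN_CPU);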


# b48d4efe 25-Aug-2019 Mark Johnston <markj@FreeBSD.org>

Handle UMA_ANYDOMAIN in kstack_import().

The kernel thread stack zone performs first-touch allocations by
default, and must handle the case where the local memory domain
is empty. For most UMA zones this is handled in the keg layer,
but cache zones currently must implement a policy for this case.
Simply use a round-robin policy if UMA_ANYDOMAIN is passed.

Reported and tested by: bcran
Reviewed by: kib
Sponsored by: The FreeBSD Foundation


# eda1b016 06-Aug-2019 Jeff Roberson <jeff@FreeBSD.org>

Implement a MINBUCKET zone flag so we can use minimal caching on zones that
may be expensive to cache.

Reviewed by: markj, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20930


# c1685086 06-Aug-2019 Jeff Roberson <jeff@FreeBSD.org>

Add two new kernel options to control memory locality on NUMA hardware.
- UMA_XDOMAIN enables an additional per-cpu bucket for freed memory that
was freed on a different domain from where it was allocated. This is
only used for UMA_ZONE_NUMA (first-touch) zones.
- UMA_FIRSTTOUCH sets the default UMA policy to be first-touch for all
zones. This tries to maintain locality for kernel memory.

Reviewed by: gallatin, alc, kib
Tested by: pho, gallatin
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20929


# 88ea538a 07-Jun-2019 Mark Johnston <markj@FreeBSD.org>

Replace uses of vm_page_unwire(m, PQ_NONE) with vm_page_unwire_noq(m).

These calls are not the same in general: the former will dequeue the
page if it is enqueued, while the latter will just leave it alone. But,
all existing uses of the former apply to unmanaged pages, which are
never enqueued in the first place. No functional change intended.

Reviewed by: kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20470


# 3b2f2cb8 06-Jun-2019 Alexander Motin <mav@FreeBSD.org>

Allow UMA hash tables to expand faster than 2x in 20 seconds.

ZFS ABD allocates tons of 4KB chunks via UMA, requiring huge hash tables.
With an initial hash table size of only 32 elements it takes ~20 expansions,
or ~400 seconds, to adapt to handling 220GB of ZFS ARC. During that time not
only is the hash table highly inefficient, but each of those expansions also
takes significant time with the lock held, blocking operation.

On my test system with 256GB of RAM and a ZFS pool of 28 HDDs this change
reduces the time needed to read 240GB for the first time from ~300-400s,
during which the system is quite busy and unresponsive, to only ~150s with
light CPU load and just 5 sub-second CPU spikes to expand the hash table.

MFC after: 2 weeks
Sponsored by: iXsystems, Inc.


# fbd95859 06-Jun-2019 Mark Johnston <markj@FreeBSD.org>

Add sysctls for uma_kmem_{limit,total}.

Reviewed by: alc, dougm, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20514


# 058f0f74 06-Jun-2019 Mark Johnston <markj@FreeBSD.org>

Remove the volatile qualifer from uma_kmem_total.

No functional change intended.

Reviewed by: alc, dougm, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20514


# 4a9f6ba7 29-May-2019 Gleb Smirnoff <glebius@FreeBSD.org>

In r343857 the comment referred to here was moved to uma_vm_zone_stats().


# 323ad386 11-Apr-2019 Tycho Nightingale <tychon@FreeBSD.org>

for a cache-only zone the destructor tries to destroy a non-existent keg

Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D19835


# 6929b7d1 11-Feb-2019 Pedro F. Giffuni <pfg@FreeBSD.org>

UMA: unsign some variables related to allocation in hash_alloc().

As a followup to r343673, unsign some variables related to allocation,
since the hashsize cannot be negative. This gives a bit more space to
handle bigger allocations and avoids some implicit casting.

While here also unsign uh_hashmask, it makes little sense to keep that
signed.

MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19148


# ad66f958 06-Feb-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Now that there is only one way to allocate a slab, remove uz_slab method.

Discussed with: jeff


# b47acb0a 06-Feb-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Report cache zones in the UMA stats sysctl that 'vmstat -z' uses. This
should have been part of r251826.


# 59568a0e 01-Feb-2019 Alexander Motin <mav@FreeBSD.org>

Fix integer math overflow in UMA hash_alloc().

512GB of ZFS ABD ARC means an abd_chunk zone of 128M 4KB items. To manage
them UMA tries to allocate a 2GB hash table, whose size does not fit into
an int variable, causing a later allocation failure that makes the ARC
shrink back below 512GB, never letting it use more RAM. With this change I
easily reached a >700GB ARC size on a 768GB RAM machine.

MFC after: 1 week
Sponsored by: iXsystems, Inc.
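
The arithmetic, spelled out: 128M items is 2^27, and a 2GB table implies
16 (2^4) bytes of hash state per item, so the product is 2^31 bytes, one
past INT_MAX. A hypothetical sketch of the overflow and the class of fix:

/* 2^27 * 2^4 = 2^31: overflows a signed 32-bit int. */
int bad = 128 * 1024 * 1024 * 16;
/* Computed in a wide unsigned type, 2GB fits fine. */
size_t ok = (size_t)128 * 1024 * 1024 * 16;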


# 37125720 31-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

In zone_alloc_bucket() the max argument was calculated based on uz_count.
Then bucket_alloc() also selects the bucket size based on uz_count. However,
since the zone lock is dropped, uz_count may decrease. In this case max may
be greater than ub_entries, which would result in writing beyond the end
of the allocation.

Reported by: pho


# 86220393 23-Jan-2019 Mark Johnston <markj@FreeBSD.org>

Correct uma_prealloc()'s use of domainset iterators after r339925.

The iterator should be reinitialized after every successful slab
allocation. A request to advance the iterator is interpreted as
an allocation failure, so a sufficiently large preallocation would
cause the iterator to believe that all domains were exhausted,
resulting in a sleep with the keg lock held. [1]

Also, keg_alloc_slab() should pass the unmodified wait flag to the
item initialization routine, which may use it to perform allocations
from other zones.

Reported and tested by: slavah
Diagnosed by: kib [1]
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# e7e4bcd8 15-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

style(9): break long line.


# f8c86a5f 15-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Remove a harmless leftover from code that cycled over a zone's kegs: just
use + instead of +=. There is no functional change.


# bb45b411 15-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Only do uz_items accounting for zones that have a limit set in uz_max_items.
This reduces the amount of locking required for these zones.

Also, for cache-only zones (UMA_ZFLAG_CACHE) the uz_items accounting wasn't
correct at all, since they may allocate items directly from their backing
store and then free them via UMA, underflowing uz_items.

Tested by: pho


# 2efcc8cb 15-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Make uz_allocs, uz_frees and uz_fails counter(9). This removes some
atomic updates and reduces amount of data protected by zone lock.

During startup point these fields to EARLY_COUNTER. After startup
allocate them for all early zones.

Tested by: pho


# 5a8eee2b 14-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Fix compilation on 32-bit.


# bb15d1c7 14-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

o Move zone limit from keg level up to zone level. This means that now
two zones sharing a keg may have different limits. Now this is going
to work:

zone = uma_zcreate();
uma_zone_set_max(zone, limit);
zone2 = uma_zsecond_create(zone);
uma_zone_set_max(zone2, limit2);

Kegs no longer have a uk_maxpages field, but zones have uz_items. When
set, it may be rounded up to the minimum possible CPU bucket cache size.
For small limits the bucket cache can also be reconfigured to be smaller.
The uz_items counter is updated whenever items transition from the keg to
a bucket cache or directly to a consumer. If a zone has uz_max_items set
and it is reached, then we are going to sleep.

o Since the new limits don't play well with multi-keg zones, remove them.
The idea of multi-keg zones was introduced exactly 10 years ago, and never
had any practical usage. In discussion with Jeff we came to a wild
agreement that if we ever want to reintroduce the idea of a smart allocator
that would be able to choose between two (or more) totally different
backing stores, that choice should be made one level higher than UMA,
e.g. in malloc(9) or in mget() or whatever, and the choice should be
controlled by the caller.

o Sleeping code is improved to count the number of sleepers and wake them
one by one, to avoid the thundering herd problem.

o Flag UMA_ZONE_NOBUCKETCACHE is removed; instead the uma_zone_set_maxcache()
KPI is added. Having no bucket cache basically means setting maxcache to 0.

o Now with many fields added and many removed (no multi-keg zones!) make
sure that struct uma_zone is perfectly aligned.

Reviewed by: markj, jeff
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D17773
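
A minimal usage sketch of the new KPI (its original return value was later
removed, per the 22-Nov-2019 entry above):

/* No bucket cache at all, as UMA_ZONE_NOBUCKETCACHE used to request. */
uma_zone_set_maxcache(zone, 0);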


# 0b2e3aea 28-Nov-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Fix yet another edge case in uma_startup_count(). If the zone size fits
into several pages but leaves no space for struct uma_slab at the end, we
miscalculate the number of pages by one. Mimic the keg_large_init() math
exactly here to cover that problem.

Reported by: gallatin


# 3d5e3df7 28-Nov-2018 Gleb Smirnoff <glebius@FreeBSD.org>

For non-offpage zones the slab is placed at the end of the page. The keg's
uk_pgoff is calculated to guarantee that struct uma_slab is placed at
pointer-size alignment. The calculation of the real struct uma_slab size is
done in keg_ctor() and yet again in keg_large_init(), to check whether we
need an extra page. This calculation can actually be performed at compile
time.

- Add SIZEOF_UMA_SLAB macro to calculate size of struct uma_slab placed at
an end of a page with alignment requirement.
- Use SIZEOF_UMA_SLAB in keg_ctor() and in keg_large_init(). This is not
a functional change.
- Use SIZEOF_UMA_SLAB in UMA_SLAB_SPACE definition and in keg_small_init().
This is a potential bugfix, but in reality I don't think there are any
systems affected, since compiler aligns struct uma_slab anyway.


# 0f9b7bf3 13-Nov-2018 Mark Johnston <markj@FreeBSD.org>

Add accounting to per-domain UMA full bucket caches.

In particular, track the current size of the cache and maintain an
estimate of its working set size. This will be used to decide how
much to shrink various caches when the kernel attempts to reclaim
pages. As a secondary effect, it makes statistics aggregation (done
by, e.g., vmstat -z) cheaper since sysctl_vm_zone_stats() no longer
needs to iterate over lists of cached buckets.

Discussed with: alc, glebius, jeff
Tested by: pho (previous version)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D16666


# 9978bd99 30-Oct-2018 Mark Johnston <markj@FreeBSD.org>

Add malloc_domainset(9) and _domainset variants to other allocator KPIs.

Remove malloc_domain(9) and most other _domain KPIs added in r327900.
The new functions allow the caller to specify a general NUMA domain
selection policy, rather than specifically requesting an allocation from
a specific domain. The latter policy tends to interact poorly with
M_WAITOK, resulting in situations where a caller is blocked indefinitely
because the specified domain is depleted. Most existing consumers of
the _domain KPIs are converted to instead use a DOMAINSET_PREF() policy,
in which we fall back to other domains to satisfy the allocation
request.

This change also defines a set of DOMAINSET_FIXED() policies, which
only permit allocations from the specified domain.

Discussed with: gallatin, jeff
Reported and tested by: pho (previous version)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17418
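
For illustration, a sketch of the replacement KPI with a preferred-domain
policy (M_DEVBUF and the surrounding variable names are arbitrary here):

/* Prefer 'domain', but fall back to other domains rather than block
 * indefinitely on a depleted one. */
buf = malloc_domainset(size, M_DEVBUF, DOMAINSET_PREF(domain), M_WAITOK);
free(buf, M_DEVBUF);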


# 920239ef 30-Oct-2018 Mark Johnston <markj@FreeBSD.org>

Fix some problems that manifest when NUMA domain 0 is empty.

- In uma_prealloc(), we need to check for an empty domain before the
first allocation attempt, not after. Fix this by switching
uma_prealloc() to use a vm_domainset iterator, which addresses the
secondary issue of using a signed domain identifier in round-robin
iteration.
- Don't automatically create a page daemon for domain 0.
- In domainset_empty_vm(), recompute ds_cnt and ds_order after
excluding empty domains; otherwise we may frequently specify an empty
domain when calling in to the page allocator, wasting CPU time.
Convert DOMAINSET_PREF() policies for empty domains to round-robin.
- When freeing bootstrap pages, don't count them towards the per-domain
total page counts for now: some vm_phys segments are created before
the SRAT is parsed and are thus always identified as being in domain 0
even when they are not. Then, when bootstrap pages are freed, they
are added to a domain that we had previously thought was empty. Until
this is corrected, we simply exclude them from the per-domain page
count.

Reported and tested by: Rajesh Kumar <rajfbsd@gmail.com>
Reviewed by: gallatin
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17704


# 194a979e 24-Oct-2018 Mark Johnston <markj@FreeBSD.org>

Use a vm_domainset iterator in keg_fetch_slab().

Previously, it used a hand-rolled round-robin iterator. This meant that
the minskip logic in r338507 didn't apply to UMA allocations, and also
meant that we would call vm_wait() for individual domains rather than
permitting an allocation from any domain with sufficient free pages.

Discussed with: jeff
Tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17420


# 81c0d72c 22-Oct-2018 Gleb Smirnoff <glebius@FreeBSD.org>

If we lost a race or were migrated during bucket allocation for the per-CPU
cache, then we put the new bucket on the generic bucket cache. However, the
code didn't honor the UMA_ZONE_NOBUCKETCACHE flag, so potentially we could
start a cache on a zone that clearly forbids it. Fix this.

Reviewed by: markj


# 30c5525b 01-Oct-2018 Andrew Gallatin <gallatin@FreeBSD.org>

Allow empty NUMA memory domains to support Threadripper2

The AMD Threadripper 2990WX is basically a slightly crippled Epyc.
Rather than having 4 memory controllers, one per NUMA domain, it has
only 2 memory controllers enabled. This means that only 2 of the
4 NUMA domains can be populated with physical memory, and the
others are empty.

Add support to FreeBSD for empty NUMA domains by:

- creating empty memory domains when parsing the SRAT table,
rather than failing to parse the table
- not running the pageout daemon threads in empty domains
- adding defensive code to UMA to avoid allocating from empty domains
- adding defensive code to cpuset to avoid binding to an empty domain
Thanks to Jeff for suggesting this strategy.

Reviewed by: alc, markj
Approved by: re (gjb@)
Differential Revision: https://reviews.freebsd.org/D1683


# 26fe2217 18-Sep-2018 Mark Johnston <markj@FreeBSD.org>

Only update the domain cursor once in keg_fetch_slab().

We drop the keg lock when we go to actually allocate the slab, allowing
other threads to advance the cursor. This can cause us to exit the
round-robin loop before having attempted allocations from all domains,
resulting in a hang during a subsequent blocking allocation attempt from
a depleted domain.

Reported and tested by: Jan Bramkamp <crest@bultmann.eu>
Reviewed by: alc, cem
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17209


# 19fa89e9 25-Aug-2018 Mark Murray <markm@FreeBSD.org>

Remove the Yarrow PRNG algorithm option in accordance with due notice
given in random(4).

This includes updating of the relevant man pages, and no-longer-used
harvesting parameters.

Ensure that the pseudo-unit-test still does something useful, now also
with the "other" algorithm instead of Yarrow.

PR: 230870
Reviewed by: cem
Approved by: so(delphij,gtetlow)
Approved by: re(marius)
Differential Revision: https://reviews.freebsd.org/D16898


# 49bfa624 25-Aug-2018 Alan Cox <alc@FreeBSD.org>

Eliminate the arena parameter to kmem_free(). Implicitly this corrects an
error in the function hypercall_memfree(), where the wrong arena was being
passed to kmem_free().

Introduce a per-page flag, VPO_KMEM_EXEC, to mark physical pages that are
mapped in kmem with execute permissions. Use this flag to determine which
arena the kmem virtual addresses are returned to.

Eliminate UMA_SLAB_KRWX. The introduction of VPO_KMEM_EXEC makes it
redundant.

Update the nearby comment for UMA_SLAB_KERNEL.

Reviewed by: kib, markj
Discussed with: jeff
Approved by: re (marius)
Differential Revision: https://reviews.freebsd.org/D16845


# 83a90bff 21-Aug-2018 Alan Cox <alc@FreeBSD.org>

Eliminate kmem_malloc()'s unused arena parameter. (The arena parameter
became unused in FreeBSD 12.x as a side-effect of the NUMA-related
changes.)

Reviewed by: kib, markj
Discussed with: jeff, re@
Differential Revision: https://reviews.freebsd.org/D16825


# 067fd858 18-Aug-2018 Alan Cox <alc@FreeBSD.org>

Eliminate the arena parameter to kmem_malloc_domain(). It is redundant.
The domain and flags parameters suffice. In fact, the related functions
kmem_alloc_{attr,contig}_domain() don't have an arena parameter.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D16713


# efb6d4a4 12-Jul-2018 Mateusz Guzik <mjg@FreeBSD.org>

uma: whack main zone counter update in the slow path, freeing side

See r333052.


# 013072f0 09-Jul-2018 Mark Johnston <markj@FreeBSD.org>

Fix pre-SI_SUB_CPU initialization of per-CPU counters.

r336020 introduced pcpu_page_alloc(), replacing page_alloc() as the
backend allocator for PCPU UMA zones. Unlike page_alloc(), it does
not honour malloc(9) flags such as M_ZERO or M_NODUMP, so fix that.

r336020 also changed counter(9) to initialize each counter using a
CPU_FOREACH() loop instead of an SMP rendezvous. Before SI_SUB_CPU,
smp_rendezvous() will only execute the callback on the current CPU
(i.e., CPU 0), so only one counter gets zeroed. The rest are zeroed
by virtue of the fact that UMA gratuitously zeroes slabs when importing
them into a zone.

Prior to SI_SUB_CPU, all_cpus is clear, so with r336020 we weren't
zeroing vm_cnt counters during boot: the CPU_FOREACH() loop had no
effect, and pcpu_page_alloc() didn't honour M_ZERO. Fix this by
iterating over the full range of CPU IDs when zeroing counters,
ignoring whether the corresponding bits in all_cpus are set.

Reported and tested by: pho (previous version)
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D16190
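
A sketch of the idea behind the fix, with zpcpu_get_cpu() assumed here as
the per-CPU slot accessor:

/* Before SI_SUB_CPU all_cpus is still clear, so CPU_FOREACH() visits
 * nothing; walk every possible CPU ID instead. */
for (cpu = 0; cpu <= mp_maxid; cpu++)
	*zpcpu_get_cpu(c, cpu) = 0;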


# a03af342 07-Jul-2018 Sean Bruno <sbruno@FreeBSD.org>

Wrap the declaration and assignment of "stripe" with #ifdef NUMA, as not
all targets are NUMA-aware.

Found with gcc.

Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16113


# ab3059a8 05-Jul-2018 Matt Macy <mmacy@FreeBSD.org>

Back pcpu zone with domain correct pages

- Change pcpu zone consumers to use a stride size of PAGE_SIZE.
(defined as UMA_PCPU_ALLOC_SIZE to make future identification easier)

- Allocate page from the correct domain for a given cpu.

- Don't initialize pc_domain to non-zero value if NUMA is not defined
There are some misconceptions surrounding this field. It is the
_VM_ NUMA domain and should only ever correspond to valid domain
values as understood by the VM.

The former slab size of sizeof(struct pcpu) was somewhat arbitrary.
The new value is PAGE_SIZE because that's the smallest granularity
which the VM can allocate a slab for a given domain. If you have
fewer than PAGE_SIZE/8 counters on your system there will be some
memory wasted, but this is obviously something where you want the
cache line to be coming from the correct domain.

Reviewed by: jeff
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15933


# c5b7751f 22-Jun-2018 Ian Lepore <ian@FreeBSD.org>

Eliminate a spurious panic on non-SMP systems (occurred on shutdown/reboot).


# b4799947 21-Jun-2018 Ruslan Bukin <br@FreeBSD.org>

Fix uma_zalloc_pcpu_arg() operation in case of !SMP build.

Reviewed by: mjg
Sponsored by: DARPA, AFRL


# 0766f278 13-Jun-2018 Jonathan T. Looney <jtl@FreeBSD.org>

Make UMA and malloc(9) return non-executable memory in most cases.

Most kernel memory that is allocated after boot does not need to be
executable. There are a few exceptions. For example, kernel modules
do need executable memory, but they don't use UMA or malloc(9). The
BPF JIT compiler also needs executable memory and did use malloc(9)
until r317072.

(Note that a side effect of r316767 was that the "small allocation"
path in UMA on amd64 already returned non-executable memory. This
meant that some calls to malloc(9) or the UMA zone(9) allocator could
return executable memory, while others could return non-executable
memory. This change makes the behavior consistent.)

This change makes malloc(9) return non-executable memory unless the new
M_EXEC flag is specified. After this change, the UMA zone(9) allocator
will always return non-executable memory, and a KASSERT will catch
attempts to use the M_EXEC flag to allocate executable memory using
uma_zalloc() or its variants.

Allocations that do need executable memory have various choices. They
may use the M_EXEC flag to malloc(9), or they may use a different VM
interface to obtain executable pages.

Now that malloc(9) again allows executable allocations, this change also
reverts most of r317072.

PR: 228927
Reviewed by: alc, kib, markj, jhb (previous version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D15691
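
For illustration, the opt-in for the rare consumer that really does need
executable memory (M_TEMP chosen arbitrarily):

/* Default malloc(9) memory is now non-executable; ask explicitly. */
p = malloc(len, M_TEMP, M_WAITOK | M_EXEC);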


# 4e180881 08-Jun-2018 Mateusz Guzik <mjg@FreeBSD.org>

uma: implement provisional api for per-cpu zones

Per-cpu zone allocations are very rarely done compared to regular zones.
The intent is to avoid pessimizing the latter case with per-cpu specific
code.

In particular, contrary to the claim in r334824, M_ZERO is sometimes used
for such zones. But the zeroing method is completely different, and
branching on it in the fast path for regular zones is a waste of time.
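
For illustration, a sketch of the provisional per-cpu API, pairing a
UMA_ZONE_PCPU zone with the uma_zalloc_pcpu() and zpcpu_get() accessors:

z = uma_zcreate("per-cpu example", sizeof(uint64_t), NULL, NULL,
    NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_PCPU);
base = uma_zalloc_pcpu(z, M_WAITOK);	/* one slot per CPU */
counter = zpcpu_get(base);		/* the current CPU's slot */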


# b8af2820 07-Jun-2018 Mateusz Guzik <mjg@FreeBSD.org>

uma: fix up r334824

Turns out there is code which ends up passing M_ZERO to counters.
Since counters zero unconditionally on their own, just drop the
flag in that place.


# ea99223e 07-Jun-2018 Mateusz Guzik <mjg@FreeBSD.org>

uma: remove M_ZERO support for pcpu zones

Nothing in the tree uses it and pcpu zones have a fundamentally different use
case than the regular zones - they are not supposed to be allocated and freed
all the time.

This reduces pollution in the allocation fast path.


# c5deaf04 07-Jun-2018 Gleb Smirnoff <glebius@FreeBSD.org>

UMA memory debugging enabled with INVARIANTS consists of two things:
trashing freed memory and checking that allocated memory is properly
trashed, and keeping a bitset of freed items. Trashing/checking
creates a lot of CPU cache poisoning, while keeping the debugging bitsets
consistent creates a lot of contention on the UMA zone lock(s). The
performance difference between an INVARIANTS kernel and a normal one is
mostly attributable to UMA debugging, rather than to all the KASSERT
checks in the kernel.

Add a loader tunable vm.debug.divisor that allows either turning off UMA
debugging completely, or turning it on only for a fraction of allocations,
while still running all KASSERTs in the kernel. That allows running
INVARIANTS kernels in production environments without reducing load by
orders of magnitude, while still doing useful extra checks.

The default value is 1, meaning debug every allocation. A value of 0
disables UMA debugging completely. Values above 1 enable debugging only
for every N-th item. It isn't possible to strictly follow the number, but
the amount of debugging is still reduced by roughly a fraction of (N-1)/N;
e.g. with N = 10, about 90% of allocations skip the debug checks.

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D15199


# e825ab8d 26-Apr-2018 Mateusz Guzik <mjg@FreeBSD.org>

uma: whack main zone counter update in the slow path

Cached counters are typically zero at this point so it performs
avoidable atomics. Everything reading them also reads the cached
ones, thus there is really no point.

Reviewed by: jeff


# 7e28037a 24-Apr-2018 Mark Johnston <markj@FreeBSD.org>

Add a UMA zone flag to disable the use of buckets.

This allows the creation of zones which don't do any caching in front of
the keg. If the zone is a cache zone, this means that UMA will not
attempt any memory allocations when allocating an item from the backend.
This is intended for use after a panic by netdump, but likely has other
applications.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D15184
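
A sketch, assuming the flag is spelled UMA_ZONE_NOBUCKET (the zone name is
hypothetical); allocations then always go straight to the keg or, for a
cache zone, to the import function:

z = uma_zcreate("panic-safe", size, NULL, NULL, NULL, NULL,
    UMA_ALIGN_PTR, UMA_ZONE_NOBUCKET);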


# b92b26ad 01-Apr-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Use UMA_SLAB_SPACE macro. No functional change here.


# 96a10340 01-Apr-2018 Gleb Smirnoff <glebius@FreeBSD.org>

In uma_startup_count() handle the special case when a zone would fit into
a single slab, but with the alignment adjustment it won't. Again, when
there is only one item in a slab, alignment can be ignored. See the
previous revision of this file for more info.

PR: 227116


# 1ca6ed45 01-Apr-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Handle a special case when a slab can fit only one allocation,
and the zone has a large alignment. With alignment taken into
account, uk_rsize will be greater than the space in a slab. However,
since we have only one item per slab, it is always naturally
aligned.

Code that will panic before this change with 4k page:

z = uma_zcreate("test", 3984, NULL, NULL, NULL, NULL, 31, 0);
uma_zalloc(z, M_WAITOK);

A practical scenario to hit the panic is a machine with 56 CPUs
and 2 NUMA domains, which yields a zone size of 3984.

PR: 227116
MFC after: 2 weeks


# e8bb2dc7 31-Mar-2018 Jeff Roberson <jeff@FreeBSD.org>

Add the flag ZONE_NOBUCKETCACHE. This flag instructs UMA not to keep
a cache of fully populated buckets. This will be used in a follow-on
commit.

The flag idea was originally from markj.

Reviewed by: markj, kib
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon


# 63b5d112 24-Mar-2018 Konstantin Belousov <kib@FreeBSD.org>

For the vm_zone_stats() sysctl handler, do not drain the sbuf by calling
copyout(9) while owning the zone lock.

Even though the sysctl buffer is wired when there is an old value,
spurious faults might still occur.

Note that we still own the uma_rwlock there, but this lock does not
participate in sensitive lock orders.

Reported and tested by: pho (as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# f7d35785 08-Feb-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Fix boot_pages exhaustion on machines with many domains and cores, where
the size of the UMA zone allocation is greater than the page size. In this
case the zone of zones cannot use UMA_MD_SMALL_ALLOC, and we need to
postpone switching this zone off of startup_alloc() until the VM is fully
launched.

o Always supply the number of VM zones to uma_startup_count(). On machines
with UMA_MD_SMALL_ALLOC ignore it completely, unless zsize goes over
a page. In the latter case count the VM zones toward the number of
allocations from the zone of zones.
o Rewrite startup_alloc() so that it will immediately switch any zone
that is already capable of running a real allocator away from itself.
In the worst-case scenario we may leak a single page here. See the comment
in uma_startup_count().
o Hardcode the call to uma_startup2() into vm_mem_init(). Otherwise some
extra SYSINITs, e.g. vm_page_init(), may sneak in before.
o While here, remove uma_boot_pages_mtx. With recent changes to boot
pages calculation, we are guaranteed to use all of the boot_pages
in the early single-threaded stage.

Reported & tested by: mav


# 5073a083 07-Feb-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Fix three miscalculations in amount of boot pages:

o Most of the startup zones have struct uma_slab embedded into the slab,
so provide the macro UMA_SLAB_SPACE and use it instead of UMA_SLAB_SIZE
when calculating how many pages a certain kind of allocation would
require. Some zones are offpage, so we might have a positive inaccuracy.
o The keg for the zone of zones is allocated "dynamically", so we
need +1 when calculating the number of pages for kegs. [1]
o The zones of zones and zones of kegs have an arbitrary alignment of 32,
and this also needs to be accounted for. [2]

While here, spread more comments and improve diagnostic messages.

Reported by: pho [1], jtl [2]


# d2be4a1e 06-Feb-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Use correct arithmetic to calculate how many pages we need for kegs
and hashes. There is no functional change with current sizes.


# e2068d0b 06-Feb-2018 Jeff Roberson <jeff@FreeBSD.org>

Use per-domain locks for vm page queue free. Move paging control from
global to per-domain state. Protect reservations with the free lock
from the domain that they belong to. Refactor to make vm domains more
of a first class object.

Reviewed by: markj, kib, gallatin
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14000


# 1616767d 06-Feb-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Improve DIAGNOSTIC printf. Report using a boot page every time regardless
of booted status.


# f4bef67c 05-Feb-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Followup on r302393 by cperciva, improving calculation of boot pages required
for UMA startup.

o Introduce another stage of UMA startup, which is entered after
vm_page_startup() finishes. After this stage we don't yet enable buckets,
but we can ask VM for pages. Rename stages to meaningful names while here.
New list of stages: BOOT_COLD, BOOT_STRAPPED, BOOT_PAGEALLOC, BOOT_BUCKETS,
BOOT_RUNNING.
Enabling page alloc earlier allows us to dramatically reduce the number of
boot pages required. More importantly, the number of zones becomes
consistent across different machines, as no MD allocations are done before
the BOOT_PAGEALLOC stage. Now only UMA-internal zones actually need to use
startup_alloc(); however, that may change, so vm_page_startup() provides
its need for early zones as an argument.
o Introduce the uma_startup_count() function, to avoid code duplication.
The function calculates the sizes of the zones zone and the kegs zone, and
how many pages UMA will need to bootstrap.
It counts not only zone structures, but also kegs, slabs and hashes.
o Hide the uma_startup_foo() declarations from the public file.
o Provide several DIAGNOSTIC printfs on boot_pages usage.
o Bugfix: when calculating the zone of zones size use (mp_maxid + 1)
instead of mp_ncpus. Use the resulting number not only in the size argument
to zone_ctor() but also as args.size.

Reviewed by: imp, gallatin (earlier version)
Differential Revision: https://reviews.freebsd.org/D14054


# b6715dab 13-Jan-2018 Jeff Roberson <jeff@FreeBSD.org>

Move VM_NUMA_ALLOC and DEVICE_NUMA under the single global config option NUMA.

Sponsored by: Netflix, Dell/EMC Isilon
Discussed with: jhb


# ab3185d1 12-Jan-2018 Jeff Roberson <jeff@FreeBSD.org>

Implement NUMA support in uma(9) and malloc(9). Allocations from specific
domains can be done by the _domain() API variants. UMA also supports a
first-touch policy via the NUMA zone flag.

The slab layer is now segregated by VM domains and is precise. It handles
iteration for round-robin directly. The per-cpu cache layer remains
a mix of domains according to where memory is allocated and freed. Well
behaved clients can achieve perfect locality with no performance penalty.

The direct domain allocation functions have to visit the slab layer and
so require per-zone locks which come at some expense.

Reviewed by: Attilio (a slightly older version)
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon


# ad5b0f5b 01-Jan-2018 Jeff Roberson <jeff@FreeBSD.org>

Fix arc after r326347 broke various memory limit queries. Use UMA features
rather than kmem arena size to determine available memory.

Initialize the UMA limit to LONG_MAX to avoid spurious wakeups on boot before
the real limit is set.

PR: 224330 (partial), 224080
Reviewed by: markj, avg
Sponsored by: Netflix / Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13494


# 200f8117 19-Dec-2017 Konstantin Belousov <kib@FreeBSD.org>

Perform all accesses to uma_reclaim_needed using atomic(9) KPI.

Reviewed by: alc, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D13534


# 952a29c0 07-Dec-2017 Mark Johnston <markj@FreeBSD.org>

Fix the UMA reclaim worker after r326347.

atomic_set_*() sets a bit in the target memory location, so
atomic_set_int(&uma_reclaim_needed, 0) does not do what it looks like
it does.

PR: 224080
Reviewed by: jeff, kib
Differential Revision: https://reviews.freebsd.org/D13412


# 2e47807c 28-Nov-2017 Jeff Roberson <jeff@FreeBSD.org>

Eliminate kmem_arena and kmem_object in preparation for further NUMA commits.

The arena argument to kmem_*() is now only used in an assert. A follow-up
commit will remove the argument altogether before we freeze the API for the
next release.

This replaces the hard limit on kmem size with a soft limit imposed by UMA.
When the soft limit is exceeded we periodically wake up the UMA reclaim
thread to attempt to shrink KVA. On 32-bit architectures this should behave
much more gracefully as we exhaust KVA. On 64-bit the limits are likely
never hit.

Reviewed by: markj, kib (some objections)
Discussed with: alc
Tested by: pho
Sponsored by: Netflix / Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13187


# fe267a55 27-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: general adoption of SPDX licensing ID tags.

Mainly focus on files that use the BSD 2-Clause license; however, the tool
I was using misidentified many licenses, so this was mostly a manual, and
error-prone, task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well-known
open-source licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.

No functional change intended.


# 772c8b67 08-Nov-2017 Konstantin Belousov <kib@FreeBSD.org>

Fix operator priority.

Sponsored by: The FreeBSD Foundation


# 8d6fbbb8 07-Nov-2017 Jeff Roberson <jeff@FreeBSD.org>

Replace many instances of VM_WAIT with blocking page allocation flags
similar to the kernel memory allocator.

This simplifies NUMA allocation because the domain will be known at wait
time and races between failure and sleeping are eliminated. This also
reduces boilerplate code and simplifies callers.

A wait primitive is supplied for uma zones for similar reasons. This
eliminates some non-specific VM_WAIT calls in favor of more explicit
sleeps that may be satisfied without new pages.

Reviewed by: alc, kib, markj
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon


# 2934eb8a 13-Sep-2017 Mark Johnston <markj@FreeBSD.org>

Fix a logic error in the item size calculation for internal UMA zones.

Kegs for internal zones always keep the slab header in the slab itself.
Therefore, when determining the allocation size, we need to take the
slab header size into account.

Reported and tested by: ae, rakuco
Reviewed by: avg
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D12342


# fe933c1d 06-Sep-2017 Mateusz Guzik <mjg@FreeBSD.org>

Start annotating global _padalign locks with __exclusive_cache_line

While these locks are guaranteed not to share their respective cache lines,
their current placement leaves unnecessary holes in the lines which precede
them.

For instance, the annotation of vm_page_queue_free_mtx allows two
neighbouring cache lines (previously separated by the lock) to be collapsed
into one.

The annotation is only effective on architectures which have it implemented
in their linker script (currently only amd64). Thus locks are not converted
to their non-padaligned variants, so as not to affect the rest.

MFC after: 1 week


# 77e19437 08-Jun-2017 Gleb Smirnoff <glebius@FreeBSD.org>

When we are in UMA_STARTUP use startup_alloc() for any zone, not only for
internal zones. This allows creating new zones at early stages of boot,
without the need to mark them as internal to UMA, which isn't always true.

Reviewed by: alc


# 1431a748 01-Jun-2017 Gleb Smirnoff <glebius@FreeBSD.org>

As old prophecy says, some day UMA_DEBUG printfs shall be made CTRs.


# ac0a6fd0 01-Jun-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Simplify boot pages management in UMA.

It is simply a contiguous virtual memory pointer and a number of pages.
There is no need to build a linked list here; just increment the pointer
and decrement the counter. The only functional difference from the old
allocator is that before we gave pages from topmost down to lowest, and
now we give them in normal ascending order.

While here remove padalign from a mutex that is unused at runtime.

Reviewed by: alc


# a5a35578 04-Apr-2017 John Baldwin <jhb@FreeBSD.org>

Assert that the align parameter to uma_zcreate() is valid.

Reviewed by: kib
MFC after: 1 week
Sponsored by: DARPA / AFRL
Differential Revision: https://reviews.freebsd.org/D10100


# 57223e99 11-Mar-2017 Andriy Gapon <avg@FreeBSD.org>

uma: fix pages <-> items conversions in several places

Those places were not taking uk_ppera into account.
At present one allocation is always used by one slab, so uk_ppera must
be used to convert between pages and slabs.
uk_ipers is used to convert between slabs and items.

MFC after: 1 month (if ever)


# a55ebb7c 11-Mar-2017 Andriy Gapon <avg@FreeBSD.org>

uma: eliminate uk_slabsize field

The field was not used beyond the initial keg setup stage anyway.

MFC after: 1 month (if ever)


# 9b43bc27 25-Feb-2017 Andriy Gapon <avg@FreeBSD.org>

call vm_lowmem hook in uma_reclaim_worker

A comment near kmem_reclaim() implies that we already did that.
Calling the hook is useful, because some handlers, e.g. ARC,
might be able to release significant amounts of KVA.

Now that we have more than one place where the vm_lowmem hook is called,
use this change as an opportunity to introduce flags that describe
the reason for calling the hook. No handler makes use of the flags yet.

Reviewed by: markj, kib
MFC after: 1 week
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D9764
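
For illustration, a sketch of a handler hooked up via EVENTHANDLER(9); the
handler name is hypothetical and the flags parameter is the one introduced
here:

static void
foo_lowmem(void *arg __unused, int flags)
{
	/* Release caches/KVA; 'flags' describes why the hook fired. */
}

/* Registered at init time: */
EVENTHANDLER_REGISTER(vm_lowmem, foo_lowmem, NULL, EVENTHANDLER_PRI_FIRST);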


# b5345ef1 02-Jan-2017 Justin Hibbits <jhibbits@FreeBSD.org>

Print flags in hex instead of decimal.

Hex is easier to grok for flags, and consistent with other prints.


# 829be516 20-Oct-2016 Mark Johnston <markj@FreeBSD.org>

Simplify keg_drain() a bit by using LIST_FOREACH_SAFE.

MFC after: 1 week


# afa5d703 19-Jul-2016 Mark Johnston <markj@FreeBSD.org>

Release the second critical section in uma_zfree_arg() slightly earlier.

It is only needed when removing a full bucket from the per-CPU cache. The
bucket cache (uz_buckets) is protected by the zone mutex and thus the
critical section can be released before inserting into that list.

MFC after: 1 week


# 96c85efb 06-Jul-2016 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

Replace a number of conflations of mp_ncpus and mp_maxid with either
mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in
the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try,
for example, to run tasks on CPUs that did not exist or to allocate too
few buffers on systems with sparse CPU IDs in which there are holes in the
range and mp_maxid > mp_ncpus. Such circumstances generally occur on
systems with SMT, but on which SMT is disabled. This patch restores system
operation at least on POWER8 systems configured in this way.

There are a number of other places in the kernel with potential problems
in these situations, but where sparse CPU IDs are not currently known
to occur, mostly in the ARM machine-dependent code. These will be fixed
in a follow-up commit after the stable/11 branch.

PR: kern/210106
Reviewed by: jhb
Approved by: re (glebius)


# bc9d08e1 01-Jun-2016 Mark Johnston <markj@FreeBSD.org>

Fix memguard(9) in kernels with INVARIANTS enabled.

With r284861, UMA zones use the trash ctor and dtor by default. This is
incompatible with memguard, which frees the backing page when the item
is freed. Modify the UMA debug functions to be no-ops if the item was
allocated from memguard. This also fixes constructors such as
mb_ctor_pack(), which invokes the trash ctor in addition to performing
some initialization.

Reviewed by: glebius
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D6562


# 763df3ec 02-May-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/vm: minor spelling fixes in comments.

No functional change.


# cfcae3f8 29-Feb-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Remove UMA_ZONE_REFCNT feature, now unused.

Blessed by: jeff


# e60b2fcb 03-Feb-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Redo r292484. Embed task(9) into zone, so that uz_maxaction is called
in a context that can sleep, allowing consumers of the KPI to run their
drain routines without any extra measures.

Discussed with: jtl


# 9542ea7b 03-Feb-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Move uma_dbg_alloc() and uma_dbg_free() into uma_core.c, which allows
to make uma_dbg.h not depend on uma_int.h, which allows to uninclude
uma_int.h from the mbuf(9) allocator.


# 54503a13 19-Dec-2015 Jonathan T. Looney <jtl@FreeBSD.org>

Add a safety net to reclaim mbufs when one of the mbuf zones become
exhausted.

It is possible for a bug in the code (or, theoretically, even unusual
network conditions) to exhaust all possible mbufs or mbuf clusters.
When this occurs, things can grind to a halt fairly quickly. However,
we currently do not call mb_reclaim() unless the entire system is
experiencing a low-memory condition.

While it is best to try to prevent exhaustion of one of the mbuf zones,
it would also be useful to have a mechanism to attempt to recover from
these situations by freeing "expendable" mbufs.

This patch makes two changes:

a) The patch adds a generic API to the UMA zone allocator to set a
function that should be called when an allocation fails because the
zone limit has been reached. Because of the way this function can be
called, it really should do minimal work.

b) The patch uses this API to try to free mbufs when an allocation
fails from one of the mbuf zones because the zone limit has been
reached. The function schedules a callout to run mb_reclaim().

Differential Revision: https://reviews.freebsd.org/D3864
Reviewed by: gnn
Comments by: rrs, glebius
MFC after: 2 weeks
Sponsored by: Juniper Networks
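
For illustration, a sketch of the new API wired to an mbuf-style callback;
the callback name is hypothetical and its exact signature is assumed:

static void
mbuf_limit_reached(uma_zone_t zone)
{
	/* Minimal work only: schedule a callout that runs mb_reclaim(). */
}

/* At zone setup time: */
uma_zone_set_maxaction(zone, mbuf_limit_reached);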


# d9e2e68d 11-Dec-2015 Mark Johnston <markj@FreeBSD.org>

Don't make assertions about td_critnest when the scheduler is stopped.

A panicking thread always executes with a critical section held, so any
attempt to allocate or free memory while dumping will otherwise cause a
second panic. This can occur, for example, if xpt_polled_action() completes
non-dump I/O that was pending at the time of the panic. The fact that this
can occur is itself a bug, but asserting in this case does little but
reduce the reliability of kernel dumps.

Suggested by: kib
Reported by: pho


# 1067a2ba 19-Nov-2015 Jonathan T. Looney <jtl@FreeBSD.org>

Consistently enforce the restriction against calling malloc/free when in a
critical section.

uma_zalloc_arg()/uma_zfree_arg() may acquire a sleepable lock on the
zone. The malloc() family of functions may call uma_zalloc_arg() or
uma_zfree_arg().

The malloc(9) man page currently claims that free() will never sleep.
It also implies that the malloc() family of functions will not sleep
when called with M_NOWAIT. However, it is more correct to say that
these functions will not sleep indefinitely. Indeed, they may acquire
a sleepable lock. However, a developer may overlook this restriction
because the WITNESS check that catches attempts to call the malloc()
family of functions within a critical section is inconsistently
applied.

This change updates the language of the malloc(9) man page to clarify
the restriction against calling the malloc() family of functions
while in a critical section or holding a spin lock. It also adds
KASSERTs at appropriate points to make the enforcement of this
restriction more consistent.

PR: 204633
Differential Revision: https://reviews.freebsd.org/D4197
Reviewed by: markj
Approved by: gnn (mentor)
Sponsored by: Juniper Networks


# 087a6132 26-Sep-2015 Alan Cox <alc@FreeBSD.org>

Exploit r288122 to address a cosmetic issue. Since the pages allocated
by noobj_alloc() don't belong to a vm object, they can't be paged out.
Since they can't be paged out, they are never enqueued in a paging queue.
Nonetheless, passing PQ_INACTIVE to vm_page_unwire() creates the appearance
that these pages are being enqueued in the inactive queue. As of r288122,
we can avoid giving this false impression by passing PQ_NONE.
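
The change reduces to picking the right queue argument:

/* These pages have no object and are never paged out. */
vm_page_unwire(m, PQ_NONE);	/* previously PQ_INACTIVE */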

Submitted by: kmacy
Differential Revision: https://reviews.freebsd.org/D1674


# 19c591bf 02-Sep-2015 Mateusz Guzik <mjg@FreeBSD.org>

Don't trash memory from UMA_ZONE_NOFREE zones.

Objects obtained from such zones are supposed to retain type stability,
which was violated by aforementioned trashing.

This is a follow-up to r284861.

Discussed with: kib


# e866d8f0 21-Aug-2015 Mark Murray <markm@FreeBSD.org>

Make the UMA harvesting go away completely if not wanted. Default to "not wanted".
Provide and document the RANDOM_ENABLE_UMA option.

Change RANDOM_FAST to RANDOM_UMA to clarify the harvesting.

Remove RANDOM_DEBUG option, replace with SDT probes. These will be of
use to folks measuring the harvesting effect when deciding whether to
use RANDOM_ENABLE_UMA.

Requested by: scottl and others.
Approved by: so (/dev/random blanket)
Differential Revision: https://reviews.freebsd.org/D3197


# 9ba30bcb 10-Aug-2015 Zbigniew Bodek <zbb@FreeBSD.org>

Avoid sign extension of value passed to kva_alloc from uma_zone_reserve_kva

Fixes "panic: vm_radix_reserve_kva: unable to reserve KVA" caused by sign
extension of the "pages * UMA_SLAB_SIZE" value passed to kva_alloc(), which
takes unsigned long argument.

In the erroneous case that triggered this bug, the number of pages
to allocate in uma_zone_reserve_kva() was 0x8ebe6, which gave a
total number of bytes to allocate equal to 0x8ebe6000 (int).
This was then sign extended in kva_alloc() to 0xffffffff8ebe6000
(unsigned long).
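
A sketch of the bug and the fix (the cast placement is illustrative):

/*
 * Broken: the product is computed as a 32-bit int and then
 * sign-extended, so 0x8ebe6 * 4096 = 0x8ebe6000 becomes
 * 0xffffffff8ebe6000 when widened to unsigned long.
 */
kva = kva_alloc(pages * UMA_SLAB_SIZE);

/* Fixed: widen before multiplying. */
kva = kva_alloc((vm_size_t)pages * UMA_SLAB_SIZE);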

Reviewed by: alc, kib
Submitted by: Zbigniew Bodek <zbb@semihalf.com>
Obtained from: Semihalf
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D3346


# d1b06863 30-Jun-2015 Mark Murray <markm@FreeBSD.org>

Huge cleanup of random(4) code.

* GENERAL
- Update copyright.
- Make kernel options for RANDOM_YARROW and RANDOM_DUMMY. Set
neither to ON, which means we want Fortuna.
- If there is no 'device random' in the kernel, there will be NO
random(4) device in the kernel, and the KERN_ARND sysctl will
return nothing. With RANDOM_DUMMY there will be a random(4) that
always blocks.
- Repair kern.arandom (KERN_ARND sysctl). The old version went
through arc4random(9) and was a bit weird.
- Adjust arc4random stirring a bit - the existing code looks a little
suspect.
- Fix the nasty pre- and post-read overloading by providing explicit
functions to do these tasks.
- Redo read_random(9) so as to duplicate random(4)'s read internals.
This makes it a first-class citizen rather than a hack.
- Move stuff out of locked regions when it does not need to be
there.
- Trim RANDOM_DEBUG printfs. Some are excess to requirement, some
behind boot verbose.
- Use SYSINIT to sequence the startup.
- Fix init/deinit sysctl stuff.
- Make relevant sysctls also tunables.
- Add different harvesting "styles" to allow for different requirements
(direct, queue, fast).
- Add harvesting of FFS atime events. This needs to be checked to ensure
it does not weigh down the FS code.
- Add harvesting of slab allocator events. This needs to be checked to
ensure it does not weigh down the allocator code.
- Fix the random(9) manpage.
- Loadable modules are not present for now. These will be re-engineered
when the dust settles.
- Use macros for locks.
- Fix comments.

* src/share/man/...
- Update the man pages.

* src/etc/...
- The startup/shutdown work is done in D2924.

* src/UPDATING
- Add UPDATING announcement.

* src/sys/dev/random/build.sh
- Add copyright.
- Add libz for unit tests.

* src/sys/dev/random/dummy.c
- Remove; no longer needed. Functionality incorporated into randomdev.*.

* live_entropy_sources.c live_entropy_sources.h
- Remove; content moved.
- move content to randomdev.[ch] and optimise.

* src/sys/dev/random/random_adaptors.c src/sys/dev/random/random_adaptors.h
- Remove; plugability is no longer used. Compile-time algorithm
selection is the way to go.

* src/sys/dev/random/random_harvestq.c src/sys/dev/random/random_harvestq.h
- Add early (re)boot-time randomness caching.

* src/sys/dev/random/randomdev_soft.c src/sys/dev/random/randomdev_soft.h
- Remove; no longer needed.

* src/sys/dev/random/uint128.h
- Provide a fake uint128_t; if a real one ever arrives, we can use
that instead. All that is needed here is N=0, N++, N==0, and some
localised trickery is used to manufacture a 128-bit 0ULLL.

* src/sys/dev/random/unit_test.c src/sys/dev/random/unit_test.h
- Improve unit tests; previously the testing human needed clairvoyance;
now the test will do a basic check of compressibility. Clairvoyant
talent is still a good idea.
- This is still a long way off a proper unit test.

* src/sys/dev/random/fortuna.c src/sys/dev/random/fortuna.h
- Improve messy union to just uint128_t.
- Remove unneeded 'static struct fortuna_start_cache'.
- Tighten up arithmetic.
- Provide a method to allow external junk to be introduced; harden
it against blatant abuse by compressing/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explicit
functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])

* src/sys/dev/random/yarrow.c src/sys/dev/random/yarrow.h
- Improve messy union to just uint128_t.
- Remove unneeded 'static struct start_cache'.
- Tighten up arithmetic.
- Provide a method to allow external junk to be introduced; harden
it against blatant abuse by compressing/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explicit
functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])
- Fix some magic numbers elsewhere used as FAST and SLOW.

Differential Revision: https://reviews.freebsd.org/D2025
Reviewed by: vsevolod,delphij,rwatson,trasz,jmg
Approved by: so (delphij)


# afc6dc36 25-Jun-2015 John-Mark Gurney <jmg@FreeBSD.org>

If INVARIANTS is specified, add ctor/dtor to junk memory if they are
unspecified...

Submitted by: Suresh Gumpula at Netapp
Differential Revision: https://reviews.freebsd.org/D2725


# fd90e2ed 22-May-2015 Jung-uk Kim <jkim@FreeBSD.org>

CALLOUT_MPSAFE has lost its meaning since r141428, i.e., for more than ten
years on head. However, it continues to be misused as the mpsafe argument
for callout_init(9). Deprecate the flag and clean up callout_init() calls
to make them more consistent.
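
The cleanup is mechanical; a before/after sketch using UMA's own timer
callout:

callout_init(&uma_callout, CALLOUT_MPSAFE);	/* before */
callout_init(&uma_callout, 1);			/* after */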

Differential Revision: https://reviews.freebsd.org/D2613
Reviewed by: jhb
MFC after: 2 weeks


# 44ec2b63 09-May-2015 Konstantin Belousov <kib@FreeBSD.org>

The vmem callback to reclaim kmem arena address space on low or
fragmented conditions currently just wakes up the pagedaemon. The
kmem arena is significantly smaller than the total available physical
memory, which means that there are loads where kmem arena space could
be exhausted while plenty of pages are still available. The woken-up
pagedaemon sees vm_pages_needed != 0, finds that vm_paging_needed() is
false, clears the pass and goes back to sleep, calling neither
uma_reclaim() nor the lowmem handler.

To handle low kmem arena conditions, create additional pagedaemon
thread which calls uma_reclaim() directly. The thread sleeps on the
dedicated channel and kmem_reclaim() wakes the thread in addition to
the pagedaemon.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# d74e6a1d 20-Apr-2015 Alan Cox <alc@FreeBSD.org>

Eliminate an unused variable.

MFC after: 1 week


# 51cfb0be 12-Apr-2015 Dmitry Chagin <dchagin@FreeBSD.org>

Rework r281162. Indeed, the flexible array member is preferable here.

Suggested by: Justin T. Gibbs

MFC after: 3 days


# 16be9f54 10-Apr-2015 Gleb Smirnoff <glebius@FreeBSD.org>

A UMA zone limit can be lowered, so remove the protection against
lowering it from sysctl_handle_uma_zone_max().

Sponsored by: Nginx, Inc.


# 6723fdfe 06-Apr-2015 Dmitry Chagin <dchagin@FreeBSD.org>

Properly calculate the "UMA Zones" per-CPU cache size. Avoid allocating
an extra struct uma_cache, since struct uma_zone already has one.

PR: 199169
Submitted by: luke.tw gmail com
MFC after: 1 week


# 1d2c0c46 05-Apr-2015 Dmitry Chagin <dchagin@FreeBSD.org>

Fix a wrong KASSERT message in UMA.

PR: 199172
Submitted by: luke.tw gmail com
MFC after: 1 week


# f2c2231e 31-Mar-2015 Ryan Stone <rstone@FreeBSD.org>

Fix integer truncation bug in malloc(9)

A couple of internal functions used by malloc(9) and uma truncated
a size_t down to an int. This could cause any number of issues
(e.g. indefinite sleeps, memory corruption) if any kernel
subsystem tried to allocate 2GB or more through malloc. zfs would
attempt such an allocation when run on a system with 2TB or more
of RAM.

Note to self: When this is MFCed, sparc64 needs the same fix.
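
A minimal illustration of the truncation, using the reported size:

size_t size = 2UL * 1024 * 1024 * 1024;	/* 2GB zfs request */
int bytes = size;	/* 0x80000000 truncates to INT_MIN on LP64 */
/*
 * Subsequent comparisons and rounding arithmetic on 'bytes' now
 * misbehave; the fix is to carry size_t through the internal functions.
 */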

Differential revision: https://reviews.freebsd.org/D2106
Reviewed by: kib
Reported by: Michael Fuckner <michael@fuckner.net>
Tested by: Michael Fuckner <michael@fuckner.net>
MFC after: 2 weeks


# 1eafc078 14-Mar-2015 Ian Lepore <ian@FreeBSD.org>

Set the SBUF_INCLUDENUL flag in sbuf_new_for_sysctl() so that sysctl
strings returned to userland include the nulterm byte.

Some uses of sbuf_new_for_sysctl() write binary data rather than strings;
clear the SBUF_INCLUDENUL flag after calling sbuf_new_for_sysctl() in
those cases. (Note that the sbuf code still automatically adds a nulterm
byte in sbuf_finish(), but since it's not included in the length it won't
get copied to userland along with the binary data.)

Remove explicit adding of a nulterm byte in a couple places now that it
gets done automatically by the sbuf drain code.
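
For a binary-data handler the pattern looks roughly like this sketch:

struct sbuf sbuf;

sbuf_new_for_sysctl(&sbuf, NULL, 128, req);
sbuf_clear_flags(&sbuf, SBUF_INCLUDENUL);	/* binary: no nulterm */
/* ... sbuf_bcat() the binary records ... */
error = sbuf_finish(&sbuf);
sbuf_delete(&sbuf);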

PR: 195668


# 67c44fa3 31-Dec-2014 Alan Cox <alc@FreeBSD.org>

Eliminate a stale debug message. The per-CPU cache locks were replaced
by critical sections in r145686.

PR: 193254
Submitted by: luke.tw@gmail.com
MFC after: 3 days


# 95c4bf75 30-Nov-2014 Konstantin Belousov <kib@FreeBSD.org>

Provide mutual exclusion between zone allocation/destruction and
uma_reclaim(). Reclamation code must not see half-constructed or
half-destructed zones. Do this by bracketing uma_zcreate() and
uma_zdestroy() with a shared-locked sx, and take the sx exclusively in
uma_reclaim().

Usually zones are not created/destroyed during the system operation,
but tmpfs mounts do cause zone operations and exposed the bug.

Another solution could be to expose a new keg on the uma_kegs list
only after the corresponding zone is fully constructed, with similar
treatment for destruction. But that would probably require riskier
code rearrangement as well.
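
The bracketing looks roughly like the following (the lock name is
illustrative):

/* uma_zcreate() / uma_zdestroy(): may run concurrently. */
sx_slock(&uma_drain_lock);
/* ... construct or destroy the zone ... */
sx_sunlock(&uma_drain_lock);

/* uma_reclaim(): excludes creation and destruction entirely. */
sx_xlock(&uma_drain_lock);
/* ... only fully constructed zones are visible here ... */
sx_xunlock(&uma_drain_lock);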

Reported and tested by: pho
Discussed with: avg
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 10cb2424 30-Oct-2014 Mark Murray <markm@FreeBSD.org>

This is the much-discussed major upgrade to the random(4) device, known to you all as /dev/random.

This code has had an extensive rewrite and a good series of reviews, both by the author and other parties. This means a lot of code has been simplified. Pluggable structures for high-rate entropy generators are available, and it is most definitely not the case that /dev/random can be driven by only a hardware source any more. This has been designed out of the device. Hardware sources are stirred into the CSPRNG (Yarrow, Fortuna) like any other entropy source. Pluggable modules may be written by third parties for additional sources.

The harvesting structures and consequently the locking have been simplified. Entropy harvesting is done in a more general way (the documentation for this will follow). There is some GREAT entropy to be had in the UMA allocator, but it is disabled for now as messing with that is likely to annoy many people.

The venerable (but effective) Yarrow algorithm, which is no longer supported by its authors, now has an alternative, Fortuna. For now, Yarrow is retained as the default algorithm, but this may be changed using a kernel option. It is intended to make Fortuna the default algorithm for 11.0. Interested parties are encouraged to read ISBN 978-0-470-47424-2 "Cryptography Engineering" by Ferguson, Schneier and Kohno for Fortuna's gory details. Heck, read it anyway.

Many thanks to Arthur Mesh who did early grunt work, and who got caught in the crossfire rather more than he deserved to.

My thanks also to folks who helped me thresh this out on whiteboards and in the odd "Hallway track", or otherwise.

My Nomex pants are on. Let the feedback commence!

Reviewed by: trasz,des(partial),imp(partial?),rwatson(partial?)
Approved by: so(des)


# 111fbcd5 05-Oct-2014 Bryan Venteicher <bryanv@FreeBSD.org>

Change the UMA mutex into a rwlock

Acquire the lock in read mode when it is needed only to ensure the
stability of the keg list. The UMA lock may be held for a long time
(relatively speaking) in uma_reclaim() on machines with lots of
zones/kegs. If uma_timeout() were to fire during that period, subsequent
callouts on that CPU could be significantly delayed.

Reviewed by: jhb


# 6e5254e0 04-Oct-2014 Bryan Venteicher <bryanv@FreeBSD.org>

Remove stray uma_mtx lock/unlock in zone_drain_wait()

Callers of zone_drain_wait(M_WAITOK) do not need to hold the uma_mtx
(and were not holding it), but we would attempt to unlock and relock the
mutex if we had to sleep because the zone was already draining. The
M_NOWAIT callers may hold the uma_mtx, but we do not sleep in that case.

Reviewed by: jhb
MFC after: 3 days


# af3b2549 27-Jun-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Pull in r267961 and r267973 again. Fix for issues reported will follow.


# 37a107a4 27-Jun-2014 Glen Barber <gjb@FreeBSD.org>

Revert r267961, r267973:

These changes prevent sysctl(8) from returning proper output,
such as:

1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory


# 3da1cf1e 27-Jun-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types, both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function, allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating the children of a static/extern SYSCTL
node. This operation should probably be made into a factored-out
common macro, since some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading kludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, since malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after: 2 weeks
Sponsored by: Mellanox Technologies


# 3ae10f74 16-Jun-2014 Attilio Rao <attilio@FreeBSD.org>

- Modify vm_page_unwire() and vm_page_enqueue() to directly accept
the queue in which to enqueue pages that are going to be unwired.
- Add stronger checks to the enqueue/dequeue for the pagequeues when
adding and removing pages to them.

Of course, for unmanaged pages the queue parameter of vm_page_unwire()
will be ignored, just as the active parameter is today.
This makes adding new pagequeues quicker.

This change effectively modifies the KPI. __FreeBSD_version will,
however, be bumped only when the full cache of free pages is evicted.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho


# 1aa6c758 12-Jun-2014 Alexander Motin <mav@FreeBSD.org>

Introduce new "256 Bucket" zone to split requests and reduce congestion
on "128 Bucket" zone lock.

MFC after: 2 weeks
Sponsored by: iXsystems, Inc.


# 20d3ab87 12-Jun-2014 Alexander Motin <mav@FreeBSD.org>

When allocating a new bucket for a bucket zone, never take it from the
zone itself, since that will almost certainly fail. Take the next bigger
zone instead.

This situation should not happen with the original bucket zone
configuration: the "32 Bucket" zone uses "64 Bucket" and vice versa. But
if the "64 Bucket" zone lock is congested, the zone may grow its bucket
size and start biting itself.

MFC after: 2 weeks
Sponsored by: iXsystems, Inc.


# 2367b4dd 14-Feb-2014 Dimitry Andric <dim@FreeBSD.org>

After r251709, avoid a clang 3.4 warning about an unused static const
variable (uma_max_ipers), when asserts are disabled.

Reviewed by: glebius
MFC after: 3 days


# 48343a2f 10-Feb-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Make M_ZERO flag work correctly on UMA_ZONE_PCPU zones.

Sponsored by: Nginx, Inc.


# 0a5a3ccb 07-Feb-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Provide macros that make it easy to export uma(9) zone limits and
current usage via sysctl(9):

SYSCTL_UMA_MAX()
SYSCTL_ADD_UMA_MAX()
SYSCTL_UMA_CUR()
SYSCTL_ADD_UMA_CUR()
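
A hypothetical consumer; the zone, the oid names, and the exact argument
order are assumptions following the usual SYSCTL_* conventions:

SYSCTL_UMA_MAX(_kern_example, OID_AUTO, zone_max, CTLFLAG_RW,
    example_zone, "Limit on items in the example zone");
SYSCTL_UMA_CUR(_kern_example, OID_AUTO, zone_cur, CTLFLAG_RD,
    example_zone, "Current items allocated from the example zone");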

Sponsored by: Nginx, Inc.


# a3845534 29-Nov-2013 Craig Rodrigues <rodrigc@FreeBSD.org>

In keg_dtor(), print out the keg name in the "Freed UMA keg was not empty"
message printed to the console. This makes it easier to track down
the source of certain memory leaks.

Suggested by: adrian


# 03175483 28-Nov-2013 Alexander Motin <mav@FreeBSD.org>

- Add bucket size column to `show uma` DDB command.
- Add a `show umacache` command to show similar stats for cache-only UMA zones.


# cec48e00 27-Nov-2013 Alexander Motin <mav@FreeBSD.org>

Make UMA not blindly force offpage slab header allocation for large
(> PAGE_SIZE) zones. If the zone size is not a multiple of PAGE_SIZE,
there may be enough space for the header in the last page, so we may
avoid extra header memory allocation and hash table update/lookup.

ZFS creates a bunch of odd-sized UMA zones (5120, 6144, 7168, 10240,
14336). This change makes good use of at least some of the otherwise
lost memory there.

Reviewed by: avg


# f7104ccd 27-Nov-2013 Alexander Motin <mav@FreeBSD.org>

Don't count bucket allocation failures for UMA zones as their own
failures. There are good reasons for these to happen, such as recursion
prevention, and they are not fatal since buckets are just an optimization
mechanism. Real bucket allocation failures are counted by the bucket
zones themselves anyway, and we don't need double accounting there.


# e8a720fe 27-Nov-2013 Alexander Motin <mav@FreeBSD.org>

Fix a bug introduced in r252226, where the udata argument passed to
bucket_alloc() was used without first making sure that it was really
intended for us.

On some of my systems this bug made the user argument passed by ZFS code
to uma_zalloc_arg() unexpectedly block UMA per-CPU caches for those
zones.


# 8a8d9d14 23-Nov-2013 Alexander Motin <mav@FreeBSD.org>

When purging per-CPU UMA caches, do not return empty buckets to the
global full-bucket cache, so as not to trigger an assertion if an
allocation happens before that global cache gets purged.


# a2de44ab 19-Nov-2013 Alexander Motin <mav@FreeBSD.org>

Implement mechanism to safely but slowly purge UMA per-CPU caches.

This is a last resort for very low memory conditions, in case other
measures to free memory were ineffective. Sequentially cycle through all
CPUs and extract per-CPU cache buckets into the zone cache, from where
they can be freed.


# 4d104ba0 19-Nov-2013 Alexander Motin <mav@FreeBSD.org>

Grow UMA zone bucket size also on lock congestion during item free.

Lock congestion is the same whether it happens on alloc or free, so
handle it equally. Now that we have back pressure, there is no problem
with growing buckets a bit faster. Growth is in any case much slower
than in 9.x.


# f3932e90 19-Nov-2013 Alexander Motin <mav@FreeBSD.org>

Add two new UMA bucket zones to store 3 and 9 items per bucket.

These new buckets make bucket size self-tuning softer and more precise.
Without them there are buckets for 1, 5, 13, 29, ... items. While at
bigger sizes a difference of about 2x is fine, at the smallest sizes it
is 5x and 2.6x respectively. The new buckets make that line look like
1, 3, 5, 9, 13, 29, reducing the jumps between steps, making the
algorithm work more smoothly, and allocating and freeing memory in
better-fitting chunks. Otherwise there is quite a big gap between
allocating 128K and 5x128K of RAM at once.


# ace66b56 19-Nov-2013 Alexander Motin <mav@FreeBSD.org>

Implement soft pressure on UMA cache bucket sizes.

Every time the system detects a low-memory condition, decrease the
bucket size of each zone by one item. As a result, higher memory
pressure will push toward smaller bucket sizes, and so smaller per-CPU
caches, and so more efficient memory use.

Before this change there was no force opposing bucket growth resulting
from practically inevitable zone lock conflicts, and after some run time
the per-CPU caches could consume enough RAM to kill the system.


# 1645995b 31-Aug-2013 Kirk McKusick <mckusick@FreeBSD.org>

Fix bug introduced in rewrite of keg_free_slab in -r251894.
The consequence of the bug is that fini calls are not done
when a slab is freed by a call-back from the page daemon.
It went unnoticed for two months because fini is little used.

I spotted the bug while reading the code to learn how it works
so I could write it up for the next edition of the Design and
Implementation of FreeBSD book.

No MFC needed as this code exists only in HEAD.

Reviewed by: kib, jeff
Tested by: pho


# c325e866 10-Aug-2013 Konstantin Belousov <kib@FreeBSD.org>

Different consumers of the struct vm_page abuse pageq member to keep
additional information, when the page is guaranteed to not belong to a
paging queue. Usually, this results in a lot of type casts which make
reasoning about the code correctness harder.

Sometimes m->object is used instead of pageq, which could cause real
and confusing bugs if non-NULL m->object is leaked. See r141955 and
r253140 for examples.

Change the pageq member into a union containing explicitly-typed
members. Use them instead of type-punning or abusing m->object in x86
pmaps, uma and vm_page_alloc_contig().

Requested and reviewed by: alc
Sponsored by: The FreeBSD Foundation


# 5df87b21 07-Aug-2013 Jeff Roberson <jeff@FreeBSD.org>

Replace kernel virtual address space allocation with vmem. This provides
transparent layering and better fragmentation.

- Normalize functions that allocate memory to use kmem_*
- Those that allocate address space are named kva_*
- Those that operate on maps are named kmap_*
- Implement recursive allocation handling for kmem_arena in vmem.

Reviewed by: alc
Tested by: pho
Sponsored by: EMC / Isilon Storage Division


# e28a647d 23-Jul-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Revert r249590 and, in case mp_ncpus isn't initialized, use MAXCPU. This
allows us to init the counter zone at an early stage of boot.

Reviewed by: kib
Tested by: Lytochkin Boris <lytboris gmail.com>


# a1dff920 28-Jun-2013 Davide Italiano <davide@FreeBSD.org>

Remove a spurious keg lock acquisition.


# 6fd34d6f 25-Jun-2013 Jeff Roberson <jeff@FreeBSD.org>

- Resolve bucket recursion issues by passing a cookie with zone flags
through bucket_alloc() to uma_zalloc_arg() and uma_zfree_arg().
- Make some smaller buckets for large zones to further reduce memory
waste.
- Implement uma_zone_reserve(). This holds aside a number of items only
for callers who specify M_USE_RESERVE. buckets will never be filled
from reserve allocations.

Sponsored by: EMC / Isilon Storage Division


# af526374 20-Jun-2013 Jeff Roberson <jeff@FreeBSD.org>

- Add a per-zone lock for zones without kegs.
- Be more explicit about zone vs keg locking. This functionally changes
almost nothing.
- Add a size parameter to uma_zcache_create() so we can size the buckets.
- Pass the zone to bucket_alloc() so it can modify allocation flags
as appropriate.
- Fix a bug in zone_alloc_bucket() where I missed an address of operator
in a failure case. (Found by pho)

Sponsored by: EMC / Isilon Storage Division


# 8aaf680e 18-Jun-2013 Jeff Roberson <jeff@FreeBSD.org>

- Persist the caller's flags in the bucket allocation flags so we don't
lose a M_NOVM when we recurse into a bucket allocation.

Sponsored by: EMC / Isilon Storage Division


# fc03d22b 17-Jun-2013 Jeff Roberson <jeff@FreeBSD.org>

Refine UMA bucket allocation to reduce space consumption and improve
performance.

- Always free to the alloc bucket if there is space. This gives LIFO
allocation order to improve hot-cache performance. This also allows
for zones with a single bucket per-cpu rather than a pair if the entire
working set fits in one bucket.
- Enable per-cpu caches of buckets. To prevent recursive bucket
allocation one bucket zone still has per-cpu caches disabled.
- Pick the initial bucket size based on a table driven maximum size
per-bucket rather than the number of items per-page. This gives
more sane initial sizes.
- Only grow the bucket size when we face contention on the zone lock, this
causes bucket sizes to grow more slowly.
- Adjust the number of items per-bucket to account for the header space.
This packs the buckets more efficiently per-page while making them
not quite powers of two.
- Eliminate the per-zone free bucket list. Always return buckets back
to the bucket zone. This ensures that as zones grow into larger
bucket sizes they eventually discard the smaller sizes. It persists
fewer buckets in the system. The locking is slightly trickier.
- Only switch buckets in zalloc, not zfree, this eliminates pathological
cases where we ping-pong between two buckets.
- Ensure that the thread that fills a new bucket gets to allocate from
it to give a better upper bound on allocation time.

Sponsored by: EMC / Isilon Storage Division


# 0095a784 16-Jun-2013 Jeff Roberson <jeff@FreeBSD.org>

- Add a new UMA API: uma_zcache_create(). This makes a zone without any
backing memory that is only a container for per-cpu caches of arbitrary
pointer items. These zones have no kegs.
- Convert the regular keg based allocator to use the new import/release
functions.
- Move some stats to be atomics since they would require excessive zone
locking/unlocking with the new import/release paradigm. Make
zone_free_item simpler now that callers can manage more stats.
- Check for these cache-only zones in the public APIs and debugging
code by checking zone_first_keg() against NULL.

Sponsored by: EMC / Isilon Storage Division


# ef72505e 13-Jun-2013 Jeff Roberson <jeff@FreeBSD.org>

- Convert the slab free item list from a linked array of indices to a
bitmap using sys/bitset. This is much simpler, has lower space
overhead and is cheaper in most cases.
- Use a second bitmap for invariants asserts and improve the quality of
the asserts as well as the number of erroneous conditions that we will
catch.
- Drastically simplify sizing code. Special case refcnt zones since they
will be going away.
- Update stale comments.

Sponsored by: EMC / Isilon Storage Division


# 08a3102c 22-Apr-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Panic if a UMA_ZONE_PCPU zone is created at an early stage of boot, when
mp_ncpus isn't yet initialized. Otherwise we will panic at the first
allocation later.

Sponsored by: Nginx, Inc.


# 85dcf349 09-Apr-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Convert UMA code to C99 uintXX_t types.


# 025071f2 08-Apr-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Fix KASSERTs: maximum number of items per slab is 256.


# ad97af7e 08-Apr-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Merge from projects/counters: UMA_ZONE_PCPU zones.

These zones have slab size == sizeof(struct pcpu), but request from VM
enough pages to fit (uk_slabsize * mp_ncpus). An item allocated from such
a zone has a separate twin for each CPU in the system, and these twins
are at a distance of sizeof(struct pcpu) from each other. This magic
distance value will allow us to make some optimizations later.

To address a CPU's private item, simple arithmetic should be used:

item = (type *)((char *)base + sizeof(struct pcpu) * curcpu)

This arithmetic is available as the zpcpu_get() macro in pcpu.h.
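
For example (the zone is illustrative; the caller should prevent
migration, e.g. with critical_enter(), while using the pointer):

uint64_t *base, *mine;

base = uma_zalloc(foo_pcpu_zone, M_WAITOK);
mine = zpcpu_get(base);		/* this CPU's private twin */
(*mine)++;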

To introduce non-page-size slabs, a new field, uk_slabsize, has been
added to uma_keg. This shifted some frequently used fields of uma_keg to
the fourth cache line on amd64. To mitigate this pessimization, the
uma_keg fields were rearranged a bit, and the least frequently used
uk_name and uk_link moved down to the fourth cache line. All other
frequently dereferenced fields fit into the first three cache lines.

Sponsored by: Nginx, Inc.


# 89f6b863 08-Mar-2013 Attilio Rao <attilio@FreeBSD.org>

Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical, but a few notes are warranted:
* The KPI changes as follow:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
At this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI is heavily broken by this commit. Third-party ports must be
updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho


# a4915c21 26-Feb-2013 Attilio Rao <attilio@FreeBSD.org>

Merge from vmc-playground branch:
Replace the sub-optimal uma_zone_set_obj() primitive with the more
modern uma_zone_reserve_kva(). The new primitive reserves beforehand
the necessary KVA space to satisfy the zone allocations and allocates
pages with ALLOC_NOOBJ. More specifically:
- uma_zone_reserve_kva() does not need an object to back the
allocator.
- uma_zone_reserve_kva() can satisfy M_WAITOK requests, in order to
serve zones which need to do uma_prealloc() too.
- When possible, uma_zone_reserve_kva() directly uses the direct mapping
via uma_small_alloc() rather than relying on the KVA / offset
combination.

The removal of the object attribute allows 2 further changes:
1) _vm_object_allocate() becomes static within vm_object.c
2) VM_OBJECT_LOCK_INIT() is removed. This function is replaced by
direct calls to mtx_init() as there is no need to export it anymore
and the calls aren't either homogeneous anymore: there are now small
differences between arguments passed to mtx_init().

Sponsored by: EMC / Isilon storage division
Reviewed by: alc (who also offered almost all the comments)
Tested by: pho, jhb, davide


# 3caae6ca 29-Jan-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Fix typo in debug printf.


# 2f891cd5 07-Dec-2012 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Implemented uma_zone_set_warning(9) function that sets a warning, which
will be printed once the given zone becomes full and cannot allocate an
item. The warning will not be printed more often than every five minutes.

All UMA warnings can be globally turned off by setting sysctl/tunable
vm.zone_warnings to 0.
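
Typical usage (sketch, mirroring the mbuf cluster zone):

uma_zone_set_warning(zone_clust, "kern.ipc.nmbclusters limit reached");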

Discussed on: arch
Obtained from: WHEEL Systems
MFC after: 2 weeks


# bb196eb4 26-Oct-2012 Matthew D Fleming <mdf@FreeBSD.org>

Const-ify the zone name argument to uma_zcreate(9).

MFC after: 3 days


# 0b80c1e4 21-Oct-2012 Eitan Adler <eadler@FreeBSD.org>

Print flags as hex instead of an integer.

PR: kern/168210
Submitted by: linimon
Reviewed by: alc
Approved by: cperciva
MFC after: 3 days


# 2864dbbf 18-Sep-2012 Gleb Smirnoff <glebius@FreeBSD.org>

If caller specifies UMA_ZONE_OFFPAGE explicitly, then do not waste memory
in an allocation for a slab.

Reviewed by: jeff


# 42321809 26-Aug-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Fix function name in keg_cachespread_init() assert.


# c288b548 07-Jul-2012 Eitan Adler <eadler@FreeBSD.org>

Add missing sleep stat increase

PR: kern/168211
Submitted by: linimon
Reviewed by: alc
Approved by: cperciva
MFC after: 3 days


# 687c94aa 02-Jul-2012 John Baldwin <jhb@FreeBSD.org>

Honor db_pager_quit in 'show uma' and 'show malloc'.

MFC after: 1 month


# 251386b4 23-May-2012 Maksim Yevmenkin <emax@FreeBSD.org>

Tweak the condition for disabling allocation from per-CPU buckets in
low-memory situations. I've observed a situation where per-CPU
allocations were disabled while there were enough free cached pages.
Basically, cnt.v_free_count was sitting stably at a value lower
than cnt.v_free_min, and that caused a massive performance drop.

Reviewed by: alc
MFC after: 1 week


# 263811f7 27-Jan-2012 Kip Macy <kmacy@FreeBSD.org>

Exclude kmem_alloc'ed ARC data buffers from kernel minidumps on amd64.
Excluding other allocations, including UMA, now entails the addition of
a single flag to kmem_alloc or uma zone create.

Reviewed by: alc, avg
MFC after: 2 weeks


# 8d689e04 12-Oct-2011 Gleb Smirnoff <glebius@FreeBSD.org>

Make memguard(9) capable of guarding uma(9) allocations.


# 8cd02d00 22-May-2011 Alan Cox <alc@FreeBSD.org>

Correct an error in r222163. Unless UMA_MD_SMALL_ALLOC is defined,
startup_alloc() must be used until uma_startup2() is called.

Reported by: jh


# 342f1793 21-May-2011 Alan Cox <alc@FreeBSD.org>

1. Prior to r214782, UMA did not support multipage allocations before
uma_startup2() was called. Thus, setting the variable "booted" to true in
uma_startup() was ok on machines with UMA_MD_SMALL_ALLOC defined, because
any allocations made after uma_startup() but before uma_startup2() could be
satisfied by uma_small_alloc(). Now, however, some multipage allocations
are necessary before uma_startup2() just to allocate zone structures on
machines with a large number of processors. Thus, a Boolean can no longer
effectively describe the state of the UMA allocator. Instead, make "booted"
have three values to describe how far initialization has progressed. This
allows multipage allocations to continue using startup_alloc() until
uma_startup2(), but single-page allocations may begin using
uma_small_alloc() after uma_startup().

2. With the aforementioned change, only a modest increase in boot pages is
necessary to boot UMA on a large number of processors.

3. Retire UMA_MD_SMALL_ALLOC_NEEDS_VM. It has only been used between
r182028 and r204128.

Reviewed by: attilio [1], nwhitehorn [3]
Tested by: sbruno


# df1bc9de 20-May-2011 Alan Cox <alc@FreeBSD.org>

Eliminate a redundant #include. ("vm/vm_param.h" already includes
"machine/vmparam.h".)


# e4cd31dd 21-Mar-2011 Jeff Roberson <jeff@FreeBSD.org>

- Merge changes to the base system to support OFED. These include
a wider arg2 for sysctl, updates to vlan code, IFT_INFINIBAND,
and other miscellaneous small features.


# 00f0e671 26-Jan-2011 Matthew D Fleming <mdf@FreeBSD.org>

Explicitly wire the user buffer rather than doing it implicitly in
sbuf_new_for_sysctl(9). This allows using an sbuf with a SYSCTL_OUT
drain for extremely large amounts of data where the caller knows that
appropriate references are held, and sleeping is not an issue.

Inspired by: rwatson


# e9a069d8 04-Nov-2010 John Baldwin <jhb@FreeBSD.org>

Update startup_alloc() to support multi-page allocations and allow internal
zones whose objects are larger than a page to use startup_alloc(). This
allows allocation of zone objects during early boot on machines with a large
number of CPUs since the resulting zone objects are larger than a page.

Submitted by: trema
Reviewed by: attilio
MFC after: 1 week


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 20ed0cb0 19-Oct-2010 Matthew D Fleming <mdf@FreeBSD.org>

uma_zfree(zone, NULL) should do nothing, to match free(9).

Noticed by: Ron Steinke <rsteinke at isilon dot com>
MFC after: 3 days


# 1c6cae97 15-Oct-2010 Lawrence Stewart <lstewart@FreeBSD.org>

Change uma_zone_set_max to return the effective value of "nitems" after
rounding. The same value can also be obtained with uma_zone_get_max, but this
change avoids a caller having to make two back-to-back calls.
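
For example:

/*
 * Ask for 1000 items; the return value is the effective limit after
 * rounding up to fill whole slabs, possibly larger than 1000.
 */
effective = uma_zone_set_max(zone, 1000);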

Sponsored by: FreeBSD Foundation
Reviewed by: gnn, jhb


# c4ae7908 15-Oct-2010 Lawrence Stewart <lstewart@FreeBSD.org>

- Simplify implementation of uma_zone_get_max.
- Add uma_zone_get_cur which returns the current approximate occupancy of
a zone. This is useful for providing stats via sysctl amongst other things.

Sponsored by: FreeBSD Foundation
Reviewed by: gnn, jhb
MFC after: 2 weeks


# 4e657159 16-Sep-2010 Matthew D Fleming <mdf@FreeBSD.org>

Re-add r212370 now that the LOR in powerpc64 has been resolved:

Add a drain function for struct sysctl_req, and use it for a variety
of handlers, some of which had to do awkward things to get a large
enough SBUF_FIXEDLEN buffer.

Note that some sysctl handlers were explicitly outputting a trailing
NUL byte. This behaviour was preserved, though it should not be
necessary.

Reviewed by: phk (original patch)


# 404a593e 13-Sep-2010 Matthew D Fleming <mdf@FreeBSD.org>

Revert r212370, as it causes a LOR on powerpc. powerpc does a few
unexpected things in copyout(9) and so wiring the user buffer is not
sufficient to perform a copyout(9) while holding a random mutex.

Requested by: nwhitehorn


# dd67e210 09-Sep-2010 Matthew D Fleming <mdf@FreeBSD.org>

Add a drain function for struct sysctl_req, and use it for a variety of
handlers, some of which had to do awkward things to get a large enough
FIXEDLEN buffer.

Note that some sysctl handlers were explicitly outputting a trailing NUL
byte. This behaviour was preserved, though it should not be necessary.

Reviewed by: phk


# e49471b0 16-Aug-2010 Andre Oppermann <andre@FreeBSD.org>

Add uma_zone_get_max() to obtain the effective limit after a call
to uma_zone_set_max().

The UMA zone limit is not exactly set to the value supplied but
rounded up to completely fill the backing store increment (a page
normally). This can lead to surprising situations where the number
of elements allocated from UMA is higher than the supplied limit
value. The new get function reads back the effective value so that
the supplied limit value can be adjusted to the real limit.

Reviewed by: jeffr
MFC after: 1 week


# bf965959 15-Jun-2010 Sean Bruno <sbruno@FreeBSD.org>

Add a new column to the output of vmstat -z to indicate the number
of times the system was forced to sleep when requesting a new allocation.

Expand the debugger hook, db_show_uma, to display these results as well.

This has proven to be very useful in out of memory situations when
it is not known why systems have become sluggish or fail in odd ways.

Reviewed by: rwatson alc
Approved by: scottl (mentor) peter
Obtained from: Yahoo Inc.


# 3aa6d94e 11-Jun-2010 John Baldwin <jhb@FreeBSD.org>

Update several places that iterate over CPUs to use CPU_FOREACH().


# 451033a4 03-May-2010 Alan Cox <alc@FreeBSD.org>

It makes more sense for the object-based backend allocator to use OBJT_PHYS
objects instead of OBJT_DEFAULT objects because we never reclaim or pageout
the allocated pages. Moreover, they are mapped with pmap_qenter(), which
creates unmanaged mappings.

Reviewed by: kib


# 2965a453 29-Apr-2010 Kip Macy <kmacy@FreeBSD.org>

On Alan's advice, rather than do a wholesale conversion on a single
architecture from page queue lock to a hashed array of page locks
(based on a patch by Jeff Roberson), I've implemented page lock
support in the MI code and have only moved vm_page's hold_count
out from under page queue mutex to page lock. This changes
pmap_extract_and_hold on all pmaps.

Supported by: Bitgravity Inc.

Discussed with: alc, jeffr, and kib


# e2b36efd 29-Jan-2010 Antoine Brodin <antoine@FreeBSD.org>

MFC r201145 to stable/8:
(S)LIST_HEAD_INITIALIZER takes a (S)LIST_HEAD as an argument.
Fix some wrong usages.
Note: this does not affect generated binaries as this argument is not used.

PR: 137213
Submitted by: Eygene Ryabinkin (initial version)


# 13e403fd 28-Dec-2009 Antoine Brodin <antoine@FreeBSD.org>

(S)LIST_HEAD_INITIALIZER takes a (S)LIST_HEAD as an argument.
Fix some wrong usages.
Note: this does not affect generated binaries as this argument is not used.

PR: 137213
Submitted by: Eygene Ryabinkin (initial version)
MFC after: 1 month


# aea6e893 18-Jun-2009 Alan Cox <alc@FreeBSD.org>

Add support for UMA_SLAB_KERNEL to page_free(). (While I'm here remove an
unnecessary newline character from the end of two panic messages.)


# e20a199f 25-Jan-2009 Jeff Roberson <jeff@FreeBSD.org>

- Make the keg abstraction more complete. Permit a zone to have multiple
backend kegs so it may source compatible memory from multiple backends.
This is useful for cases such as NUMA or different layouts for the same
memory type.
- Provide a new api for adding new backend kegs to secondary zones.
- Provide a new flag for adjusting the layout of zones to stagger
allocations better across cache lines.

Sponsored by: Nokia


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 2f2ea10a 22-Aug-2008 Antoine Brodin <antoine@FreeBSD.org>

Remove unused variable nosleepwithlocks.

PR: 126609
Submitted by: Mateusz Guzik
MFC after: 1 month
X-MFC: to stable/7 only, this variable is still used in stable/6


# f620b5bf 22-Aug-2008 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

Allow the MD UMA allocator to use VM routines like kmem_*(). Existing code requires the MD allocator to be available early in the boot process, before the VM is fully available. This defines a new VM define (UMA_MD_SMALL_ALLOC_NEEDS_VM) that allows an MD UMA small allocator to become available at the same time as the default UMA allocator.

Approved by: marcel (mentor)


# 7630c265 04-Apr-2008 Alan Cox <alc@FreeBSD.org>

Reintroduce UMA_SLAB_KMAP; however, change its spelling to
UMA_SLAB_KERNEL for consistency with its sibling UMA_SLAB_KMEM.
(UMA_SLAB_KMAP met its original demise in revision 1.30 of
vm/uma_core.c.) UMA_SLAB_KERNEL is now required by the jumbo frame
allocators. Without it, UMA cannot correctly return pages from the
jumbo frame zones to the VM system because it resets the pages' object
field to NULL instead of the kernel object. In more detail, the jumbo
frame zones are created with the option UMA_ZONE_REFCNT. This causes
UMA to overwrite the pages' object field with the address of the slab.
However, when UMA wants to release these pages, it doesn't know how to
restore the object field, so it sets it to NULL. This change teaches
UMA how to reset the object field to the kernel object.

Crashes reported by: kris
Fix tested by: kris
Fix discussed with: jeff
MFC after: 6 weeks


# 71eb44c7 11-Oct-2007 John Baldwin <jhb@FreeBSD.org>

Allow recursion on the 'zones' internal UMA zone.

Submitted by: thompsa
MFC after: 1 week
Approved by: re (kensmith)
Discussed with: jeff


# 2feb50bf 31-May-2007 Attilio Rao <attilio@FreeBSD.org>

Revert the VMCNT_* operations introduction.
Probably, a general approach is not the best solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)


# 222d0195 18-May-2007 Jeff Roberson <jeff@FreeBSD.org>

- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts. This can be used to abstract away pcpu details but also changes
to use atomics for all counters now. This means sched lock is no longer
responsible for protecting counts in the switch routines.

Contributed by: Attilio Rao <attilio@FreeBSD.org>


# 1e319f6d 11-Feb-2007 Robert Watson <rwatson@FreeBSD.org>

Add a uma_set_align() interface, which will be called at most once during
boot by MD code to indicate the detected alignment preference. Rather than
cache alignment being encoded in UMA consumers by defining a global
alignment value of (16 - 1) in UMA_ALIGN_CACHE, UMA_ALIGN_CACHE is now
a special value (-1) that causes UMA to look at registered alignment. If
no preferred alignment has been selected by MD code, a default alignment
of (16 - 1) will be used.

Currently, no hardware platforms specify alignment; architecture
maintainers will need to modify MD startup code to specify an alignment
if desired. This must occur before initialization of UMA so that all UMA
zones pick up the requested alignment.

Reviewed by: jeff, alc
Submitted by: attilio


# 6c125b8d 24-Jan-2007 Mohan Srinivasan <mohans@FreeBSD.org>

Fix for problems that occur when all mbuf clusters migrate to the mbuf packet
zone. Cluster allocations fail when this happens. Also processes that may have
blocked on cluster allocations will never be woken up. Thanks to rwatson for
an overview of the issue and pointers to the mbuma paper and his tool to dump
out UMA zones.

Reviewed by: andre@


# 77380291 24-Jan-2007 Mohan Srinivasan <mohans@FreeBSD.org>

Fix for a bug where only one process (of multiple) blocked on
maxpages on a zone is woken up, with the rest never being woken up as
a result of the ZFLAG_FULL flag being cleared. Wake up all such blocked
processes instead. This change introduces a thundering herd, but since
this should be relatively infrequent, optimizing this (by introducing
a count of blocked processes, for example) may be premature.

Reviewed by: ups@


# 635fd505 10-Jan-2007 Robert Watson <rwatson@FreeBSD.org>

Remove uma_zalloc_arg() hack, which coerced M_WAITOK to M_NOWAIT when
allocations were made using improper flags in interrupt context.
Replace with a simple WITNESS warning call. This restores the
invariant that M_WAITOK allocations will always succeed or die
horribly trying, which is relied on by many UMA consumers.

MFC after: 3 weeks
Discussed with: jhb


# 663b416f 05-Jan-2007 John Baldwin <jhb@FreeBSD.org>

- Add a new function uma_zone_exhausted() to see if a zone is full.
- Add a printf in swp_pager_meta_build() to warn if the swapzone becomes
exhausted so that there's at least a warning before a box that runs out
of swapzone space before running out of swap space deadlocks.
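
The check is a one-liner (the message text is approximate):

if (uma_zone_exhausted(swap_zone))
	printf("WARNING: swap zone exhausted, increase kern.maxswzone\n");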

MFC after: 1 week
Reviewed by: alc


# ae4e9636 25-Oct-2006 Robert Watson <rwatson@FreeBSD.org>

Better align output of "show uma" by moving from displaying the basic
counters of allocs/frees/use for each zone to the same statistics
shown by userspace "vmstat -z".

MFC after: 3 days


# a0d4b0ae 17-Jul-2006 Robert Watson <rwatson@FreeBSD.org>

Fix build of uma_core.c when DDB is not compiled into the kernel by
making uma_zone_sumstat() ifdef DDB, as it's only used with DDB now.

Submitted by: Wolfram Fenske <Wolfram.Fenske at Student.Uni-Magdeburg.DE>


# eabadd9e 16-Jul-2006 Robert Watson <rwatson@FreeBSD.org>

Remove sysctl_vm_zone() and vm.zone sysctl from 7.x. As of 6.x,
libmemstat(3) is used by vmstat (and friends) to produce more accurate
and more detailed statistics information in a machine-readable way,
and vmstat continues to provide the same text-based front-end.

This change should not be MFC'd.


# 4f538c74 21-May-2006 Robert Watson <rwatson@FreeBSD.org>

When allocating a bucket to hold a free'd item in UMA fails, don't
report this as an allocation failure for the item type. The failure
will be separately recorded with the bucket type. This may eliminate
high mbuf allocation failure counts under some circumstances, which
can be alarming in appearance, but not actually a problem in
practice.

MFC after: 2 weeks
Reported by: ps, Peter J. Blok <pblok at bsd4all dot org>,
OxY <oxy at field dot hu>,
Gabor MICSKO <gmicskoa at szintezis dot hu>


# 082dc776 11-Feb-2006 Robert Watson <rwatson@FreeBSD.org>

Skip per-cpu caches associated with absent CPUs when generating a
memory statistics record stream via sysctl.

MFC after: 3 days


# ffaf2c55 27-Jan-2006 John Baldwin <jhb@FreeBSD.org>

Add a new macro wrapper WITNESS_CHECK() around the witness_warn() function.
The difference between WITNESS_CHECK() and WITNESS_WARN() is that
WITNESS_CHECK() should be used in the places that the return value of
witness_warn() is checked, whereas WITNESS_WARN() should be used in places
where the return value is ignored. Specifically, in a kernel without
WITNESS enabled, WITNESS_WARN() evaluates to an empty string whereas
WITNESS_CHECK evaluates to 0. I also updated the one place that was
checking the return value of WITNESS_WARN() to use WITNESS_CHECK.
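
The distinction in use (sketch):

/* Return value consumed: must compile to 0 without WITNESS. */
if (WITNESS_CHECK(WARN_GIANTOK | WARN_SLEEPOK, NULL,
    "uma_zalloc_arg: zone \"%s\"", zone->uz_name) != 0) {
	/* e.g. refuse to honor M_WAITOK semantics here */
}

/* Return value ignored: WITNESS_WARN() suffices. */
WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL, "malloc");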


# ca49f12f 06-Jan-2006 John Baldwin <jhb@FreeBSD.org>

Reduce the scope of one #ifdef to avoid duplicating a SYSCTL_INT() macro
and trim another unneeded #ifdef (it was just around a macro that is
already conditionally defined).


# 64a266f9 20-Oct-2005 Robert Watson <rwatson@FreeBSD.org>

Change format string for u_int64_t to %ju from %llu, in order to use the
correct format string on 64-bit systems.

Pointed out by: pjd


# 48c5777e 20-Oct-2005 Robert Watson <rwatson@FreeBSD.org>

Add a "show uma" command to DDB, which prints out the current stats for
available UMA zones. Quite useful for post-mortem debugging of memory
leaks without a dump device configured on a panicked box.

MFC after: 2 weeks


# 3803b26b 08-Oct-2005 Dag-Erling Smørgrav <des@FreeBSD.org>

As alc pointed out to me, vm_page.c 1.305 was incomplete: uma_startup()
still uses the constant UMA_BOOT_PAGES. Change it to accept boot_pages
as an additional argument.

MFC after: 2 weeks


# f353d338 09-Sep-2005 Alan Cox <alc@FreeBSD.org>

Introduce a new lock for the purpose of synchronizing access to the
UMA boot pages.

Disable recursion on the general UMA lock now that startup_alloc() no
longer uses it.

Eliminate the variable uma_boot_free. It serves no purpose.

Note: This change eliminates a lock-order reversal between a system
map mutex and the UMA lock. See
http://sources.zabbadoz.net/freebsd/lor.html#109 for details.

MFC after: 3 days


# cbbb4a00 24-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

Rename UMA_MAX_NAME to UTH_MAX_NAME, since it's a maximum in the
monitoring API, which might or might not be the same as the internal
maximum (currently none).

Export flag information on UMA zones -- in particular, whether or
not this is a secondary zone, and so the keg free count should be
considered in that light.

MFC after: 1 day


# f4ff923b 20-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

Further UMA statistics related changes:

- Add a new uma_zfree_internal() flag, ZFREE_STATFREE, which causes it to
to update the zone's uz_frees statistic. Previously, the statistic was
updated unconditionally.

- Use the flag in situations where a "real" free occurs: i.e., one where
the caller is freeing an allocated item, to be differentiated from
situations where uma_zfree_internal() is used to tear down the item
during slab teardown in order to invoke its fini() method. Also use
the flag when UMA is freeing its internal objects.

- When exchanging a bucket with the zone from the per-CPU cache when
freeing an item, flush cache statistics back to the zone (since the
zone lock and critical section are both held) to match the allocation
case.

MFC after: 3 days


# ab3a57c0 16-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

Use mp_maxid in preference to MAXCPU when creating exports of UMA
per-CPU cache statistics. UMA sizes the cache array based on the
number of CPUs at boot (mp_maxid + 1), and iterating based on MAXCPU
could read off the end of the array (into the next zone).

Reported by: yongari
MFC after: 1 week


# 08ecce74 16-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

Improve canonicalization of copyrights. Order copyrights by order of
assertion (jeff, bmilekic, rwatson).

Suggested ages ago by: bde
MFC after: 1 week


# 2450bbb8 16-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

Move the unlocking of the zone mutex in sysctl_vm_zone_stats() so that
it covers the following of the uc_alloc/freebucket cache pointers.
Originally, I felt that the race wasn't helped by holding the mutex,
hence a comment in the code and not holding it across the cache access.
However, it does improve consistency, as while it doesn't prevent
bucket exchange, it does prevent bucket pointer invalidation. So a
race in gathering cache free space statistics still can occur, but not
one that follows an invalid bucket pointer, if the mutex is held.

Submitted by: yongari
MFC after: 1 week


# 2018f30c 15-Jul-2005 Mike Silbersack <silby@FreeBSD.org>

Increase the flags field for kegs from a 16 to a 32 bit value;
we have exhausted all 16 flags.


# 2019094a 15-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

Track UMA(9) allocation failures by zone, and export via sysctl.

Requested by: victor cruceru <victor dot cruceru at gmail dot com>
MFC after: 1 week


# 7a52a97e 14-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

Introduce a new sysctl, vm.zone_stats, which exports UMA(9) allocator
statistics via a binary structure stream:

- Add structure 'uma_stream_header', which defines a stream version,
definition of MAXCPUs used in the stream, and the number of zone
records in the stream.

- Add structure 'uma_type_header', which defines the name, alignment,
size, resource allocation limits, current pages allocated, preferred
bucket size, and central zone + keg statistics.

- Add structure 'uma_percpu_stat', which, for each per-CPU cache,
includes the number of allocations and frees, as well as the number
of free items in the cache.

- When the sysctl is queried, return a stream header, followed by a
series of type descriptions, each consisting of a type header
followed by a series of MAXCPUs uma_percpu_stat structures holding
per-CPU allocation information. Typical values of MAXCPU will be
1 (UP compiled kernel) and 16 (SMP compiled kernel).

This query mechanism allows user space monitoring tools to extract
memory allocation statistics in a machine-readable form, and to do so
at a per-CPU granularity, allowing monitoring of allocation patterns
across CPUs in order to better understand the distribution of work and
memory flow over multiple CPUs.

While here, also export the number of UMA zones as a sysctl
vm.uma_count, in order to assist in sizing user space buffers to
receive the stream.

A follow-up commit of libmemstat(3), a library to monitor kernel memory
allocation, will occur in the next few days. This change directly
supports converting netstat(1)'s "-mb" mode to using UMA-sourced stats
rather than separately maintained mbuf allocator statistics.

MFC after: 1 week


# 773df9ab 14-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

In addition to tracking allocs in the zone, also track frees. Add
a zone free counter, as well as a cache free counter.

MFC after: 1 week


# 2c743d36 14-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

In an earlier world order, UMA would flush per-CPU statistics to the
zone whenever it was moving buckets between the zone and the cache,
or when coalescing statistics across the CPU. Remove flushing of
statistics to the zone when coalescing statistics as part of sysctl,
as we won't be running on the right CPU to write to the cache
statistics.

Add a missed gathering of statistics: when uma_zalloc_internal()
does a special case allocation of a single item, make sure to update
the zone statistics to represent this. Previously this case wasn't
accounted for in user-visible statistics.

MFC after: 1 week


# 5d1ae027 29-Apr-2005 Robert Watson <rwatson@FreeBSD.org>

Modify UMA to use critical sections to protect per-CPU caches, rather than
mutexes, which offers lower overhead on both UP and SMP. When allocating
from or freeing to the per-cpu cache, without INVARIANTS enabled, we now
no longer perform any mutex operations, which offers a 1%-3% performance
improvement in a variety of micro-benchmarks. We rely on critical
sections to prevent (a) preemption resulting in reentrant access to UMA on
a single CPU, and (b) migration of the thread during access. In the event
we need to go back to the zone for a new bucket, we release the critical
section to acquire the global zone mutex, and must re-acquire the critical
section and re-evaluate which cache we are accessing in case migration has
occurred, or circumstances have changed in the current cache.

Per-CPU cache statistics are now gathered lock-free by the sysctl, which
can result in small races in statistics reporting for caches.

Reviewed by: bmilekic, jeff (somewhat)
Tested by: rwatson, kris, gnn, scottl, mike at sentex dot net, others
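
A compressed sketch of the pattern this entry describes, loosely after
uma_zalloc_arg(); the cache and bucket field names follow UMA, but the
logic is simplified:

    critical_enter();
    cache = &zone->uz_cpu[curcpu];
    bucket = cache->uc_allocbucket;
    if (bucket != NULL && bucket->ub_cnt > 0) {
        item = bucket->ub_bucket[--bucket->ub_cnt];
        critical_exit();
        return (item);
    }
    /*
     * The per-CPU cache is empty.  Leave the critical section before
     * taking the zone mutex, then re-enter it and re-evaluate which
     * cache we are using: we may have migrated to another CPU.
     */
    critical_exit();
    ZONE_LOCK(zone);
    critical_enter();
    cache = &zone->uz_cpu[curcpu];
    /* ... refill the cache from the zone and retry ... */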


# b70458ae 23-Feb-2005 Alan Cox <alc@FreeBSD.org>

Revert the first part of revision 1.114 and modify the second part. On
architectures implementing uma_small_alloc() pages do not necessarily
belong to the kmem object.


# 8076cb52 16-Feb-2005 Bosko Milekic <bmilekic@FreeBSD.org>

Well, it seems that I prematurely removed the "All rights reserved"
statement from some files, so re-add it for the moment, until the
related legalese is sorted out. This change affects:

sys/kern/kern_mbuf.c
sys/vm/memguard.c
sys/vm/memguard.h
sys/vm/uma.h
sys/vm/uma_core.c
sys/vm/uma_dbg.c
sys/vm/uma_dbg.h
sys/vm/uma_int.h


# 500f29d0 16-Feb-2005 Bosko Milekic <bmilekic@FreeBSD.org>

Make UMA set the overloaded page->object back to kmem_object for
UMA_ZONE_REFCNT and UMA_ZONE_MALLOC zones, as the page(s) undoubtedly
came from kmem_map for those two. Previously it would set it back
to NULL for UMA_ZONE_REFCNT zones and although this was probably not
fatal, it added MORE code for no reason.


# c5c1b16e 10-Jan-2005 Bosko Milekic <bmilekic@FreeBSD.org>

While we want the recursion protection for the bucket zones so that
recursion from the VM is handled (and the calling code that allocates
buckets knows how to deal with it), we do not want to prevent allocation
from the slab header zones (slabzone and slabrefzone) if uk_recurse is
not zero for them. The reason is that it could lead to NULL being
returned for the slab header allocations even in the M_WAITOK
case, and the caller can't handle that (this is also explained in a
comment with this commit).

The problem analysis is documented in our mailing lists:
http://docs.freebsd.org/cgi/getmsg.cgi?fetch=153445+0+archive/2004/freebsd-current/20041231.freebsd-current

(see entire thread for proper context).

Crash dump data provided by: Peter Holm <peter@holm.cc>


# 1e183df2 10-Jan-2005 Stefan Farfeleder <stefanf@FreeBSD.org>

ISO C requires at least one element in an initialiser list.


# 60727d8b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# 7b871205 25-Dec-2004 Bosko Milekic <bmilekic@FreeBSD.org>

Add my copyright and update Jeff's copyright on UMA source files,
as per his request.

Discussed with: Jeffrey Roberson


# dc2c7965 06-Nov-2004 Robert Watson <rwatson@FreeBSD.org>

Abstract the logic to look up the uma_bucket_zone given a desired
number of entries into bucket_zone_lookup(), which helps make more
clear the logic of consumers of bucket zones.

Annotate the behavior of bucket_init() with a comment indicating
how the various data structures, including the bucket lookup tables,
are initialized.
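
A sketch of what such a lookup can look like; the table contents and
bucket sizes are illustrative:

    struct uma_bucket_zone {
        uma_zone_t  ubz_zone;
        const char  *ubz_name;
        int         ubz_entries;    /* bucket capacity */
    };

    static struct uma_bucket_zone bucket_zones[] = {
        { NULL, "16 Bucket", 16 },
        { NULL, "32 Bucket", 32 },
        { NULL, "128 Bucket", 128 },
        { NULL, NULL, 0 }
    };

    static struct uma_bucket_zone *
    bucket_zone_lookup(int entries)
    {
        struct uma_bucket_zone *ubz;

        /* Pick the smallest bucket zone that fits the request. */
        for (ubz = bucket_zones; ubz->ubz_entries != 0; ubz++)
            if (ubz->ubz_entries >= entries)
                return (ubz);
        return (ubz - 1);   /* fall back to the largest */
    }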


# f9d27e75 06-Nov-2004 Robert Watson <rwatson@FreeBSD.org>

Annotate what bucket_size[] array does; staticize since it's used only
in uma_core.c.


# a5a262c6 27-Oct-2004 Bosko Milekic <bmilekic@FreeBSD.org>

Fix an INVARIANTS-only bug introduced in Revision 1.104:

If INVARIANTS is defined, and in the rare case that we have
allocated some objects from the slab and at least one initializer
on at least one of those objects failed, and we need to fail the
allocation and push the uninitialized items back into the slab
caches -- in that scenario, we would fail to [re]set the
bucket cache's ub_bucket item references to NULL, which would
eventually trigger a KASSERT.


# 55fc8c11 09-Oct-2004 Brian Feldman <green@FreeBSD.org>

In the previous revision, I did not intend to change the default value
of "nosleepwithlocks."

Submitted by: ru


# ab14a3f7 08-Oct-2004 Brian Feldman <green@FreeBSD.org>

Fix critical stability problems that can cause UMA mbuf cluster
state management corruption, mbuf leaks, general mbuf corruption,
and at least on i386 a first level splash damage radius that
encompasses up to about half a megabyte of the memory after
an mbuf cluster's allocation slab. In short, this has caused
instability nightmares anywhere the right kind of network traffic
is present.

When the polymorphic refcount slabs were added to UMA, the new types
were not used pervasively. In particular, the slab management
structure was turned into one for refcounts, and one for non-refcounts
(supposed to be mostly like the old slab management structure),
but the latter was almost always used throughout. In general, every
access to zones with UMA_ZONE_REFCNT turned on corrupted the
"next free" slab offset and the refcount with each other and
with other allocations (on i386, 2 mbuf clusters per 4096 byte slab).

Fix things so that the right type is used to access refcounted zones
where it was not before. There are, it seems, additional errors of
gross overestimation of padding that would cause large kegs
(née zones) to be allocated when small ones would do. Unless I have
analyzed this incorrectly, it is not directly harmful.


# 3659f747 06-Aug-2004 Robert Watson <rwatson@FreeBSD.org>

Generate KTR trace records for uma_zalloc_arg() and uma_zfree_arg().
This doesn't trace every event of interest in UMA, but provides
enough basic information to explain lock traces and sleep patterns.


# b23f72e9 01-Aug-2004 Brian Feldman <green@FreeBSD.org>

* Add a "how" argument to uma_zone constructors and initialization functions
so that they know whether the allocation is supposed to be able to sleep
or not.
* Allow uma_zone constructors and initialization functions to return either
success or error. Almost all of the ones in the tree currently return
success unconditionally, but mbuf is a notable exception: the packet
zone constructor wants to be able to fail if it cannot suballocate an
mbuf cluster, and the mbuf allocators want to be able to fail in general
in a MAC kernel if the MAC mbuf initializer fails. This fixes the
panics people are seeing when they run out of memory for mbuf clusters.
* Allow debug.nosleepwithlocks on WITNESS to be disabled, without changing
the default.

Both bmilekic and jeff have reviewed the changes made to make failable
zone allocations work.
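
A sketch of a failable constructor under the new interface, modeled on
the packet-zone case described above; the function name and details
are illustrative:

    static int
    mb_ctor_pack(void *mem, int size, void *arg, int how)
    {
        struct mbuf *m = mem;

        /* Suballocate the cluster; with M_NOWAIT this may fail. */
        m->m_ext.ext_buf = uma_zalloc(zone_clust, how);
        if (m->m_ext.ext_buf == NULL)
            return (ENOMEM);
        /* ... remaining mbuf/cluster initialization ... */
        return (0);
    }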


# 244f4554 29-Jul-2004 Bosko Milekic <bmilekic@FreeBSD.org>

Rework the way slab header storage space is calculated in UMA.

- zone_large_init() stays pretty much the same.
- zone_small_init() will try to stash the slab header in the slab page
being allocated if the amount of calculated wasted space is less
than UMA_MAX_WASTE (for both the UMA_ZONE_REFCNT case and regular
case). If the amount of wasted space is >= UMA_MAX_WASTE, then
UMA_ZONE_OFFPAGE will be set and the slab header will be allocated
separately for better use of space.
- uma_startup() calculates the maximum ipers required in offpage slabs
(so that the offpage slab header zone(s) can be sized accordingly).
The algorithm used to calculate this replaces the old calculation
(which only happened to work coincidentally). We now iterate over
possible object sizes, starting from the smallest one, until we
determine that wastedspace calculated in zone_small_init() might
end up being greater than UMA_MAX_WASTE, at which point we use the
found object size to compute the maximum possible ipers. The
reason this works is because:
- wastedspace versus objectsize is a see-saw function with
local minima all equal to zero and local maxima growing
directly proportional to objectsize. This implies that
for objects up to or equal a certain objectsize, the see-saw
remains entirely below UMA_MAX_WASTE, so for those objectsizes
it is impossible to ever go OFFPAGE for slab headers.
- ipers (items-per-slab) versus objectsize is an inversely
proportional function which falls off very quickly (very large
for small objectsizes).
- To determine the maximum ipers we'll ever need from OFFPAGE
slab headers we first find the largest objectsize for which
we are guaranteed not to go offpage, and use it to compute
ipers (as though we were offpage). Since the only objectsizes
allowed to go offpage are bigger than the found objectsize,
and since ipers vs objectsize is inversely proportional (and
monotonically decreasing), then we are guaranteed that the
ipers computed is always >= what we will ever need in offpage
slab headers.
- Define UMA_FRITM_SZ and UMA_FRITMREF_SZ to be the actual (possibly
padded) size of each freelist index so that offset calculations are
fixed.

This might fix weird data corruption problems and certainly allows
ARM to now boot to at least single-user (via simulator).

Tested on i386 UP by me.
Tested on sparc64 SMP by fenner.
Tested on ARM simulator to single-user by cognet.
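
In code form, the per-zone quantities discussed above reduce to
roughly the following sketch; the real zone_small_init() also handles
the UMA_ZONE_REFCNT layout and alignment rounding:

    /* Space left for items when the slab header stays in the page. */
    memused = UMA_SLAB_SIZE - sizeof(struct uma_slab);
    ipers = memused / (objsize + UMA_FRITM_SZ);     /* items per slab */
    wastedspace = memused - ipers * (objsize + UMA_FRITM_SZ);
    if (wastedspace >= UMA_MAX_WASTE) {
        /* Keep the whole page for items; header goes off-page. */
        zone->uz_flags |= UMA_ZFLAG_OFFPAGE;
        ipers = UMA_SLAB_SIZE / objsize;
    }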


# 5285558a 22-Jul-2004 Alan Cox <alc@FreeBSD.org>

- Change uma_zone_set_obj() to call kmem_alloc_nofault() instead of
kmem_alloc_pageable(). The difference between these is that an errant
memory access to the zone will be detected sooner with
kmem_alloc_nofault().

The following changes serve to eliminate the following lock-order
reversal reported by witness:

1st 0xc1a3c084 vm object (vm object) @ vm/swap_pager.c:1311
2nd 0xc07acb00 swap_pager swhash (swap_pager swhash) @ vm/swap_pager.c:1797
3rd 0xc1804bdc vm object (vm object) @ vm/uma_core.c:931

There is no potential deadlock in this case. However, witness is unable
to recognize this because vm objects used by UMA have the same type as
ordinary vm objects. To remedy this, we make the following changes:

- Add a mutex type argument to VM_OBJECT_LOCK_INIT().
- Use the mutex type argument to assign distinct types to special
vm objects such as the kernel object, kmem object, and UMA objects.
- Define a static swap zone object for use by UMA. (Only static
objects are assigned a special mutex type.)


# 0c3c862e 19-Jul-2004 Brian Feldman <green@FreeBSD.org>

Since breakage of malloc(9)/uma_zalloc(9) is totally non-optional in
GENERIC/for WITNESS users, make sure the sysctl to disable the behavior
is read-only and always enabled.


# 0d0837ee 04-Jul-2004 Bosko Milekic <bmilekic@FreeBSD.org>

Introduce debug.nosleepwithlocks sysctl, 0 by default. If set to 1
and WITNESS is not built, then force all M_WAITOK allocations to
M_NOWAIT behavior (transparently). This is to be used temporarily
if weird deadlocks are reported because we still have code paths
that perform M_WAITOK allocations with lock(s) held, which can
lead to deadlock. If WITNESS is compiled, then the sysctl is ignored
and we ask witness to tell us whether we have locks held, converting
to M_NOWAIT behavior only if it tells us that we do.

Note this removes the previous mbuf.h inclusion as well (only needed
by last revision), and cleans up unneeded [artificial] comparisons
to just the mbuf zones. The problem described above has nothing to
do with previous mbuf wait behavior; it is a general problem.


# 7a708c36 04-Jul-2004 Brian Feldman <green@FreeBSD.org>

Reextend the M_WAITOK-disabling-hack to all three of the mbuf-related
zones, and do it by direct comparison of uma_zone_t instead of strcmp.

The mbuf subsystem used to provide M_TRYWAIT/M_DONTWAIT semantics, but
this is mostly no longer the case. M_WAITOK has taken over the spot
M_TRYWAIT used to have, and for mbuf things, still may return NULL if
the code path is incorrectly holding a mutex going into mbuf allocation
functions.

The M_WAITOK/M_NOWAIT semantics are absolute; though it may deadlock
the system to try to malloc or uma_zalloc something with a mutex held
and M_WAITOK specified, it is absolutely required to not return NULL
and will result in instability and/or security breaches otherwise.
There is still room to add the WITNESS_WARN() to all cases so that
we are notified of the possibility of deadlocks, but it cannot change
the value of the "badness" variable and allow allocation to actually
fail except for the specialized cases which used to be M_TRYWAIT.


# cf107c1d 03-Jul-2004 Brian Feldman <green@FreeBSD.org>

Limit mbuma damage. Suddenly ALL allocations with M_WAITOK are subject
to failing -- that is, allocations via malloc(M_WAITOK) that are required
to never fail -- if WITNESS is not defined. While everyone should be
running WITNESS, in any case, zone "Mbuf" allocations are really the only
ones that should be screwed with by this hack.

This hack is crashing people, and would continue to do so with or without
WITNESS. Things shouldn't be allocating with M_WAITOK with locks held,
but it's not okay just to always remove M_WAITOK when !WITNESS.

Reported by: Bernd Walter <ticso@cicely5.cicely.de>


# cc822cb5 23-Jun-2004 Bosko Milekic <bmilekic@FreeBSD.org>

Make uma_mtx MTX_RECURSE. Here's why:

The general UMA lock is a recursion-allowed lock because
there is a code path where, while we're still configured
to use startup_alloc() for backend page allocations, we
may end up in uma_reclaim() which calls zone_foreach(zone_drain),
which grabs uma_mtx, only to later call into startup_alloc()
because while freeing we needed to allocate a bucket. Since
startup_alloc() also takes uma_mtx, we need to be able to
recurse on it.

This exact explanation also added as comment above mtx_init().

Trace showing recursion reported by: Peter Holm <peter-at-holm.cc>
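
The change itself amounts to one flag at initialization time; a sketch
(the lock name string is illustrative):

    /*
     * uma_mtx may be reentered: zone_foreach(zone_drain) holds it, and
     * freeing may need a bucket, which can fall back to startup_alloc(),
     * which takes uma_mtx again.
     */
    mtx_init(&uma_mtx, "UMA lock", NULL, MTX_DEF | MTX_RECURSE);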


# 7fd87882 09-Jun-2004 Bosko Milekic <bmilekic@FreeBSD.org>

Backout previous change, I think Julian has a better solution which
does not require type-stable refcnts here.


# e66468ea 09-Jun-2004 Bosko Milekic <bmilekic@FreeBSD.org>

Make the slabrefzone, the zone from which we allocated slabs with
internal reference counters, UMA_ZONE_NOFREE. This way, those slabs
(with their ref counts) will be effectively type-stable, then using
a trick like this on the refcount is no longer dangerous:

    MEXT_REM_REF(m);
    if (atomic_cmpset_int(m->m_ext.ref_cnt, 0, 1)) {
        if (m->m_ext.ext_type == EXT_PACKET) {
            uma_zfree(zone_pack, m);
            return;
        } else if (m->m_ext.ext_type == EXT_CLUSTER) {
            uma_zfree(zone_clust, m->m_ext.ext_buf);
            m->m_ext.ext_buf = NULL;
        } else {
            (*(m->m_ext.ext_free))(m->m_ext.ext_buf,
                m->m_ext.ext_args);
            if (m->m_ext.ext_type != EXT_EXTREF)
                free(m->m_ext.ref_cnt, M_MBUF);
        }
    }
    uma_zfree(zone_mbuf, m);

Previously, a second thread hitting the above cmpset might
actually read the refcnt AFTER it has already been freed. A very
rare occurrence. Now we'll know that it won't be freed, though.

Spotted by: julian, pjd


# 099a0e58 31-May-2004 Bosko Milekic <bmilekic@FreeBSD.org>

Bring in mbuma to replace mballoc.

mbuma is an Mbuf & Cluster allocator built on top of a number of
extensions to the UMA framework, all included herein.

Extensions to UMA worth noting:
- Better layering between slab <-> zone caches; introduce
Keg structure which splits off slab cache away from the
zone structure and allows multiple zones to be stacked
on top of a single Keg (single type of slab cache);
perhaps we should look into defining a subset API on
top of the Keg for special use by malloc(9),
for example.
- UMA_ZONE_REFCNT zones can now be added, and reference
counters automagically allocated for them within the end
of the associated slab structures. uma_find_refcnt()
does a kextract to fetch the slab struct reference from
the underlying page, and lookup the corresponding refcnt.

mbuma things worth noting:
- integrates mbuf & cluster allocations with extended UMA
and provides caches for commonly-allocated items; defines
several zones (two primary, one secondary) and two kegs.
- change up certain code paths that always used to do:
m_get() + m_clget() to instead just use m_getcl() and
try to take advantage of the newly defined secondary
Packet zone.
- netstat(1) and systat(1) quickly hacked up to do basic
stat reporting but additional stats work needs to be
done once some other details within UMA have been taken
care of and it becomes clearer to how stats will work
within the modified framework.

From the user perspective, one implication is that the
NMBCLUSTERS compile-time option is no longer used. The
maximum number of clusters is still capped off according
to maxusers, but it can be made unlimited by setting
the kern.ipc.nmbclusters boot-time tunable to zero.
Work should be done to write an appropriate sysctl
handler allowing dynamic tuning of kern.ipc.nmbclusters
at runtime.

Additional things worth noting/known issues (READ):
- One report of 'ips' (ServeRAID) driver acting really
slow in conjunction with mbuma. Need more data.
Latest report is that ips is equally sucking with
and without mbuma.
- Giant leak in NFS code sometimes occurs, can't
reproduce but currently analyzing; brueffer is
able to reproduce but THIS IS NOT an mbuma-specific
problem and currently occurs even WITHOUT mbuma.
- Issues in network locking: there is at least one
code path in the rip code where one or more locks
are acquired and we end up in m_prepend() with
M_WAITOK, which causes WITNESS to whine from within
UMA. Current temporary solution: force all UMA
allocations to be M_NOWAIT from within UMA for now
to avoid deadlocks unless WITNESS is defined and we
can determine with certainty that we're not holding
any locks when we're M_WAITOK.
- I've seen at least one weird socketbuffer empty-but-
mbuf-still-attached panic. I don't believe this
to be related to mbuma but please keep your eyes
open, turn on debugging, and capture crash dumps.

This change removes more code than it adds.

A paper is available detailing the change and considering
various performance issues, it was presented at BSDCan2004:
http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
Please read the paper for Future Work and implementation
details, as well as credits.

Testing and Debugging:
rwatson,
brueffer,
Ketrien I. Saihr-Kesenchedra,
...
Reviewed by: Lots of people (for different parts)


# 5d328ed4 09-Mar-2004 Alan Cox <alc@FreeBSD.org>

- Make the acquisition of Giant in vm_fault_unwire() conditional on the
pmap. For the kernel pmap, Giant is not required. In general, for
other pmaps, Giant is required by i386's pmap_pte() implementation.
Specifically, the use of PMAP2/PADDR2 is synchronized by Giant.
Note: In principle, updates to the kernel pmap's wired count could be
lost without Giant. However, in practice, we never use the kernel
pmap's wired count. This will be resolved when pmap locking appears.
- With the above change, cpu_thread_clean() and uma_large_free() need
not acquire Giant. (The first case is simply the revival of
i386/i386/vm_machdep.c's revision 1.226 by peter.)


# a3c07611 07-Mar-2004 Robert Watson <rwatson@FreeBSD.org>

Mark uma_callout as CALLOUT_MPSAFE, as uma_timeout can run MPSAFE.

Reviewed by: jeff


# aaa8bb16 31-Jan-2004 Jeff Roberson <jeff@FreeBSD.org>

- Fix a problem where we did not drain the cache of buckets in the zone
when uma_reclaim() was called. This was introduced when the zone
working-set algorithm was removed in favor of using the per cpu caches
as the working set.


# e726bc0e 30-Jan-2004 Dag-Erling Smørgrav <des@FreeBSD.org>

Mechanical whitespace cleanup.


# b6c71225 03-Dec-2003 John Baldwin <jhb@FreeBSD.org>

Fix all users of mp_maxid to use the same semantics, namely:

1) mp_maxid is a valid FreeBSD CPU ID in the range 0 .. MAXCPU - 1.
2) For all active CPUs in the system, PCPU_GET(cpuid) <= mp_maxid.

Approved by: re (scottl)
Tested on: i386, amd64, alpha
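
Under these semantics, a per-CPU walk looks like the following sketch,
with CPU_ABSENT() as the companion check for holes in the CPU ID space:

    int cpu;

    for (cpu = 0; cpu <= mp_maxid; cpu++) {
        if (CPU_ABSENT(cpu))
            continue;
        /* ... touch per-CPU state for "cpu" ... */
    }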


# e30b97c5 30-Nov-2003 Jeff Roberson <jeff@FreeBSD.org>

- Unbreak UP. mp_maxid is not defined on uni-processor machines, although
I believe it and the other MP variables should be. For now, just define it
here and wait for jhb to clean it up later.

Approved by: re (rwatson)


# 504d5de3 30-Nov-2003 Jeff Roberson <jeff@FreeBSD.org>

- Replace the local maxcpu with mp_maxid. Previously, if mp_maxid
was equal to MAXCPU, we would overrun the pcpu_mtx array because maxcpu
was calculated incorrectly.
- Add some more debugging code so that memory leaks at the time of
uma_zdestroy() are more easily diagnosed.

Approved by: re (rwatson)


# d1f42ac2 14-Nov-2003 Alan Cox <alc@FreeBSD.org>

- Remove use of Giant from uma_zone_set_obj().


# 009b6fcb 21-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- Fix MD_SMALL_ALLOC on architectures that support it. Define a new alloc
function, startup_alloc(), that is used for single page allocations prior
to the VM starting up. If it is used after the VM startups up, it
replaces the zone's allocf pointer with either page_alloc() or
uma_small_alloc() where appropriate.

Pointy hat to: me
Tested by: phk/amd64, me/x86
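
A sketch of the handoff described above; the "booted" test and the
allocf signature are simplified assumptions:

    static void *
    startup_alloc(uma_zone_t zone, int bytes, u_int8_t *pflag, int wait)
    {
        /*
         * Once the VM is running, install the permanent allocator
         * and satisfy this request through it.
         */
        if (booted) {
    #ifdef UMA_MD_SMALL_ALLOC
            zone->uz_allocf = uma_small_alloc;
    #else
            zone->uz_allocf = page_alloc;
    #endif
            return (zone->uz_allocf(zone, bytes, pflag, wait));
        }
        /* ... otherwise hand out a page from the boot-page pool ... */
        return (NULL);
    }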


# c43ab0b5 20-Sep-2003 Peter Wemm <peter@FreeBSD.org>

Bad Jeffr! No cookie!

Temporarily disable the UMA_MD_SMALL_ALLOC stuff since recent commits
break sparc64, amd64, ia64 and alpha. It appears only i386 and maybe
powerpc were not broken.


# 9643769a 19-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- Remove the working-set algorithm. Instead, use the per cpu buckets as the
working set cache. This has several advantages. Firstly, we never touch
the per cpu queues now in the timeout handler. This removes one more
reason for having per cpu locks. Secondly, it reduces the size of the zone
by 8 bytes, bringing it under 200 bytes for a single proc x86 box. This
tidies up other logic as well.
- The 'destroy' flag no longer needs to be passed to zone_drain() since it
always frees everything in the zone's slabs.
- cache_drain() is now only called from zone_dtor() and so it destroys by
default. It also does not need the destroy parameter now.


# 3e0cab95 19-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- Remove the cache colorization code. We can't use it due to all of the
broken consumers of the malloc interface who assume that the allocated
address will be an even multiple of the size.
- Remove disabled time delay code on uma_reclaim(). The comment there said
it all. It was not an effective strategy and it should not be left in
#if 0'd for all eternity.


# 64f051e9 19-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- There is an endless stream of style(9) errors in this file. Fix a few.
Also catch some spelling errors.


# 44eca34a 19-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- Don't inspect the zone in page_alloc(). It may be NULL.
- Don't cache more items than the zone would like in uma_zalloc_bucket().


# 45bf76f0 19-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- Move the logic for dealing with the uma_boot_pages cache into the
page_alloc() function from the slab_zalloc() function. This allows us
to unconditionally call uz_allocf().
- In page_alloc(), clean up the boot_pages logic some. Previously memory from
this cache that was not used by the time the system started was left in
the cache and never used. Typically this wasn't more than a few pages,
but now we will use this cache so long as memory is available.


# b60f5b79 19-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- Fix the silly flag situation in UMA. Remove redundant ZFLAG/ZONE flags
by accepting the user supplied flags directly. Previously this was not
done so that flags for the same field would not be defined in two
different files. Add comments in each header instructing future
developers on how not to shoot their feet.
- Fix a test for !OFFPAGE which should have been a test for HASH. This would
have caused a panic if we had ever destructed a malloc zone. This also
opens up the possibility that other zones could use the vsetobj() method
rather than a hash.


# 961647df 19-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- Don't abuse M_DEVBUF, define a tag for UMA hashes.


# b983089a 19-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- Eliminate a pair of unnecessary variables.


# cae33c14 19-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- Initialize a pool of bucket zones so that we waste less space on zones that
don't cache as many items.
- Introduce the bucket_alloc(), bucket_free() functions to wrap bucket
allocation. These functions select the appropriate bucket zone to
allocate from or free to.
- Rename ub_ptr to ub_cnt to reflect a change in its use. ub_cnt now reflects
the count of free items in the bucket. This gets rid of many unnatural
subtractions by 1 throughout the code.
- Add ub_entries which reflects the number of entries possibly held in a
bucket.


# 1c35e213 20-Aug-2003 Bosko Milekic <bmilekic@FreeBSD.org>

In sysctl_vm_zone, do not calculate per-cpu cache stats on
UMA_ZFLAG_INTERNAL zones at all. Apparently, Wilko's alpha
was crashing while entering multi-user because, I think, we
were calculating the garbage cachefree for pcpu caches that
essentially don't exist for at least the 'zones' zone and it so
happened that we were reading from an unmapped location.

Confirmed to fix crash: wilko
Helped debug: wilko, gallatin


# 20e8e865 11-Aug-2003 Bosko Milekic <bmilekic@FreeBSD.org>

- When deciding whether to init the zone with small_init or large_init,
compare the zone element size (+1 for the byte of linkage) against
UMA_SLAB_SIZE - sizeof(struct uma_slab), and not just UMA_SLAB_SIZE.
Add a KASSERT in zone_small_init to make sure that the computed
ipers (items per slab) for the zone is not zero, despite the addition
of the check, just to be sure (this part submitted by: silby)

- UMA_ZONE_VM used to imply BUCKETCACHE. Now it implies
CACHEONLY instead. CACHEONLY is like BUCKETCACHE in the
case of bucket allocations, but in addition to that also ensures that
we don't setup the zone with OFFPAGE slab headers allocated from the
slabzone. This means that we're not allowed to have a UMA_ZONE_VM
zone initialized for large items (zone_large_init) because it would
require the slab headers to be allocated from slabzone, and hence
kmem_map. Some of the zones init'd with UMA_ZONE_VM are so init'd
before kmem_map is suballoc'd from kernel_map, which is why this
change is necessary.


# b245ac95 03-Aug-2003 Alan Cox <alc@FreeBSD.org>

Revise obj_alloc(). Most notably, use the object's lock to prevent two
concurrent invocations from acquiring the same address(es). Also, in case
of an incomplete allocation, free any allocated pages.

In collaboration with: tegge


# 48bf8725 02-Aug-2003 Bosko Milekic <bmilekic@FreeBSD.org>

When INVARIANTS is on and we're in uma_zfree_arg(), we need to make
sure that uma_dbg_free() is called if we're about to call
uma_zfree_internal() but we're asking it to skip the dtor and
uma_dbg_free() call itself. So, if we're about to call
uma_zfree_internal() from uma_zfree_arg() and skip == 1, call
uma_dbg_free() ourselves.


# 174ab450 01-Aug-2003 Bosko Milekic <bmilekic@FreeBSD.org>

Only free the pcpu cache buckets if they are non-NULL.

Crashed this person's machine: harti
Pointy-hat to: me


# d56368d7 30-Jul-2003 Bosko Milekic <bmilekic@FreeBSD.org>

Plug a race and a leak in UMA.

1) The race has to do with zone destruction. From the zone destructor we
would lock the zone, set the working set size to 0, then unlock the zone,
drain it, and then free the structure. Within the window following the
working-set-size set to 0 and unlocking of the zone and the point where
in zone_drain we re-acquire the zone lock, the uma timer routine could
have fired off and changed the working set size to something non-zero,
thereby potentially preventing us from completely freeing slabs before
destroying the zone (and thus leaking them).

2) The leak has to do with zone destruction as well. When destroying a
zone we would take care to free all the buckets cached in the zone, but
although we would drain the pcpu cache buckets, we would not free them.
This resulted in leaking a couple of bucket structures (512 bytes each)
per cpu on SMP during zone destruction.

While I'm here, also silence GCC warnings by turning uma_slab_alloc()
from inline to real function. It's too big to be an inline.

Reviewed by: JeffR


# a40fdcb4 30-Jul-2003 Bosko Milekic <bmilekic@FreeBSD.org>

When generating the zone stats make sure to handle the master zone
("UMA Zone") carefully, because it does not have pcpu caches allocated
at all. In the UP case, we did not catch this because one pcpu cache
is always allocated with the zone, but for the MP case, we were getting
bogus stats for this zone.

Tested by: Lukas Ertl <le@univie.ac.at>


# 7b4bd98a 30-Jul-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Remove the disabling of buckets workaround.

Thanks to: jeffr


# f828e5be 29-Jul-2003 Jeff Roberson <jeff@FreeBSD.org>

- Get rid of the ill-conceived uz_cachefree member of uma_zone.
- In sysctl_vm_zone use the per cpu locks to read the current cache
statistics; this makes them more accurate while under heavy load.

Submitted by: tegge


# d11e0ba5 29-Jul-2003 Jeff Roberson <jeff@FreeBSD.org>

- Check to see if we need a slab prior to allocating one. Failure to do
so not only wastes memory but it can also cause a leak in zones that
will be destroyed later. The problem is that the slab allocation code
places newly created slabs on the partially allocated list because it
assumes that the caller will actually allocate some memory from it.
Failure to do so places an otherwise free slab on the partial slab list
where we won't find it later in zone_drain().

Continuously prodded to fix by: phk (Thanks)


# 0c32d97a 29-Jul-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Temporary workaround: Always disable buckets, there is a bug there
somewhere.

JeffR will look at this as soon as he has time.

OK'ed by: jeffr


# 234c7726 27-Jul-2003 Alan Cox <alc@FreeBSD.org>

None of the "alloc" functions used by UMA assume that Giant is held any
longer. (If they still need it, e.g., contigmalloc(), they acquire it
themselves.) Therefore, we need not acquire Giant in slab_zalloc().


# 0c1a133f 25-Jul-2003 Alan Cox <alc@FreeBSD.org>

Gulp ... call kmem_malloc() without Giant.


# 8522511b 18-Jul-2003 Hartmut Brandt <harti@FreeBSD.org>

When INVARIANTS is defined make sure that uma_zalloc_arg (and hence
uma_zalloc) is called with exactly one of either M_WAITOK or M_NOWAIT and
that it is called with neither M_TRYWAIT or M_DONTWAIT. Print a warning
if anything is wrong. Default to M_WAITOK if no flag is given. This is the
same test as in malloc(9).
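
A sketch of the check, mirroring malloc(9)'s; the message text is
illustrative:

    #ifdef INVARIANTS
    {
        int w = flags & (M_WAITOK | M_NOWAIT);

        if (w != M_WAITOK && w != M_NOWAIT) {
            printf("uma_zalloc_arg: pass exactly one of "
                "M_WAITOK or M_NOWAIT\n");
            if (w == 0)
                flags |= M_WAITOK;  /* documented default */
        }
    }
    #endif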


# d88797c2 25-Jun-2003 Bosko Milekic <bmilekic@FreeBSD.org>

Move the pcpu lock out of the uma_cache and instead have a single set
of pcpu locks. This makes uma_zone somewhat smaller (by (LOCKNAME_LEN *
sizeof(char) + sizeof(struct mtx) * maxcpu) bytes, to be exact).

No Objections from jeff.


# 5c133dfa 25-Jun-2003 Bosko Milekic <bmilekic@FreeBSD.org>

Make sure that the zone destructor doesn't get called twice in
certain free paths.


# 874651b1 11-Jun-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().


# c1f5a182 09-Jun-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Revert last commit, I have no idea what happened.


# 47f94c12 09-Jun-2003 Poul-Henning Kamp <phk@FreeBSD.org>

A white-space nit I noticed.


# 82774d80 28-Apr-2003 Alan Cox <alc@FreeBSD.org>

uma_zone_set_obj() must perform VM_OBJECT_LOCK_INIT() if the caller
provides storage for the vm_object.


# 5103186c 25-Apr-2003 Alan Cox <alc@FreeBSD.org>

Remove an XXX comment. It is no longer a problem.


# 410cfc45 18-Apr-2003 Alan Cox <alc@FreeBSD.org>

Lock the vm_object in obj_alloc().


# b37d8ead 18-Apr-2003 Andrew Gallatin <gallatin@FreeBSD.org>

Don't grab Giant in slab_zalloc() if M_NOWAIT is specified. This
should allow the use of INTR_MPSAFE network drivers.

Tested by: njl
Glanced at by: jeff


# 125ee0d1 26-Mar-2003 Tor Egge <tegge@FreeBSD.org>

Obtain Giant before calling kmem_alloc without M_NOWAIT and before calling
kmem_free if Giant isn't already held.


# 26306795 04-Mar-2003 John Baldwin <jhb@FreeBSD.org>

Replace calls to WITNESS_SLEEP() and witness_list() with equivalent calls
to WITNESS_WARN().


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 886eaaac 04-Feb-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Change a printf to also tell how many items were left in the zone.


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# ebc85edf 19-Jan-2003 Jeff Roberson <jeff@FreeBSD.org>

- M_WAITOK is 0 and not a real flag. Test for this properly.

Submitted by: tmm
Pointy hat to: jeff


# 9d5abbdd 01-Jan-2003 Jens Schweikhardt <schweikh@FreeBSD.org>

Correct typos, mostly s/ a / an / where appropriate. Some whitespace cleanup,
especially in troff files.


# 74c924b5 18-Nov-2002 Jeff Roberson <jeff@FreeBSD.org>

- Wakeup the correct address when a zone is no longer full.

Spotted by: jake


# f3da1873 16-Nov-2002 Jeff Roberson <jeff@FreeBSD.org>

- Don't forget the flags value when using boot pages.

Reported by: grehan


# 81f71eda 11-Nov-2002 Matt Jacob <mjacob@FreeBSD.org>

atomic_set_8 isn't MI. Instead, follow Jake's suggestions about
ZONE_LOCK.


# 48eea375 31-Oct-2002 Jeff Roberson <jeff@FreeBSD.org>

- Add support for machine dependant page allocation routines. MD code
may define UMA_MD_SMALL_ALLOC to make use of this feature.

Reviewed by: peter, jake


# bbee39c6 24-Oct-2002 Jeff Roberson <jeff@FreeBSD.org>

- Now that uma_zalloc_internal is not the fast path don't be so fussy about
extra function calls. Refactor uma_zalloc_internal into separate functions
for finding the most appropriate slab, filling buckets, allocating single
items, and pulling items off of slabs. This makes the code significantly
cleaner.
- This also fixes the "Returning an empty bucket." panic that a few people
have seen.

Tested On: alpha, x86


# bba739ab 24-Oct-2002 Jeff Roberson <jeff@FreeBSD.org>

- Move the destructor calls so that they are not called with the zone lock
held. This avoids a lock order reversal when destroying zones.
Unfortunately, this also means that the free checks are not done before
the destructor is called.

Reported by: phk


# 37c84183 28-Sep-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Be consistent about "static" functions: if the function is marked
static in its prototype, mark it static at the definition too.

Inspired by: FlexeLint warning #512


# f461cf22 19-Sep-2002 Jeff Roberson <jeff@FreeBSD.org>

- Use my freebsd email alias in the copyright.
- Remove redundant instances of my email alias in the file summary.


# 99571dc3 18-Sep-2002 Jeff Roberson <jeff@FreeBSD.org>

- Split UMA_ZFLAG_OFFPAGE into UMA_ZFLAG_OFFPAGE and UMA_ZFLAG_HASH.
- Remove all instances of the mallochash.
- Stash the slab pointer in the vm page's object pointer when allocating from
the kmem_obj.
- Use the overloaded object pointer to find slabs for malloced memory.
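
A sketch of the overloaded-pointer trick; the helper names here are
illustrative:

    /* On allocation: stash the slab pointer in the page's object field. */
    static __inline void
    vsetslab(vm_offset_t va, uma_slab_t slab)
    {
        vm_page_t p;

        p = PHYS_TO_VM_PAGE(pmap_kextract(va));
        p->object = (vm_object_t)slab;
    }

    /* On free: recover the slab backing a malloc'd address. */
    static __inline uma_slab_t
    vtoslab(vm_offset_t va)
    {
        vm_page_t p;

        p = PHYS_TO_VM_PAGE(pmap_kextract(va));
        return ((uma_slab_t)p->object);
    }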


# 55f7c614 21-Aug-2002 Archie Cobbs <archie@FreeBSD.org>

Don't use "NULL" when "0" is really meant.


# 17b9cc49 05-Jul-2002 Jeff Roberson <jeff@FreeBSD.org>

Fix a lock order reversal in uma_zdestroy. The uma_mtx needs to be held across
calls to zone_drain().

Noticed by: scottl


# f5118d6a 04-Jul-2002 Jeff Roberson <jeff@FreeBSD.org>

Remove unnecessary includes.


# e221e841 02-Jul-2002 Jeff Roberson <jeff@FreeBSD.org>

Actually use the fini callback.

Pointy hat to: me :-(
Noticed By: Julian


# 5c0e403b 25-Jun-2002 Jeff Roberson <jeff@FreeBSD.org>

Reduce the amount of code that runs with the zone lock held in slab_zalloc().
This allows us to run the zone initialization functions without any locks held.


# 3370c5bf 19-Jun-2002 Jeff Roberson <jeff@FreeBSD.org>

- Remove bogus use of kmem_alloc that was inherited from the old zone
allocator.
- Properly set M_ZERO when talking to the back end page allocators for
non malloc zones. This forces us to zero fill pages when they are first
brought into a cache.
- Properly handle M_ZERO in uma_zalloc_internal. This fixes a problem where
per cpu buckets weren't always getting zeroed.


# 4741dcbf 17-Jun-2002 Jeff Roberson <jeff@FreeBSD.org>

Honor the BUCKETCACHE flag on free as well.


# 18aa2de5 17-Jun-2002 Jeff Roberson <jeff@FreeBSD.org>

- Introduce the new M_NOVM option which tells uma to only check the currently
allocated slabs and bucket caches for free items. It will not go ask the vm
for pages. This differs from M_NOWAIT in that it not only doesn't block, it
doesn't even ask.

- Add a new zcreate option ZONE_VM, that sets the BUCKETCACHE zflag. This
tells uma that it should only allocate buckets out of the bucket cache, and
not from the VM. It does this by using the M_NOVM option to zalloc when
getting a new bucket. This is so that the VM doesn't recursively enter
itself while trying to allocate buckets for vm_map_entry zones. If there
are already allocated buckets when we get here we'll still use them but
otherwise we'll skip it.

- Use the ZONE_VM flag on vm map entries and pv entries on x86.
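
A sketch of how the bucket-refill path can honor this; the flag
plumbing and the uma_zalloc_internal() arguments are simplified:

    int bflags;

    bflags = M_NOWAIT;
    if (zone->uz_flags & UMA_ZFLAG_BUCKETCACHE)
        bflags |= M_NOVM;   /* cached slabs only; never call the VM */
    bucket = uma_zalloc_internal(bucketzone, NULL, bflags);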


# f97d6ce3 09-Jun-2002 Ian Dowse <iedowse@FreeBSD.org>

Correct the logic for determining whether the per-CPU locks need
to be destroyed. This fixes a problem where destroying a UMA zone
would fail to destroy all zone mutexes.

Reviewed by: jeff


# 494273be 03-Jun-2002 Jeff Roberson <jeff@FreeBSD.org>

Add a comment describing a resource leak that occurs during a failure case
in obj_alloc.


# 4c1cc01c 20-May-2002 John Baldwin <jhb@FreeBSD.org>

In uma_zalloc_arg(), if we are performing a M_WAITOK allocation, ensure
that td_intr_nesting_level is 0 (like malloc() does). Since malloc() calls
uma we can probably remove the check in malloc() for this now. Also,
perform an extra witness check in that case to make sure we don't hold
any locks when performing a M_WAITOK allocation.


# 713deb36 12-May-2002 Jeff Roberson <jeff@FreeBSD.org>

Don't call the uz free function while the zone lock is held. This can lead
to lock order reversals. uma_reclaim now builds a list of freeable slabs and
then unlocks the zones to do all of the frees.


# 0aef6126 12-May-2002 Jeff Roberson <jeff@FreeBSD.org>

Remove the hash_free() lock order reversal. This could have happened for
several reasons before. Fixing it involved restructuring the generic hash
code to require calling code to handle locking, unlocking, and freeing hashes
on error conditions.


# c7173f58 04-May-2002 Jeff Roberson <jeff@FreeBSD.org>

Use pages instead of uz_maxpages, which has not been initialized yet, when
creating the vm_object. This was broken after the code was rearranged to
grab giant itself.

Spotted by: alc


# b9ba8931 02-May-2002 Jeff Roberson <jeff@FreeBSD.org>

Move around the dbg code a bit so it's always under a lock. This stops a
weird potential race if we were preempted right as we were doing the dbg
checks.


# c3bdc05f 02-May-2002 Andrew R. Reiter <arr@FreeBSD.org>

- Changed the size element of uma_zctor_args to be size_t instead of int.
- Changed uma_zcreate to accept the size argument as a size_t instead of
int.

Approved by: jeff


# 5a34a9f0 02-May-2002 Jeff Roberson <jeff@FreeBSD.org>

malloc/free(9) no longer require Giant. Use the malloc_mtx to protect the
mallochash. Mallochash is going to go away as soon as I introduce the
kfree/kmalloc api and partially overhaul the malloc wrapper. This can't happen
until all users of the malloc api that expect memory to be aligned on the size
of the allocation are fixed.


# 639c9550 01-May-2002 Jeff Roberson <jeff@FreeBSD.org>

Remove the temporary alignment check in free().

Implement the following checks on freed memory in the bucket path:
- Slab membership
- Alignment
- Duplicate free

This previously was only done if we skipped the buckets. This code will slow
down INVARIANTS a bit, but it is SMP safe. The checks were moved out of the
normal path and into hooks supplied in uma_dbg.


# 2cc35ff9 29-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Move the implementation of M_ZERO into UMA so that it can be passed to
uma_zalloc and friends. Remove this functionality from the malloc wrapper.

Document this change in uma.h and adjust variable names in uma_core.
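
Callers can then request zeroed items directly from the zone, the same
way malloc(9) accepts the flag:

    item = uma_zalloc(zone, M_WAITOK | M_ZERO);  /* memory is zeroed */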


# 28bc4419 29-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Add a new zone flag UMA_ZONE_MTXCLASS. This puts the zone in its own
mutex class. Currently this is only used for kmapentzone because kmapents
are potentially allocated when freeing memory. This is not dangerous
though because no other allocations will be done while holding the
kmapentzone lock.


# d4d6aee5 25-Apr-2002 Andrew R. Reiter <arr@FreeBSD.org>

- Fix a round down bogon in uma_zone_set_max().

Submitted by: jeff@


# 5300d9dd 14-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Fix a witness warning when expanding a hash table. We were allocating the new
hash while holding the lock on a zone. Fix this by doing the allocation
separately from the actual hash expansion.

The lock is dropped before the allocation and reacquired before the expansion.
The expansion code checks to see if we lost the race and frees the new hash
if we do. We really never will lose this race because the hash expansion is
single threaded via the timeout mechanism.
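
A sketch of the drop-allocate-recheck pattern; the helper semantics
are simplified from the commit text:

    ZONE_UNLOCK(zone);
    hash_alloc(&newhash);       /* may sleep; zone lock not held */
    ZONE_LOCK(zone);
    if (hash_expand(&zone->uz_hash, &newhash)) {
        /* We won (the usual case); the zone now uses newhash. */
    } else {
        /* Lost the (theoretical) race; discard our copy. */
        ZONE_UNLOCK(zone);
        hash_free(&newhash);
        ZONE_LOCK(zone);
    }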


# 0da47b2f 13-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Protect the initial list traversal in sysctl_vm_zone() with the uma_mtx.


# af7f9b97 13-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Fix the calculation that determines uz_maxpages. It was off for large zones.
Fortunately we have no large zones with maximums specified yet, so it wasn't
breaking anything.

Implement blocking when a zone exceeds the maximum and M_WAITOK is specified.
Previously this just failed like the old zone allocator did. The old zone
allocator didn't support WAITOK/NOWAIT though so we should do what we
advertise.

While I was in there I cleaned up some more zalloc logic to further simplify
that code path and reduce redundant code. This was needed to make the blocking
work properly anyway.
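
A sketch of the blocking path; the flag and wait-message names are
illustrative:

    ZONE_LOCK(zone);
    while (zone->uz_maxpages != 0 && zone->uz_pages >= zone->uz_maxpages) {
        if ((flags & M_WAITOK) == 0) {
            ZONE_UNLOCK(zone);
            return (NULL);  /* M_NOWAIT keeps the old behavior */
        }
        zone->uz_flags |= UMA_ZFLAG_FULL;
        msleep(zone, &zone->uz_lock, PVM, "zonelimit", 0);
        /* The free path does wakeup(zone) when pages are returned. */
    }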


# bce97791 09-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Remember to unlock the zone if the fill count is too high.

Pointed out by: pete, jake, jhb


# 86bbae32 08-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Add a mechanism to disable buckets when the v_free_count drops below
v_free_min. This should help performance in memory starved situations.


# 605cbd6a 07-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Don't release the zone lock until after the dtor has been called. As far as I
can tell this could not have caused any problems yet because UMA is still
called with Giant.

Pointy hat to: jeff
Noticed by: jake


# 9c2cd7e5 07-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Implement uma_zdestroy(). Its prototype changed slightly. I decided that I
didn't like the wait argument and that if you were removing a zone it had
better be empty.

Also, I broke out part of hash_expand and made a separate hash_free() for use
in uma_zdestroy.


# a553d4b8 07-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Rework most of the bucket allocation and free code so that per cpu locks are
never held across blocking operations. Also, fix two other lock order
reversals that were exposed by jhb's witness change.

The free path previously had a bug that would cause it to skip the free bucket
list in some cases and go straight to allocating a new bucket. This has been
fixed as well.

These changes made the bucket handling code much cleaner and removed quite a
few lock operations. This should be marginally faster now.

It is now possible to call malloc w/o Giant and avoid any witness warnings.
This still isn't entirely safe though because malloc_type statistics are not
protected by any lock.


# d0b06acb 07-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

This fixes a bug where isitem never got set to 1 if a certain chain of events
relating to extremely low memory situations occurred. This was only ever seen on
the port build cluster, so many thanks to kris for helping me debug this.

Tested by: kris


# 6008862b 04-Apr-2002 John Baldwin <jhb@FreeBSD.org>

Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on: i386, alpha, sparc64
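
After this change the initializer carries a separate type string;
roughly as follows, with the type strings shown being illustrative:

    /* Most callers pass NULL: the name doubles as the type. */
    mtx_init(&sc->sc_mtx, "mydriver softc", NULL, MTX_DEF);

    /* UMA zone locks share one type so witness can group them. */
    mtx_init(&zone->uz_lock, zone->uz_name, "UMA zone", MTX_DEF);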


# 157d7b35 02-Apr-2002 Alfred Perlstein <alfred@FreeBSD.org>

fix comment typo, s/neccisary/necessary/g


# f4af24d5 24-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

Reset the cachefree statistics after draining the cache. This fixes a bug
where a sysctl within 20 seconds of a cache_drain could yield negative "USED"
counts.

Also, grab the uma_mtx while in the sysctl handler. This hadn't caused
problems yet because Giant is held all the time.

Reported by: kkenn


# 736ee590 19-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

Add uma_zone_set_max() to add enforced limits to non vm obj backed zones.


# 8355f576 19-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

This is the first part of the new kernel memory allocator. This replaces
malloc(9) and vm_zone with a slab like allocator.

Reviewed by: arch@