History log of /freebsd-current/sys/vm/uma_core.c
Revision Date Author Comments
# d25ed650 26-May-2024 Bojan Novković <bnovkov@FreeBSD.org>

uma: Fix improper uses of UMA_MD_SMALL_ALLOC

UMA_MD_SMALL_ALLOC was recently replaced by UMA_USE_DMAP, but
da76d349b6b1 missed some improper uses of the old symbol.
This change makes sure that UMA_USE_DMAP is used properly in
code that selects uma_small_alloc.

Fixes: da76d349b6b1
Reported by: eduardo, rlibby
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D45368


# 0a44b8a5 03-May-2024 Bojan Novković <bnovkov@FreeBSD.org>

vm: Simplify startup page dumping conditional

This commit introduces the MINIDUMP_STARTUP_PAGE_TRACKING symbol and
uses it to simplify several instances of a complex preprocessor conditional
for adding pages allocated when bootstrapping the kernel to minidumps.

Reviewed by: markj, mhorne
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D45085


# da76d349 03-May-2024 Bojan Novković <bnovkov@FreeBSD.org>

uma: Deduplicate uma_small_alloc

This commit refactors the UMA small alloc code and
removes most UMA machine-dependent code.
The existing machine-dependent uma_small_alloc code is almost identical
across all architectures, except for powerpc where using the direct
map addresses involved extra steps in some cases.

The MI/MD split was replaced by a default uma_small_alloc
implementation that can be overridden by architecture-specific code by
defining the UMA_MD_SMALL_ALLOC symbol. Furthermore, UMA_USE_DMAP was
introduced to replace most UMA_MD_SMALL_ALLOC uses.

Reviewed by: markj, kib
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D45084


# a03c2393 09-Nov-2023 Alexander Motin <mav@FreeBSD.org>

uma: Improve memory modified after free panic messages

- Pass zone pointer to trash_ctor() and report zone name in the panic
message. It may be difficult to figure out the zone just by the item size.
- Do not pass user arguments to internal trash calls; pass the zone.
- Report malloc type name in the same unified panic message.
- Report corruption offset from the beginning of the items instead of
the full pointer. It makes panic message shorter and more readable.


# 87090f5e 13-Oct-2023 Olivier Certner <olce.freebsd@certner.fr>

uma: New check_align_mask(): Validate alignments (INVARIANTS)

New function check_align_mask() asserts (under INVARIANTS) that the mask
fits in a (signed) integer (see the comment) and that the corresponding
alignment is a power of two.

Use check_align_mask() in uma_set_align_mask() and also in uma_zcreate()
to replace the KASSERT() there (that was checking only for a power of
2).

Reviewed by: kib, markj
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D42263


# 3d8f548b 13-Oct-2023 Olivier Certner <olce.freebsd@certner.fr>

uma: Make the cache alignment mask unsigned

In uma_set_align_mask(), ensure that the passed value doesn't have its
highest bit set, which would lead to problems since keg/zone alignment
is internally stored as signed integers. Such big values do not make
sense anyway and indicate some programming error. A future commit will
introduce checks for this case and other ones.

Reviewed by: kib, markj
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D42262


# e557eafe 13-Oct-2023 Olivier Certner <olce.freebsd@certner.fr>

uma: UMA_ALIGN_CACHE: Resolve the proper value at use point

Having a special value of -1 that is resolved internally to
'uma_align_cache' provides no significant advantages and prevents
changing that variable to an unsigned type, which is natural for an
alignment mask. So suppress it and replace its use with a call to
uma_get_align_mask(). The small overhead of the added function call is
irrelevant since UMA_ALIGN_CACHE is only used when creating new zones,
which is not performance critical.

Reviewed by: markj, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D42259


# dc8f7692 13-Oct-2023 Olivier Certner <olce.freebsd@certner.fr>

uma: Hide 'uma_align_cache'; Create/rename accessors

Create the uma_get_cache_align_mask() accessor and put it in a separate
private header so as to minimize namespace pollution in header/source
files that need only this function and not the whole 'uma.h' header.

Make sure the accessors have '_mask' as a suffix, so that callers are
aware that the real alignment is the power of two that is the mask plus
one. Rename the stem to something more explicit. Rename
uma_set_cache_align_mask()'s single parameter to 'mask'.

Hide 'uma_align_cache' to ensure that it cannot be set in any other way
than by a call to uma_set_cache_align_mask(), which will perform sanity
checks in a further commit. While here, rename it to
'uma_cache_align_mask'.

This is also in preparation for some further changes, such as improving
the sanity checks, eliminating internal resolving of UMA_ALIGN_CACHE and
changing the type of the 'uma_cache_align_mask' variable.

Reviewed by: markj, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D42258


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 4d846d26 10-May-2023 Warner Losh <imp@FreeBSD.org>

spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD

The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.

Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix


# 2dba2288 19-Oct-2022 Mark Johnston <markj@FreeBSD.org>

uma: Never pass cache zones to memguard

Items allocated from cache zones cannot usefully be protected by
memguard.

PR: 267151
Reported and tested by: pho
MFC after: 1 week


# f49fd63a 22-Sep-2022 John Baldwin <jhb@FreeBSD.org>

kmem_malloc/free: Use void * instead of vm_offset_t for kernel pointers.

Reviewed by: kib, markj
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D36549


# b9fd884a 12-Aug-2022 Colin Percival <cperciva@FreeBSD.org>

sys/vm: Add TSLOG to some functions

The functions pbuf_init, kva_alloc, and keg_alloc_slab are significant
contributors to the kernel boot time when FreeBSD boots inside the
Firecracker VMM. Instrument them so they show up on flamecharts.


# c84c5e00 18-Jul-2022 Mitchell Horne <mhorne@FreeBSD.org>

ddb: annotate some commands with DB_CMD_MEMSAFE

This is not completely exhaustive, but covers a large majority of
commands in the tree.

Reviewed by: markj
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D35583


# 31508912 13-Jul-2022 Mark Johnston <markj@FreeBSD.org>

uma: Apply a missed piece of review feedback from D35738

Fixes: 93cd28ea82bb ("uma: Use a taskqueue to execute uma_timeout()")


# 93cd28ea 11-Jul-2022 Mark Johnston <markj@FreeBSD.org>

uma: Use a taskqueue to execute uma_timeout()

uma_timeout() has several responsibilities; it visits every UMA zone and,
as of recently, drains underutilized caches, so it is rather expensive
(>1ms in some cases). Currently it is executed by softclock threads
and so will preempt most other CPU activity. None of this work requires
a high scheduling priority, though, so defer it to a taskqueue so as to
avoid stalling higher-priority work.

Reviewed by: rlibby, alc, mav, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35738


# a932a5a6 20-Jun-2022 Mark Johnston <markj@FreeBSD.org>

uma: Mark zeroed slabs as initialized for KMSAN

Otherwise zone initializers can produce false positives, e.g., when
lock_init() attempts to detect double initialization.

Sponsored by: The FreeBSD Foundation


# a7e1a585 08-Apr-2022 John Baldwin <jhb@FreeBSD.org>

uma_zfree_smr: uz_flags is only used if NUMA is defined.


# d53927b0 30-Mar-2022 Mark Johnston <markj@FreeBSD.org>

uma: Don't allow a limit to be set in a warm zone

The limit accounting in UMA does not tolerate this.

MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# 54361f90 30-Mar-2022 Mark Johnston <markj@FreeBSD.org>

uma: Use the correct type for a return value

zone_alloc_bucket() returns a pointer, not a bool.

MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# 490b09f2 07-Mar-2022 Eric van Gyzen <vangyzen@FreeBSD.org>

uma_zalloc_domain: call uma_zalloc_debug in multi-domain path

It was only called in the non-NUMA and single-domain paths.
Some of its assertions were duplicated in uma_zalloc_domain,
but some things were missed, especially memguard.

Reviewed by: markj, rstone
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D34472


# a8cbb835 04-Mar-2022 Eric van Gyzen <vangyzen@FreeBSD.org>

uma_zalloc: assert M_NOWAIT ^ M_WAITOK

The uma_zalloc functions expect exactly one of [M_NOWAIT, M_WAITOK].
If neither or both are passed, print an error and a stack dump.
Only do this ten times, to prevent livelock. In the future, after
this exposes enough bad callers, this will be changed to a KASSERT().
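
A rough sketch of the described check (illustrative only; names such as
uma_flag_reports are hypothetical, not the commit's actual diff):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/malloc.h>
    #include <sys/kdb.h>
    #include <machine/atomic.h>

    static int uma_flag_reports;        /* hypothetical rate limiter */

    static void
    uma_zalloc_check_flags(int flags)
    {
            int w;

            w = flags & (M_NOWAIT | M_WAITOK);
            /* Exactly one of M_NOWAIT and M_WAITOK must be set. */
            if (__predict_false(w == 0 || w == (M_NOWAIT | M_WAITOK))) {
                    /* Report at most ten offenders to prevent livelock. */
                    if (atomic_fetchadd_int(&uma_flag_reports, 1) < 10) {
                            printf("uma_zalloc: pass exactly one of "
                                "M_NOWAIT and M_WAITOK\n");
                            kdb_backtrace();
                    }
            }
    }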

Reviewed by: rstone, markj
MFC after: 1 month
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D34452


# 389a3fa6 15-Feb-2022 Mark Johnston <markj@FreeBSD.org>

uma: Add UMA_ZONE_UNMANAGED

Allow a zone to opt out of cache size management. In particular,
uma_reclaim() and uma_reclaim_domain() will not reclaim any memory from
the zone, nor will uma_timeout() purge cached items if the zone is idle.
This effectively means that the zone consumer has control over when
items are reclaimed from the cache. In particular, uma_zone_reclaim()
will still reclaim cached items from an unmanaged zone.
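
A hedged usage sketch (the zone name and item type are illustrative):

    #include <vm/uma.h>

    static uma_zone_t myobj_zone;       /* hypothetical consumer zone */

    /* Opt out of automatic trimming by uma_reclaim()/uma_timeout(). */
    myobj_zone = uma_zcreate("myobj", sizeof(struct myobj), NULL, NULL,
        NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_UNMANAGED);

    /* Later, at a moment the consumer chooses, release cached items. */
    uma_zone_reclaim(myobj_zone, UMA_RECLAIM_DRAIN);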

Reviewed by: hselasky, kib
MFC after: 3 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34142


# a04ce833 14-Jan-2022 Mark Johnston <markj@FreeBSD.org>

uma: Avoid polling for an invalid SMR sequence number

Buckets in an SMR-enabled zone can legitimately be tagged with
SMR_SEQ_INVALID. This effectively means that the zone destructor (if
any) was invoked on all items in the bucket, and the contained memory is
safe to reuse. If the first bucket in the full bucket list was tagged
this way, UMA would unnecessarily poll per-CPU state before attempting
to fetch a full bucket from the list.

MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# c25a30e2 05-Jan-2022 Konstantin Belousov <kib@FreeBSD.org>

Dump page tracking no longer needed on mips

Reviewed by: imp
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D33763


# 841e0a87 30-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

uma: with KTR trace allocs/frees from SMR zones


# 28782f73 30-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

uma: with KTR report item being freed in uma_zfree_arg()


# 2cb67bd7 05-Dec-2021 Gleb Smirnoff <glebius@FreeBSD.org>

uma: remove unused *item argument from cache_free()

Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D33272


# 7585c5db 01-Nov-2021 Mark Johnston <markj@FreeBSD.org>

uma: Fix handling of reserves in zone_import()

Kegs with no items reserved have uk_reserve = 0. So the check
keg->uk_reserve >= dom->ud_free_items will be true once all slabs are
depleted. Then, rather than go and allocate a fresh slab, we return to
the cache layer.

The intent was to do this only when the keg actually has a reserve, so
modify the check to verify this first. Another approach would be to
make uk_reserve signed and set it to -1 until uma_zone_reserve() is
called, but this requires a few casts elsewhere.
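
The corrected condition, roughly (field names as in uma_int.h; logic
paraphrased from the description above, not the actual diff):

    /*
     * Stop importing once a configured reserve would be dipped into;
     * kegs with uk_reserve == 0 keep allocating fresh slabs as before.
     */
    if (keg->uk_reserve > 0 && dom->ud_free_items <= keg->uk_reserve)
            break;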

Fixes: 1b2dcc8c54a8 ("uma: Avoid depleting keg reserves when filling a bucket")
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32516


# fab343a7 01-Nov-2021 Mark Johnston <markj@FreeBSD.org>

uma: Improve M_USE_RESERVE handling in keg_fetch_slab()

M_USE_RESERVE is used in a couple of places in the VM to avoid unbounded
recursion when the direct map is not available, as is the case on 32-bit
platforms or when certain kernel sanitizers (KASAN and KMSAN) are
enabled. For example, to allocate KVA, the kernel might allocate a
kernel map entry, which might require a new slab, which requires KVA.

For these zones, we use uma_prealloc() to populate a reserve of items,
and then in certain serialized contexts M_USE_RESERVE can be used to
guarantee a successful allocation. uma_prealloc() allocates the
requested number of items, distributing them evenly among NUMA domains.
Thus, in a first-touch zone, to satisfy an M_USE_RESERVE allocation we
might have to check the slab lists of other domains than the current one
to provide the semantics expected by consumers.

So, try harder to find an item if M_USE_RESERVE is specified and the keg
doesn't have anything for current (first-touch) domain. Specifically,
fall back to a round-robin slab allocation. This change fixes boot-time
panics on NUMA systems with KASAN or KMSAN enabled.[1]

Alternately we could have uma_prealloc() allocate the requested number
of items for each domain, but for some existing consumers this would be
quite wasteful. In general I think keg_fetch_slab() should try harder
to find free slabs in other domains before trying to allocate fresh
ones, but let's limit this to M_USE_RESERVE for now.

Also fix a separate problem that I noticed: in a non-round-robin slab
allocation with M_WAITOK, rather than sleeping after a failed slab
allocation we simply try again. Call vm_wait_domain() before retrying.

Reported by: mjg, tuexen [1]
Reviewed by: alc
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32515


# a9d6f1fe 19-Oct-2021 Mark Johnston <markj@FreeBSD.org>

Remove some remaining references to VM_ALLOC_NOOBJ

Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32037


# 84c39222 19-Oct-2021 Mark Johnston <markj@FreeBSD.org>

Convert consumers to vm_page_alloc_noobj_contig()

Remove now-unneeded page zeroing. No functional change intended.

Reviewed by: alc, hselasky, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32006


# a4667e09 19-Oct-2021 Mark Johnston <markj@FreeBSD.org>

Convert vm_page_alloc() callers to use vm_page_alloc_noobj().

Remove page zeroing code from consumers and stop specifying
VM_ALLOC_NOOBJ. In a few places, also convert an allocation loop to
simply use VM_ALLOC_WAITOK.

Similarly, convert vm_page_alloc_domain() callers.

Note that callers are now responsible for assigning the pindex.

Reviewed by: alc, hselasky, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31986


# d6e77cda 16-Sep-2021 Mark Johnston <markj@FreeBSD.org>

uma: Show the count of free slabs in each per-domain keg's sysctl tree

This is useful for measuring the number of pages that could be freed
from a NOFREE zone under memory pressure.

MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# 10094910 10-Aug-2021 Mark Johnston <markj@FreeBSD.org>

uma: Add KMSAN hooks

For now, just hook the allocation path: upon allocation, items are
marked as initialized (absent M_ZERO). Some zones are exempted from
this when it would otherwise raise false positives.

Use kmsan_orig() to update the origin map for UMA and malloc(9)
allocations. This allows KMSAN to print the return address when an
uninitialized UMA item is implicated in a report. For example:
panic: MSan: Uninitialized UMA memory from m_getm2+0x7fe

Sponsored by: The FreeBSD Foundation


# b0dfc486 09-Jul-2021 Mark Johnston <markj@FreeBSD.org>

uma: Fix a few problems with KASAN integration

- Ensure that all items returned by UMA are aligned to
KASAN_SHADOW_SCALE (8). This was true in practice since smaller
alignments are not used by any consumers, but we should enforce it
anyway.
- Use a non-zero code for marking redzones that appear naturally in
items that are not a multiple of the scale factor in size. Currently
we do not modify keg layouts to force the creation of redzones.
- Use a non-zero code for marking freed per-CPU items, otherwise
accesses of freed per-CPU items are not detected by the runtime.

Sponsored by: The FreeBSD Foundation


# 9a7c2de3 05-May-2021 Mark Johnston <markj@FreeBSD.org>

realloc: Fix KASAN(9) shadow map updates

When copying from the old buffer to the new buffer, we don't know the
requested size of the old allocation, but only the size of the
allocation provided by UMA. This value is "alloc". Because the copy
may access bytes in the old allocation's red zone, we must mark the full
allocation valid in the shadow map. Do so using the correct size.

Reported by: kp
Tested by: kp
Sponsored by: The FreeBSD Foundation


# 2760658b 02-May-2021 Alexander Motin <mav@FreeBSD.org>

Improve UMA cache reclamation.

When estimating the working set size, measure only allocation batches, not
free batches. Allocation and free patterns can be very different. For
example, on a vm_lowmem event ZFS can free a few gigabytes of memory to UMA
in one call, but that does not mean it will request the same amount back
that fast too; in fact it won't.

Update the working set size on every reclamation call, shrinking caches
faster under pressure. The lack of this caused repeating vm_lowmem events,
squeezing more and more memory out of real consumers only to leave it stuck
in UMA caches. I saw ZFS drop its ARC size in half before the previous
algorithm's periodic WSS update decided to reclaim the UMA caches.

Introduce voluntary reclamation of UMA caches not used for a long time. For
each zdom, track a long-term minimal cache size watermark, freeing some
unused items every UMA_TIMEOUT after the first 15 minutes without cache
misses. Freed memory can be put to better use by other consumers. For
example, ZFS won't grow its ARC unless it sees free memory, since it does
not know the memory is not really used. And even if the memory is not
really needed, periodically freeing it during inactivity periods should
reduce its fragmentation.

Reviewed by: markj, jeff (previous version)
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D29790


# aabe13f1 13-Apr-2021 Mark Johnston <markj@FreeBSD.org>

uma: Introduce per-domain reclamation functions

Make it possible to reclaim items from a specific NUMA domain.

- Add uma_zone_reclaim_domain() and uma_reclaim_domain().
- Permit parallel reclamations. Use a counter instead of a flag to
synchronize with zone_dtor().
- Use the zone lock to protect cache_shrink() now that parallel reclaims
can happen.
- Add a sysctl that can be used to trigger reclamation from a specific
domain.

Currently the new KPIs are unused, so there should be no functional
change.
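
A hedged sketch of the new KPIs as described above:

    /* Reclaim cached items belonging only to NUMA domain 'domain'. */
    uma_reclaim_domain(UMA_RECLAIM_DRAIN, domain);
    uma_zone_reclaim_domain(zone, UMA_RECLAIM_DRAIN, domain);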

Reviewed by: mav
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29685


# 54f421f9 09-Apr-2021 Mark Johnston <markj@FreeBSD.org>

uma: Split bucket_cache_drain() to permit per-domain reclamation

Note that the per-domain variant does not shrink the target bucket size.

No functional change intended.

MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# 09c8cb71 13-Apr-2021 Mark Johnston <markj@FreeBSD.org>

uma: Add KASAN state transitions

- Add a UMA_ZONE_NOKASAN flag to indicate that items from a particular
zone should not be sanitized. This is applied implicitly for NOFREE
and cache zones.
- Add KASAN call backs which get invoked:
1) when a slab is imported into a keg
2) when an item is allocated from a zone
3) when an item is freed to a zone
4) when a slab is freed back to the VM

In state transitions 1 and 3, memory is poisoned so that accesses will
trigger a panic. In state transitions 2 and 4, memory is marked
valid; a sketch of transitions 2 and 3 follows after this list.
- Disable trashing if KASAN is enabled. It just adds extra CPU overhead
to catch problems that are detected by KASAN.
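
A sketch of transitions 2 and 3 (kasan_mark() is from sys/asan.h; the
redzone code values and the variable names 'item', 'size' and 'rsize',
the item pointer, usable size and redzoned size, are assumptions):

    /* 2) Item allocated from a zone: make [item, item + size) valid. */
    kasan_mark(item, size, rsize, KASAN_GENERIC_REDZONE);

    /* 3) Item freed to a zone: poison the whole item. */
    kasan_mark(item, 0, rsize, KASAN_UMA_FREED);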

MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29456


# b8f7267d 10-Mar-2021 Kristof Provost <kp@FreeBSD.org>

uma: allow uma_zfree_pcpu(..., NULL)

We already allow free(NULL) and uma_zfree(..., NULL). Make
uma_zfree_pcpu(..., NULL) work as well.
This also means that counter_u64_free(NULL) will work.

These make cleanup code simpler.
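
For instance, a teardown path no longer needs NULL guards (the softc
fields and zone name here are hypothetical):

    /* Safe even if these were never allocated. */
    counter_u64_free(sc->rx_errors);
    uma_zfree_pcpu(my_pcpu_zone, sc->scratch);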

MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D29189


# 537f92cd 22-Feb-2021 Mark Johnston <markj@FreeBSD.org>

uma: Update the comment above startup_alloc() to reflect reality

The scheme used for early slab allocations changed in commit a81c400e75.

Reported by: alc
Reviewed by: alc
MFC after: 1 week


# 663de81f 03-Jan-2021 Mark Johnston <markj@FreeBSD.org>

uma: Avoid unmapping direct-mapped slabs

startup_alloc() uses pmap_map() to map slabs used for bootstrapping the
VM. pmap_map() may ignore the hint address and simply return a range
from the direct map. In this case we must not unmap the range in
startup_free().

UMA uses bootstart and bootmem to track the range of KVA into which
slabs are mapped if the direct map is not used. Unmap a startup slab
only if it was mapped into that range.

Reported by: alc
Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27885


# 942951ba 31-Dec-2020 Ryan Libby <rlibby@FreeBSD.org>

uma dbg: catch more corruption with atomics

Use atomic testandset and testandclear to catch concurrent double free,
and to reduce the number of atomic operations.

Submitted by: jeff
Reviewed by: cem, kib, markj (all previous version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22703


# e574d407 06-Dec-2020 Mark Johnston <markj@FreeBSD.org>

uma: Make uma_zone_set_maxcache() work better with small limits

The old implementation chose the largest bucket zone such that if the
per-CPU caches are fully populated, the total number of items cached is
no larger than the specified limit. If no such zone existed, UMA would
not do any caching.

We can now use uz_bucket_size_max to set a precise limit on the number
of items in a zone's bucket, so the total size of per-CPU caches can be
bounded more easily. Implement a new policy in uma_zone_set_maxcache():
choose a bucket size such that up to half of the limit can be cached in
per-CPU caches, with the rest going to the full bucket cache. This
fixes a problem with the kstack_cache zone: the limit of 4 * mp_ncpus
items meant that the zone would not do any caching, defeating the whole
purpose of the zone. That's because the smallest bucket size holds up
to 2 items and we may cache up to 3 full buckets per CPU, and
2 * 3 * mp_ncpus > 4 * mp_ncpus.
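
Hedged usage sketch for the kstack_cache case described above:

    /*
     * Cap cached items at 4 * mp_ncpus; under the new policy up to
     * half may sit in per-CPU buckets, the rest in the bucket cache.
     */
    uma_zone_set_maxcache(kstack_cache, 4 * mp_ncpus);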

Reported by: mjg
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27168


# f8b6c515 06-Dec-2020 Mark Johnston <markj@FreeBSD.org>

uma: Enforce the use of uz_bucket_size_max in the free path

uz_bucket_size_max is the maximum permitted bucket size. When filling a
new bucket to satisfy uma_zalloc(), the bucket is populated with at most
uz_bucket_size_max items. The maximum number of entries in the bucket
may be larger. When freeing items, however, we will fill per-CPU
buckets up to their maximum number of entries, potentially exceeding
uz_bucket_size_max. This makes it difficult to precisely limit the
number of items that may be cached in a zone. For example, if one wants
to limit buckets to 1 entry for a particular zone, that's not possible
since the smallest bucket holds up to 2 entries.

Try to solve the problem by using uz_bucket_size_max to limit the number
of entries in a bucket. Note that the ub_entries field is initialized
upon every bucket allocation. Most zones are not affected since they do
not impose any specific limit on the maximum bucket size.

While here, remove the UMA_ZONE_MINBUCKET flag. It was unused and we
now have uma_zone_set_maxcache() to control the zone's cache size more
precisely.

Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27167


# 8a6776ca 06-Dec-2020 Mark Johnston <markj@FreeBSD.org>

uma: Use atomic load for uz_sleepers

This field is updated locklessly.

Sponsored by: The FreeBSD Foundation


# 991f23ef 30-Nov-2020 Mark Johnston <markj@FreeBSD.org>

uma: Avoid allocating buckets with the cross-domain lock held

Allocation of a bucket can trigger a cross-domain free in the bucket
zone, e.g., if the per-CPU alloc bucket is empty, we free it and get
migrated to a remote domain. This can lead to deadlocks since a bucket
zone may allocate buckets from itself or a pair of bucket zones could be
allocating from each other.

Fix the problem by dropping the cross-domain lock before allocating a
new bucket and handling refill races. Use a list of empty buckets to
ensure that we can make forward progress.
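
The shape of the fix, roughly (lock macros as used internally by
uma_core.c; simplified, not the actual diff):

    /* Never allocate a bucket while holding the cross-domain lock. */
    ZONE_CROSS_UNLOCK(zone);
    b = bucket_alloc(zone, udata, M_NOWAIT);
    ZONE_CROSS_LOCK(zone);
    /* Recheck here: another thread may have refilled in the window. */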

Reported by: imp, mjg (witness(9) warnings)
Discussed with: jeff
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27341


# 431fb8ab 18-Nov-2020 Mark Johnston <markj@FreeBSD.org>

vm_phys: Try to clean up NUMA KPIs

It can be useful for code outside the VM system to look up the NUMA domain
of a page backing a virtual or physical address, specifically when
creating NUMA-aware data structures. We have _vm_phys_domain() for
this, but the leading underscore implies that it's an internal function,
and vm_phys.h has dependencies on a number of other headers.

Rename vm_phys_domain() to vm_page_domain(), and _vm_phys_domain() to
vm_phys_domain(). Make the latter an inline function.

Add _vm_phys.h and define struct vm_phys_seg there so that it's easier
to use in other headers. Include it from vm_page.h so that
vm_page_domain() can be defined there.

Include machine/vmparam.h from _vm_phys.h since it depends directly on
some constants defined there.
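
Hedged usage sketch of the renamed KPIs:

    #include <vm/vm_page.h>

    int d1 = vm_phys_domain(pa);    /* domain of a physical address */
    int d2 = vm_page_domain(m);     /* domain of a vm_page */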

Reviewed by: alc
Reviewed by: dougm, kib (earlier versions)
Differential Revision: https://reviews.freebsd.org/D27207


# 7b516613 10-Nov-2020 Jonathan T. Looney <jtl@FreeBSD.org>

When destroying a UMA zone which has a reserve (set with
uma_zone_reserve()), messages like the following appear on the console:
"Freed UMA keg (Test zone) was not empty (0 items). Lost 528 pages of
memory."

When keg_drain_domain() is draining the zone, it tries to keep the number
of items specified in the reservation. However, when we are destroying the
UMA zone, we do not need to keep those items. Therefore, when destroying a
non-secondary and non-cache zone, we should reset the keg reservation to 0
prior to draining the zone.

Reviewed by: markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27129


# 575a4437 19-Oct-2020 Ed Maste <emaste@FreeBSD.org>

uma: fix KTR message after r366840

Reported by: bz
Sponsored by: The FreeBSD Foundation


# f09cbea3 19-Oct-2020 Mark Johnston <markj@FreeBSD.org>

uma: Respect uk_reserve in keg_drain()

When a reserve of free items is configured for a zone, the reserve must
not be reclaimed under memory pressure. Modify keg_drain() to simply
respect the reserved pool.

While here remove an always-false uk_freef == NULL check (kegs that
shouldn't be drained should set _NOFREE instead), and make sure that the
keg_drain() KTR statement does not reference an uninitialized variable.

Reviewed by: alc, rlibby
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26772


# 1b2dcc8c 19-Oct-2020 Mark Johnston <markj@FreeBSD.org>

uma: Avoid depleting keg reserves when filling a bucket

zone_import() fetches a free or partially free slab from the keg and
then uses its items to populate an array, typically filling a bucket.
If a single allocation causes the keg to drop below its minimum reserve,
the inner loop ends. However, if the bucket is still not full and
M_USE_RESERVE is specified, the outer loop will continue to fetch items
from the keg.

If M_USE_RESERVE is specified and the number of free items is below the
reserved limit, we should return only a single item. Otherwise, if the
bucket size is larger than the reserve, all of the reserved items may
end up in a single per-CPU bucket, invisible to other CPUs.

Reviewed by: rlibby
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26771


# 6f3b523c 14-Oct-2020 Konstantin Belousov <kib@FreeBSD.org>

Avoid dump_avail[] redefinition.

Move dump_avail[] extern declaration and inlines into a new header
vm/vm_dumpset.h. This fixes default gcc build for mips.

Reviewed by: alc, scottph
Tested by: kevans (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26741


# 06d8bdcbf 02-Oct-2020 Mark Johnston <markj@FreeBSD.org>

uma: Use the bucket cache for cross-domain allocations

uma_zalloc_domain() allocates from the requested domain instead of
following a first-touch policy (the default for most zones). Currently
it is only used by malloc_domainset(), and consumers free returned items
with free(9) since r363834.

Previously uma_zalloc_domain() worked by always going to the keg for an
item. As a result, the use of UMA zone caches was unbalanced: we free
items to the caches, but always allocate from the keg, skipping the
caches.

Make some effort to allocate from the UMA caches when performing a
cross-domain allocation. This avoids blowing up the caches when
something is performing many transient allocations with
malloc_domainset().

Reported and tested by: dhw, glebius
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26427


# 5afdf5c1 02-Oct-2020 Mark Johnston <markj@FreeBSD.org>

uma: Use LIFO for non-SMR bucket caches

When SMR was introduced, zone_put_bucket() was changed to always place
full buckets at the end of the queue. However, it is generally
preferable to use recently used buckets since their items are more
likely to be resident in cache. So, for buckets that have no constraint
on item reuse, use a last-in-first-out ordering as we did before.
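
A sketch of the resulting insertion policy (field names from uma_int.h;
simplified):

    /* LIFO keeps reuse cache-hot; SMR zones need FIFO so that bucket
     * sequence numbers expire in order. */
    if ((zone->uz_flags & UMA_ZONE_SMR) == 0)
            STAILQ_INSERT_HEAD(&zdom->uzd_buckets, bucket, ub_link);
    else
            STAILQ_INSERT_TAIL(&zdom->uzd_buckets, bucket, ub_link);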

Reviewed by: rlibby
Tested by: dhw, glebius
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26426


# 952c8964 02-Oct-2020 Mark Johnston <markj@FreeBSD.org>

uma: Remove newlines from panic messages

Sponsored by: The FreeBSD Foundation


# 89d2fb14 08-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

Add interruptible variant of vm_wait(9), vm_wait_intr(9).

Also add msleep flags argument to vm_wait_doms(9).

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24652


# c3aa3bf9 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

vm: clean up empty lines in .c and .h files


# 791dda87 21-Aug-2020 Andrew Gallatin <gallatin@FreeBSD.org>

uma: record allocation failures due to zone limits

The zone limit mechanism was recently reworked, and
allocation failures due to limits being exceeded
were inadvertently no longer being recorded. This
would lead to, for example, mbuf allocation failures
not being indicated in netstat -m or vmstat -z.

Reviewed by: markj
Sponsored by: Netflix


# b21b022a 18-Aug-2020 Mark Johnston <markj@FreeBSD.org>

Revert r364310.

Some of the resulting fallout in CAM does not appear straightforward to
fix, so simply revert the commit for now in the absence of a better
solution.

Discussed with: mjg
Reported by: dhw


# 1921bb7b 17-Aug-2020 Gleb Smirnoff <glebius@FreeBSD.org>

With INVARIANTS panic immediately if M_WAITOK is requested in a
non-sleepable context. Previously only _sleep() would panic.
This will catch misuse of M_WAITOK at development stage rather
than at stress load stage.
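
One way to express such a check (a sketch under INVARIANTS, not
necessarily the commit's actual mechanism):

    /* Fail loudly at the allocation site instead of inside _sleep(). */
    if (flags & M_WAITOK)
            WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL,
                "%s: M_WAITOK in non-sleepable context", __func__);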

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D26027


# af32cefd 10-Aug-2020 Mark Johnston <markj@FreeBSD.org>

Check the UMA zone's full bucket cache before short-circuiting an alloc.

The global "bucketdisable" flag indicates that we are in a low memory
situation and should avoid allocating buckets. However, in the
allocation path we were checking it before the full bucket cache and
bailing even if the cache is non-empty. Defer the check so that we have
a shot at allocating from the cache.

This came up because M_NOWAIT allocations from the buf trie node zone
must always succeed. In one scenario, all of the preallocated trie
nodes were in the bucket list, and a new slab allocation could not
succeed due to a memory shortage. The short-circuiting caused an
allocation failure which triggered a panic.

Reported by: pho
Reviewed by: cem
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25980


# 96ad26ee 04-Aug-2020 Mark Johnston <markj@FreeBSD.org>

Remove free_domain() and uma_zfree_domain().

These functions were introduced before UMA started ensuring that freed
memory gets placed in domain-local caches. They no longer serve any
purpose since UMA now provides their functionality by default. Remove
them to simplify the kernel memory allocator interfaces a bit.

Reviewed by: cem, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25937


# 8c277118 28-Jun-2020 Mark Johnston <markj@FreeBSD.org>

Fix UMA's first-touch policy on systems with empty domains.

Suppose a thread is running on a CPU in a NUMA domain with no physical
RAM. When an item is freed to a first-touch zone, it ends up in the
cross-domain bucket. When the bucket is full, it gets placed in another
domain's bucket queue. However, when allocating an item, UMA will
always go to the keg upon a per-CPU cache miss because the empty
domain's bucket queue will always be empty. This means that a non-empty
domain's bucket queues can grow very rapidly on such systems. For
example, it can easily cause mbuf allocation failures when the zone
limit is reached.

Change cache_alloc() to follow a round-robin policy when running on an
empty domain.
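
The gist as a sketch (VM_DOMAIN_EMPTY() is from vm_pagequeue.h; the
round-robin helper is hypothetical):

    /* First-touch makes no sense on a memoryless domain. */
    domain = PCPU_GET(domain);
    if (VM_DOMAIN_EMPTY(domain))
            domain = zone_domain_next_rr(zone);     /* hypothetical */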

Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25355


# c8b0a88b 20-Jun-2020 Jeff Roberson <jeff@FreeBSD.org>

Clarify some language. Favor primary where both master and primary were
used in conjunction with secondary.


# 1c58c09f 29-May-2020 Mateusz Guzik <mjg@FreeBSD.org>

uma: hide item_domain under ifdef NUMA

Fixes build warnings on mips.


# 81302f1d 28-May-2020 Mark Johnston <markj@FreeBSD.org>

Fix boot on systems where NUMA domain 0 is unpopulated.

- Add vm_phys_early_add_seg(), complementing vm_phys_early_alloc(), to
ensure that segments registered during hammer_time() are placed in the
right domain. Otherwise, since the SRAT is not parsed at that point,
we just add them to domain 0, which may be incorrect and results in a
domain with only several MB worth of memory.
- Fix uma_startup1() to try allocating memory for zones from any domain.
If domain 0 is unpopulated, the allocation will simply fail, resulting
in a page fault slightly later during boot.
- Change _vm_phys_domain() to return -1 for addresses not covered by the
affinity table, and change vm_phys_early_alloc() to handle wildcard
domains. This is necessary on amd64, where the page array is dense
and pmap_page_array_startup() may allocate page table pages for
non-existent page frames.

Reported and tested by: Rafael Kitover <rkitover@gmail.com>
Reviewed by: cem (earlier version), kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25001


# dc2b3205 14-May-2020 Mark Johnston <markj@FreeBSD.org>

Allocate UMA per-CPU counters earlier.

Otherwise anything counted before SI_SUB_VM_CONF is discarded. However,
it is useful to be able to see stats from allocations done early during
boot.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D24756


# 54007ce8 07-Mar-2020 Mark Johnston <markj@FreeBSD.org>

Clean up uma_int.h a bit.

This makes it easier to write libkvm programs that access UMA data
structures.

- Remove a couple of unused slab functions and make others local to
uma_core.c. Similarly move SLAB_BITSETS, which affects the layout of
slab structures, to uma_core.c.
- Stop defining the slab structures under _KERNEL. There's no real
reason they can't be visible to userspace like the rest of UMA's
structures are.
- Group KEG_ASSERT_COLD with other keg macros.
- Convert an assertion about MAXMEMDOM to use _Static_assert.

No functional change intended.

Discussed with: jeff
Reviewed by: rlibby
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23980


# 3823a599 06-Mar-2020 Brooks Davis <brooks@FreeBSD.org>

Remove an apparently incorrect assertion.

Without this change mips64 fails to boot.

Discussed with: markj
Sponsored by: DARPA


# 7f746c9f 01-Mar-2020 Mateusz Guzik <mjg@FreeBSD.org>

vm: add debug to uma_zone_set_smr

Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D23902


# fe835cbf 27-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

A pair of performance improvements.

Swap buckets on free as well as alloc so that alloc is always the most
cache-hot data.

When selecting a zone domain for the round-robin bucket cache use the
local domain unless there is a severe imbalance. This does not affinitize
memory, only locks and queues.

Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D23824


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT.

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718


# eaa17d42 22-Feb-2020 Ryan Libby <rlibby@FreeBSD.org>

sys/vm: quiet -Wwrite-strings

Discussed with: kib
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D23796


# 0464f16e 22-Feb-2020 Mark Johnston <markj@FreeBSD.org>

Constify uma_zcache_create() and uma_zsecond_create()'s "name" argument.

It is already internally handled as a pointer to a const string, in
particular by uma_zcreate().

Fix indentation while here.

MFC after: 1 week


# 226dd6db 21-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Add an atomic-free tick moderated lazy update variant of SMR.

This enables very cheap read sections with free-to-use latencies and memory
overhead similar to epoch. On a recent AMD platform a read section cost
1ns vs 5ns for the default SMR. On Xeon the numbers should be more like 1
ns vs 11. The memory consumption should be proportional to the product
of the free rate and 2*1/hz while normal SMR consumption is proportional
to the product of free rate and maximum read section time.

While here refactor the code to make future additions more
straightforward.

Name the overall technique Global Unbound Sequences (GUS) and adjust some
comments accordingly. This helps distinguish discussions of the general
technique (SMR) vs this specific implementation (GUS).

Discussed with: rlibby, markj


# c6fd3e23 19-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Use per-domain locks for the bucket cache.

This gives much better concurrency when there are a large number of
cores per-domain and multiple domains. Avoid taking the lock entirely
if it will not be productive. ROUNDROBIN domains will have mixed
memory in each domain and will load balance to all domains.

While here refactor the zone/domain separation and bucket limits to
simplify callers.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D23673


# ed581bf6 16-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Add a simple accessor that returns the bytes of memory consumed by a zone.


# 70260874 16-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

UMA has become more particular about zone types. Use the right allocator
calls in uma_zwait().


# 6d88d784 15-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Slightly restructure uma_zalloc* to generate better code from clang and
reduce duplication among zalloc functions.

Reviewed by: markj
Discussed with: mjg
Differential Revision: https://reviews.freebsd.org/D23672


# cefc92e1 13-Feb-2020 Mark Johnston <markj@FreeBSD.org>

Update the zone-global count of cached items in bucket_cache_reclaim().

This was missed in r351673. The count is used to enforce cache limits,
which are rarely used.

Discussed with: jeff
Sponsored by: The FreeBSD Foundation


# 543117be 13-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Fix a case where ub_seq would fail to be set if the cross bucket was
flushed due to memory pressure.

Reviewed by: markj
Differential Revision: http://reviews.freebsd.org/D23614


# 3acb6572 12-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

Store offset into zpcpu allocations in the per-cpu area.

This shorten zpcpu_get and allows more optimizations.

Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D23570


# 4ab3aee8 11-Feb-2020 Mark Johnston <markj@FreeBSD.org>

Reduce lock hold time in keg_drain().

Maintain a count of free slabs in the per-domain keg structure and use
that to clear the free slab list in constant time for most cases. This
helps minimize lock contention induced by reclamation, in preparation
for proactive trimming of excesses of free memory.

Reviewed by: jeff, rlibby
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D23532


# bae55c4a 06-Feb-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: remove UMA_ZFLAG_CACHEONLY flag

UMA_ZFLAG_CACHEONLY was essentially the same thing as UMA_ZONE_VM, but
with a more confusing name. Remove the flag, make UMA_ZONE_VM an
inherit flag, and replace all references.

Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23516


# 33e5a1ea 04-Feb-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: multipage chicken switch

Add a switch to allow disabling multipage slabs, in order to facilitate
measuring memory usage and performance effects. The tunable
vm.debug.uma_multipage_slabs defaults to 1 and can be set to 0 to
disable. The name may change soon.

Reviewed by: markj (previous version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23487


# 27ca37ac 04-Feb-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: grow slabs to enforce minimum memory efficiency

Memory efficiency can be poor with awkward item sizes (e.g. 1/2 or 1
page size + epsilon). In order to achieve a minimum memory efficiency,
select a slab size with a potentially larger number of pages if it
yields a lower portion of waste.

This may mean using page_alloc instead of uma_small_alloc, which could
be more costly.
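
For instance, assuming 4 KB pages and a hypothetical 2720-byte item: a
single-page slab holds one item and wastes 1376 of 4096 bytes (34%),
while a three-page slab holds four items and wastes only 1408 of 12288
bytes (11%).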

Discussed with: jeff, mckusick
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23239


# ec0d8280 04-Feb-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: add UMA_ZONE_CONTIG, and a default contig_alloc

For now, copy the mbuf allocator.

Reviewed by: jeff, markj (previous version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23237


# 5ba16cf3 04-Feb-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: pcpu_page_free needs to startup_free pages from startup_alloc

After r357392, it is apparent that we do have some early-boot PCPU
zones. Make it so we can safely free pages from them if they are
actually used during early boot.

Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23496


# e84130a0 04-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Use literal bucket sizes for smaller buckets rather than the rounding
system. Small bucket sizes already pack well even if they are an odd
number of words. This prevents any potential new instances of the
problem fixed in r357463 as well as making the system easier to
understand.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D23494


# dc3915c8 03-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Use STAILQ instead of TAILQ for bucket lists. We only need FIFO behavior
and this is more space efficient.

Stop queueing recently used buckets to the head of the list. If the bucket
goes to a different processor the cache coherency will be more expensive.
We already try to encourage cache-hot behavior in the per-cpu layer.

Reviewed by: rlibby
Differential Revision: https://reviews.freebsd.org/D23493


# 36cb95c7 03-Feb-2020 Mark Johnston <markj@FreeBSD.org>

Disable the smallest UMA bucket size on 32-bit platforms.

With r357314, sizeof(struct uma_bucket) grew to 16 bytes on 32-bit
platforms, so BUCKET_SIZE(4) is 0. This resulted in the creation of a
bucket zone for buckets with zero capacity. A more general fix is
planned, but for now this bandaid allows 32-bit platforms to boot again.

PR: 243837
Discussed with: jeff
Reported by: pho, Jenkins via lwhsu
Tested by: pho
Sponsored by: The FreeBSD Foundation


# f96d4157 01-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Fix a bug in r356776 where the page allocator was not properly restored to
the percpu page allocator after it had been temporarily overridden by
startup_alloc.

Reported by: pho, bdragon


# 9e47b341 30-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Fix LINT build with MEMGUARD.


# d4665eaa 30-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Implement a safe memory reclamation feature that is tightly coupled with UMA.

This is in the same family of algorithms as Epoch/QSBR/RCU/PARSEC but is
a unique algorithm. This has 3x the performance of epoch in a write heavy
workload with less than half of the read side cost. The memory overhead
is significantly lessened by limiting the free-to-use latency. A synthetic
test uses 1/20th of the memory vs Epoch. There is significant further
discussion in the comments and code review.

This code should be considered experimental. I will write a man page after
it has settled. After further validation the VM will begin using this
feature to permit lockless page lookups.
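
A hedged read-side usage sketch (API names from sys/smr.h and vm/uma.h;
the zone and lookup table are illustrative):

    zone = uma_zcreate("obj", sizeof(struct obj), NULL, NULL, NULL,
        NULL, UMA_ALIGN_PTR, UMA_ZONE_SMR);

    /* Reader: memory freed via uma_zfree_smr() is not reused until
     * all readers that could observe it have left their sections. */
    smr_enter(uma_zone_get_smr(zone));
    obj = atomic_load_ptr(&table[i]);
    if (obj != NULL)
            consume(obj);
    smr_exit(uma_zone_get_smr(zone));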

Both markj and cperciva tested on arm64 at large core counts to verify
fences on weaker ordering architectures. I will commit a stress testing
tool in a follow-up.

Reviewed by: mmacy, markj, rlibby, hselasky
Discussed with: sbahara
Differential Revision: https://reviews.freebsd.org/D22586


# 8d1c459a 22-Jan-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: fix zone domain overlaying pcpu cache with disabled cpus

UMA zone structures have two arrays at the end which are sized according
to the machine: an array of CPU count length, and an array of NUMA
domain count length. The CPU counting was wrong in the case where some
CPUs are disabled (when mp_ncpus != mp_maxid + 1), and this caused the
second array to be overlaid with the first.
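
The sizing rule at issue, as a sketch: the per-CPU array must be
dimensioned by the highest CPU ID, not the CPU count:

    /* mp_maxid + 1 slots, since CPU IDs may be sparse when some CPUs
     * are disabled (mp_ncpus < mp_maxid + 1). */
    size = sizeof(struct uma_zone) +
        (mp_maxid + 1) * sizeof(struct uma_cache);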

Reported by: olivier
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23318


# 7e240677 22-Jan-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: report leaks more accurately

Previously UMA had some false negatives in the leak report at keg
destruction time, where it only reported leaks if there were free items
in the slab layer (rather than allocated items), which notably would not
be true for single-item slabs (large items). Now, report a leak if
there are any allocated pages, and calculate and report the number of
allocated items rather than free items.

Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23275


# 530cc6a2 22-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Some architectures with DMAP still consume boot kva. Simplify the test for
claiming kva in uma_startup2() to handle this.

Reported by: bdragon


# 20526802 18-Jan-2020 Andrew Gallatin <gallatin@FreeBSD.org>

pcpu_page_alloc: guard against empty NUMA domains

Some systems, such as higher-end Threadripper, may have
NUMA domains with no physical memory. Don't allocate
from these domains.

This fixes a "panic: vm_wait in early boot" on my 2990WX desktop

Reviewed by: jeff
Sponsored by: Netflix


# a81c400e 15-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Simplify VM and UMA startup by eliminating boot pages. Instead use careful
ordering to allocate early pages in the same way boot pages were but only
as needed. After the KVA allocator has started up we allocate the KVA that
we consumed during boot. This also makes the boot pages freeable since they
have vm_page structures allocated with the rest of memory.

Parts of this patch were written and tested by markj.

Reviewed by: glebius, markj
Differential Revision: https://reviews.freebsd.org/D23102


# 9b8db4d0 13-Jan-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: split slabzone into two sizes

By allowing more items per slab, we can improve memory efficiency for
small allocs. If we were just to increase the bitmap size of the
slabzone, we would then waste slabzone memory. So, split slabzone into
two zones, one especially for 8-byte allocs (512 per slab). The
practical effect should be reduced memory usage for counter(9).

Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23149


# e63a1c2f 13-Jan-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: fixup some ktr messages

Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23148


# a314aba8 11-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vm: add missing CLTFLAG_MPSAFE annotations

This covers all vm/* files.


# 860bb7a0 09-Jan-2020 Mark Johnston <markj@FreeBSD.org>

UMA: Don't destroy zones after the system shutdown process starts.

Some kernel subsystems, notably ZFS, will destroy UMA zones from a
shutdown eventhandler. This causes the zone to be drained. For slabs
that are mapped into KVA this can be very expensive and so it needlessly
delays the shutdown process.

Add a new state to the "booted" variable, BOOT_SHUTDOWN. Once
kern_reboot() starts invoking shutdown handlers, turn uma_zdestroy()
into a no-op, provided that the zone does not have a custom finalization
routine.

PR: 242427
Reviewed by: jeff, kib, rlibby
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23066


# 4a8b575c 08-Jan-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: unify layout paths and improve efficiency

Unify the keg layout selection paths (keg_small_init, keg_large_init,
keg_cachespread_init), and slightly improve memory efficiency by:
- using the padding of the final item to store the slab header,
- not going OFFPAGE if we have a choice unless it improves efficiency.

Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23048


# 54c5ae80 08-Jan-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: reorganize flags

- Garbage collect UMA_ZONE_PAGEABLE & UMA_ZONE_STATIC.
- Move flag VTOSLAB from public to private.
- Introduce public NOTPAGE flag and make HASH private.
- Introduce public NOTOUCH flag and make OFFPAGE private.
- Update man page.

The net effect of this should be to make the contract with clients more
clear. Clients should choose constraints, UMA will figure out how to
implement them. This also breaks the confusing double meaning of
OFFPAGE.

Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23016


# 79c9f942 05-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Fix uma boot pages calculations on NUMA machines that also don't have
UMA_MD_SMALL_ALLOC. This is unusual but not impossible. Fix the alignment
of zones while here. This was already correct because uz_cpu strongly
aligned the zone structure but the specified alignment did not match
reality and involved redundant defines.

Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D23046


# bfb6b7a1 05-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

The fix in r356353 was insufficient. Not every architecture returns 0 for
EARLY_COUNTER. Only amd64 seems to.

Suggested by: markj
Reported by: lwhsu
Reviewed by: markj
PR: 243117


# 31c251a0 04-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Fix an assertion introduced in r356348. On architectures without
UMA_MD_SMALL_ALLOC vmem has a more complicated startup sequence that
violated the new assert. Resolve this by rewriting the COLD asserts to
look at the per-cpu allocation counts for evidence of api activity.

Discussed with: rlibby
Reviewed by: markj
Reported by: lwhsu


# dfe13344 04-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

UMA NUMA flag day. UMA_ZONE_NUMA was a source of confusion. Make the names
more consistent with other NUMA features as UMA_ZONE_FIRSTTOUCH and
UMA_ZONE_ROUNDROBIN. The system will now select a default depending
on kernel configuration. API users need only specify one if they want to
override the default.

Remove the UMA_XDOMAIN and UMA_FIRSTTOUCH kernel options and key only off
of NUMA. XDOMAIN is now fast enough in all cases to enable whenever NUMA
is.

Reviewed by: markj
Discussed with: rlibby
Differential Revision: https://reviews.freebsd.org/D22831


# 91d947bf 04-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Sort cross-domain frees into per-domain buckets before inserting these
onto their respective bucket lists. This is a several order of magnitude
improvement in contention on the keg lock under heavy free traffic while
requiring only an additional bucket per-domain worth of memory.

Discussed with: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22830


# 8b987a77 03-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Use per-domain keg locks. This provides both a lock and separate space
accounting for each NUMA domain. Independent keg domain locks are important
with cross-domain frees. Hashed zones are non-numa and use a single keg
lock to protect the hash table.

Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22829


# 727c6918 03-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Use a separate lock for the zone and keg. This provides concurrency
between populating buckets from the slab layer and fetching full buckets
from the zone layer. Eliminate some nonsense locking patterns where
we lock to fetch a single variable.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22828


# 4bd61e19 03-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Use atomics for the zone limit and sleeper count. This relies on the
sleepq to serialize sleepers. This patch retains the existing sleep/wakeup
paradigm to limit 'thundering herd' wakeups. It resolves a missing wakeup
in one case but otherwise should be bug for bug compatible. In particular,
there are still various races surrounding adjusting the limit via sysctl
that are now documented.

Discussed with: markj
Reviewed by: rlibby
Differential Revision: https://reviews.freebsd.org/D22827


# cc7ce83a 25-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Further reduce the cacheline footprint of fast allocations by duplicating
the zone size and flags fields in the per-cpu caches. This allows fast
allocations to proceed only touching the single per-cpu cacheline and
simplifies the common case when no ctor/dtor is specified.

Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22826


# 376b1ba3 25-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Optimize fast path allocations by storing bucket headers in the per-cpu
cache area. This allows us to check on bucket space for all per-cpu
buckets with a single cacheline access and fewer branches.

Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22825


# 3639ac42 25-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Fix a bug with NUMA domains introduced in r339686. When M_NOWAIT is
specified there was no loop termination condition in keg_fetch_slab().

Reported by: pho
Reviewed by: markj


# 815db204 13-Dec-2019 Ryan Libby <rlibby@FreeBSD.org>

uma dbg: flexible size for slab debug bitset too

Recently (r355315) the size of the struct uma_slab bitset field us_free
became dynamic instead of conservative. Now, make the debug bitset
size dynamic too. The debug bitset is INVARIANTS-only, so in fact we
don't care too much about the space savings that results from this, but
enabling minimally-sized slabs on INVARIANTS builds is still important
in order to be able to test new slab layouts effectively.

Reviewed by: jeff (previous version), markj (previous version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22759


# 325c4ced 13-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Restore the reservation of boot pages for bucket zones after r355707.

uma_startup2() sets booted = BOOT_BUCKETS after calling bucket_init(),
but before that assignment, startup_alloc() will use pages from the
reserved pool, so the bucket zones themselves are still allocated using
startup pages.

Reviewed by: rlibby
Reported by: Jenkins via lwhsu
Differential Revision: https://reviews.freebsd.org/D22797


# d82c8ffb 13-Dec-2019 Ryan Libby <rlibby@FreeBSD.org>

Revert r355706 & r355710

The quick fix didn't work. I'll sort it out tomorrow.

Revert r355710: "libmemstat: unbreak build"
Revert r355706: "uma dbg: flexible size for slab debug bitset too"


# f7af5015 13-Dec-2019 Ryan Libby <rlibby@FreeBSD.org>

uma: report slab efficiency

Reviewed by: jeff
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22766


# 3182660a 13-Dec-2019 Ryan Libby <rlibby@FreeBSD.org>

uma: delay bucket_init() until we might actually enable buckets

This helps with a bootstrapping problem in upcoming work.

We don't first enable buckets until uma_startup2(), so we can delay
bucket creation until then. The other two paths to bucket_enable() are
both later, one in the pageout daemon (SI_SUB_KTHREAD_PAGE vs SI_SUB_VM)
and one in uma_timeout() (first activated in uma_startup3()). Note that
although some bucket functions are accessible before uma_startup2()
(e.g. bucket_select() in zone_ctor()), none of them inspect ubz_zone.

Discussed with: jeff
Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22765


# 7508f15f 13-Dec-2019 Ryan Libby <rlibby@FreeBSD.org>

uma dbg: flexible size for slab debug bitset too

Recently (r355315) the size of the struct uma_slab bitset field us_free
became dynamic instead of conservative. Now, make the debug bitset
size dynamic too. The debug bitset is INVARIANTS-only, so in fact we
don't care too much about the space savings that results from this, but
enabling minimally-sized slabs on INVARIANTS builds is still important
in order to be able to test new slab layouts effectively.

Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22759


# 6d204a6a 10-Dec-2019 Ryan Libby <rlibby@FreeBSD.org>

uma: pretty print zone flags sysctl

Requested by: jeff
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22748


# 3b490537 07-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Fix two problems with r355149. The sysctl name collision code assumed that
zones would never be freed. In the case of tmpfs this was not true. While
here, test for the right bit to disable the keg-related sysctls for zones
that don't have kegs.

Reported by: pho
Reviewed by: rlibby
Differential Revision: https://reviews.freebsd.org/D22655


# 1e0701e1 07-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Use a variant slab structure for offpage zones. This saves space in
embedded slabs but also is an opportunity to tidy up code and add
accessor inlines.

Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22609


# b75c4efc 04-Dec-2019 Andrew Turner <andrew@FreeBSD.org>

Fix the signature for zone_import and zone_release

These are cast to uma_import and uma_release functions. Use the signature
for these in the zone functions.

This was found with an experimental Kernel CFI. It will complain if the
signature is different than what a function pointer expects. The
simplest way to fix these is to correct the signature.

Reviewed by: rlibby
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D22671


# 9b78b1f4 02-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Use a precise bit count for the slab free items in UMA. This significantly
shrinks embedded slab structures.

Reviewed by: markj, rlibby (prior version)
Differential Revision: https://reviews.freebsd.org/D22584


# 6d6a03d7 28-Nov-2019 Jeff Roberson <jeff@FreeBSD.org>

Handle large mallocs by going directly to kmem. Taking a detour through
UMA does not provide any additional value.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22563


# 584061b4 28-Nov-2019 Jeff Roberson <jeff@FreeBSD.org>

Garbage collect the mostly unused us_keg field. Use appropriately named
union members in vm_page.h to store the zone and slab. Remove some nearby
dead code.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22564


# 35ec24f3 27-Nov-2019 Ryan Libby <rlibby@FreeBSD.org>

uma: move sysctl vm.uma defn out from under INVARIANTS

Fix non-INVARIANTS builds after r355149.

Reported by: Michael Butler <imb@protected-networks.net>
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22588


# 20a4e154 27-Nov-2019 Jeff Roberson <jeff@FreeBSD.org>

Implement a sysctl tree for uma zones to assist in debugging and provide
more statistics than are exported via the ABI-stable vmstat interface.
Rename uz_count to uz_bucket_size because even I was confused by the
name after returning to the source years later.

Reviewed by: rlibby
Differential Revision: https://reviews.freebsd.org/D22554


# 0a81b439 27-Nov-2019 Jeff Roberson <jeff@FreeBSD.org>

Refactor uma_zfree_arg into several functions to make control flow more
clear and icache usage cleaner.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22491


# ca293436 27-Nov-2019 Ryan Libby <rlibby@FreeBSD.org>

uma: trash memory when ctor/dtor supplied too

On INVARIANTS kernels, UMA has a use-after-free detection mechanism.
This mechanism previously required that all of the ctor/dtor/uminit/fini
arguments to uma_zcreate() be NULL in order to function. Now, it only
requires that uminit and fini be NULL; now, the trash ctor and dtor will
be called in addition to any supplied ctor or dtor.

Also do a little refactoring for readability of the resulting logic.

This enables use-after-free detection for more zones, and will allow for
simplification of some callers that worked around the previous
restriction (see kern_mbuf.c).

Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20722
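
For illustration, a minimal sketch of a zone that now gets trash checking
even though it supplies its own ctor/dtor; struct foo, foo_ctor and
foo_dtor are hypothetical names:

struct foo { int f_x; };
static int foo_ctor(void *mem, int size, void *arg, int flags) { return (0); }
static void foo_dtor(void *mem, int size, void *arg) { }

/* uminit and fini are NULL, so on INVARIANTS kernels the trash ctor and
 * dtor now run in addition to foo_ctor and foo_dtor. */
zone = uma_zcreate("foo", sizeof(struct foo), foo_ctor, foo_dtor,
    NULL, NULL, UMA_ALIGN_PTR, 0);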


# beb8beef 26-Nov-2019 Jeff Roberson <jeff@FreeBSD.org>

Refactor uma_zalloc_arg(). It is a mess of gotos and code which doesn't
make sense after many partial refactors. Attempt to make a smaller cache
footprint for the fast path.

Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22470


# 003cf08b 22-Nov-2019 Mark Johnston <markj@FreeBSD.org>

Revise the page cache size policy.

In r353734 the use of the page caches was limited to systems with a
relatively large amount of RAM per CPU. This was to mitigate some
issues reported with the system not able to keep up with memory pressure
in cases where it had been able to do so prior to the addition of the
direct free pool cache. This change re-enables those caches.

The change modifies uma_zone_set_maxcache(), which was introduced
specifically for the page cache zones. Rather than using it to limit
only the full bucket cache, have it also set uz_count_max to provide an
upper bound on the per-CPU cache size that is consistent with the number
of items requested. Remove its return value since it has no use.

Enable the page cache zones unconditionally, and limit them to 0.1% of
the domain's pages. The limit can be overridden by the
vm.pgcache_zone_max tunable as before.

Change the item size parameter passed to uma_zcache_create() to the
correct size, and stop setting UMA_ZONE_MAXBUCKET. This allows the page
cache buckets to be adaptively sized, like the rest of UMA's caches.
This also causes the initial bucket size to be small, so only systems
which benefit from large caches will get them.

Reviewed by: gallatin, jeff
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22393


# 71353f7a 19-Nov-2019 Jeff Roberson <jeff@FreeBSD.org>

When we set OFFPAGE to limit fragmentation we should also set VTOSLAB
so that we avoid the hashtables. The hashtable is now only required if
a zone is created with OFFPAGE specified initially, not internally. This
flag signals to UMA that it can't touch the allocated memory and so
can't store a slab pointer in the containing page.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22453


# 08034d10 10-Nov-2019 Konstantin Belousov <kib@FreeBSD.org>

Include cache zones into zone_foreach() where appropriate.

r354367 is reverted since it is subsumed by this more complete approach.

Suggested by: markj
Reviewed by: alc, glebius, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D22242


# 432fc36d 05-Nov-2019 Konstantin Belousov <kib@FreeBSD.org>

Switch cache zones from early counters to real implementation.

The early counter mock can only be used on the BSP on amd64; when APs try
to update it, random memory corruption results.

N.B. This is a temporary patch to plug the corruption for now, while
a proper solution for handling cache zones in zone_foreach() is being
developed.

In collaboration with: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation, Mellanox Technologies


# 1de9724e 22-Oct-2019 Mark Johnston <markj@FreeBSD.org>

Avoid reloading bucket pointers in uma_vm_zone_stats().

The correctness of per-CPU cache accounting in that function is
dependent on reading per-CPU pointers exactly once. Ensure that
the compiler does not emit multiple loads of those pointers.

Reported and tested by: pho
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22081


# 0223790f 11-Oct-2019 Conrad Meyer <cem@FreeBSD.org>

Fix braino in r353429

cy@ points out that I got parameter order backwards between definition and
invocation of the helper function. He is totally correct. The earlier
version of this patch predated the XFree column so this is one I introduced,
rather than the original author.

Submitted by: cy
Reported by: cy
X-MFC-With: r353429


# 46d70077 10-Oct-2019 Conrad Meyer <cem@FreeBSD.org>

ddb: Add CSV option, sorting to 'show (malloc|uma)'

Add /i option for machine-parseable CSV output. This allows ready
copy/pasting into more sophisticated tooling outside of DDB.

Add total zone size ("Memory Use") as a new column for UMA.

For both, sort the displayed list on size (print the largest zones/types
first). This is handy for quickly diagnosing "where has my memory gone?" at
a high level.

Submitted by: Emily Pettigrew <Emily.Pettigrew AT isilon.com> (earlier version)
Sponsored by: Dell EMC Isilon


# 08cfa56e 01-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Extend uma_reclaim() to permit different reclamation targets.

The page daemon periodically invokes uma_reclaim() to reclaim cached
items from each zone when the system is under memory pressure. This
is important since the size of these caches is unbounded by default.
However it also results in bursts of high latency when allocating from
heavily used zones as threads miss in the per-CPU caches and must
access the keg in order to allocate new items.

With r340405 we maintain an estimate of each zone's usage of its
(per-NUMA domain) cache of full buckets. Start making use of this
estimate to avoid reclaiming the entire cache when under memory
pressure. In particular, introduce TRIM, DRAIN and DRAIN_CPU
verbs for uma_reclaim() and uma_zone_reclaim(). When trimming, only
items in excess of the estimate are reclaimed. Draining a zone
reclaims all of the cached full buckets (the previous behaviour of
uma_reclaim()), and may further drain the per-CPU caches in extreme
cases.

Now, when under memory pressure, the page daemon will trim zones
rather than draining them. As a result, heavily used zones do not incur
bursts of bucket cache misses following reclamation, but large, unused
caches will be reclaimed as before.

Reviewed by: jeff
Tested by: pho (an earlier version)
MFC after: 2 months
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D16667
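
For illustration, a sketch of how a caller might use the new verbs,
assuming the uma.h spellings UMA_RECLAIM_TRIM, UMA_RECLAIM_DRAIN and
UMA_RECLAIM_DRAIN_CPU:

/* Trim all zones down to their working-set estimates. */
uma_reclaim(UMA_RECLAIM_TRIM);

/* Drain one zone's cache of full buckets entirely. */
uma_zone_reclaim(zone, UMA_RECLAIM_DRAIN);

/* Additionally flush the per-CPU caches; the most aggressive verb. */
uma_zone_reclaim(zone, UMA_RECLAIM_DRAIN_CPU);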


# b48d4efe 25-Aug-2019 Mark Johnston <markj@FreeBSD.org>

Handle UMA_ANYDOMAIN in kstack_import().

The kernel thread stack zone performs first-touch allocations by
default, and must handle the case where the local memory domain
is empty. For most UMA zones this is handled in the keg layer,
but cache zones currently must implement a policy for this case.
Simply use a round-robin policy if UMA_ANYDOMAIN is passed.

Reported and tested by: bcran
Reviewed by: kib
Sponsored by: The FreeBSD Foundation


# eda1b016 06-Aug-2019 Jeff Roberson <jeff@FreeBSD.org>

Implement a MINBUCKET zone flag so we can use minimal caching on zones that
may be expensive to cache.

Reviewed by: markj, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20930


# c1685086 06-Aug-2019 Jeff Roberson <jeff@FreeBSD.org>

Add two new kernel options to control memory locality on NUMA hardware.
- UMA_XDOMAIN enables an additional per-cpu bucket for freed memory that
was freed on a different domain from where it was allocated. This is
only used for UMA_ZONE_NUMA (first-touch) zones.
- UMA_FIRSTTOUCH sets the default UMA policy to be first-touch for all
zones. This tries to maintain locality for kernel memory.

Reviewed by: gallatin, alc, kib
Tested by: pho, gallatin
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20929


# 88ea538a 07-Jun-2019 Mark Johnston <markj@FreeBSD.org>

Replace uses of vm_page_unwire(m, PQ_NONE) with vm_page_unwire_noq(m).

These calls are not the same in general: the former will dequeue the
page if it is enqueued, while the latter will just leave it alone. But,
all existing uses of the former apply to unmanaged pages, which are
never enqueued in the first place. No functional change intended.

Reviewed by: kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20470


# 3b2f2cb8 06-Jun-2019 Alexander Motin <mav@FreeBSD.org>

Allow UMA hash tables to expand faster than 2x in 20 seconds.

ZFS ABD allocates tons of 4KB chunks via UMA, requiring huge hash tables.
With an initial hash table size of only 32 elements it takes ~20 expansions,
or ~400 seconds, to adapt to handling 220GB of ZFS ARC. During that time not
only is the hash table highly inefficient, but each of those expansions also
takes significant time with the lock held, blocking operation.

On my test system with 256GB of RAM and a ZFS pool of 28 HDDs this change
reduces the time needed to read 240GB for the first time from ~300-400s,
during which the system is quite busy and unresponsive, to only ~150s with
light CPU load and just 5 sub-second CPU spikes to expand the hash table.

MFC after: 2 weeks
Sponsored by: iXsystems, Inc.


# fbd95859 06-Jun-2019 Mark Johnston <markj@FreeBSD.org>

Add sysctls for uma_kmem_{limit,total}.

Reviewed by: alc, dougm, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20514


# 058f0f74 06-Jun-2019 Mark Johnston <markj@FreeBSD.org>

Remove the volatile qualifer from uma_kmem_total.

No functional change intended.

Reviewed by: alc, dougm, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20514


# 4a9f6ba7 29-May-2019 Gleb Smirnoff <glebius@FreeBSD.org>

In r343857 the comment referred to here was moved to uma_vm_zone_stats().


# 323ad386 11-Apr-2019 Tycho Nightingale <tychon@FreeBSD.org>

for a cache-only zone the destructor tries to destroy a non-existent keg

Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D19835


# 6929b7d1 11-Feb-2019 Pedro F. Giffuni <pfg@FreeBSD.org>

UMA: unsign some variables related to allocation in hash_alloc().

As a followup to r343673, unsign some variables related to allocation,
since the hashsize cannot be negative. This gives a bit more space to
handle bigger allocations and avoids some implicit casting.

While here also unsign uh_hashmask, it makes little sense to keep that
signed.

MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19148


# ad66f958 06-Feb-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Now that there is only one way to allocate a slab, remove uz_slab method.

Discussed with: jeff


# b47acb0a 06-Feb-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Report cache zones in the UMA stats sysctl that 'vmstat -z' uses. This
should have been part of r251826.


# 59568a0e 01-Feb-2019 Alexander Motin <mav@FreeBSD.org>

Fix integer math overflow in UMA hash_alloc().

512GB of ZFS ABD ARC means an abd_chunk zone of 128M 4KB items. To manage
them UMA tries to allocate a 2GB hash table, whose size does not fit into
an int variable, causing a later allocation failure that makes the ARC
shrink back below 512GB, never letting it use more RAM. With this change I
easily reached a >700GB ARC size on a 768GB RAM machine.

MFC after: 1 week
Sponsored by: iXsystems, Inc.
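
The arithmetic, spelled out: 128M items is 2^27, and a 2GB table implies
16 (2^4) bytes of hash state per item, so the product is 2^31 bytes, one
past INT_MAX. A hypothetical sketch of the overflow and the class of fix:

/* 2^27 * 2^4 = 2^31: overflows a signed 32-bit int. */
int bad = 128 * 1024 * 1024 * 16;
/* Computed in a wide unsigned type, 2GB fits fine. */
size_t ok = (size_t)128 * 1024 * 1024 * 16;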


# 37125720 31-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

In zone_alloc_bucket() the max argument was calculated based on uz_count.
Then bucket_alloc() also selects the bucket size based on uz_count. However,
since the zone lock is dropped, uz_count may decrease. In this case max may
be greater than ub_entries, which would result in writing beyond the end
of the allocation.

Reported by: pho


# 86220393 23-Jan-2019 Mark Johnston <markj@FreeBSD.org>

Correct uma_prealloc()'s use of domainset iterators after r339925.

The iterator should be reinitialized after every successful slab
allocation. A request to advance the iterator is interpreted as
an allocation failure, so a sufficiently large preallocation would
cause the iterator to believe that all domains were exhausted,
resulting in a sleep with the keg lock held. [1]

Also, keg_alloc_slab() should pass the unmodified wait flag to the
item initialization routine, which may use it to perform allocations
from other zones.

Reported and tested by: slavah
Diagnosed by: kib [1]
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# e7e4bcd8 15-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

style(9): break long line.


# f8c86a5f 15-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Remove a harmless leftover from code that cycled over a zone's kegs: just
use + instead of +=. There is no functional change.


# bb45b411 15-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Only do uz_items accounting for zones that have a limit set in uz_max_items.
This reduces the amount of locking required for these zones.

Also, for cache-only zones (UMA_ZFLAG_CACHE) the uz_items accounting wasn't
correct at all, since they may allocate items directly from their backing
store and then free them via UMA, underflowing uz_items.

Tested by: pho


# 2efcc8cb 15-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Make uz_allocs, uz_frees and uz_fails counter(9). This removes some
atomic updates and reduces amount of data protected by zone lock.

During startup point these fields to EARLY_COUNTER. After startup
allocate them for all early zones.

Tested by: pho


# 5a8eee2b 14-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Fix compilation on 32-bit.


# bb15d1c7 14-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

o Move zone limit from keg level up to zone level. This means that now
two zones sharing a keg may have different limits. Now this is going
to work:

zone = uma_zcreate();
uma_zone_set_max(zone, limit);
zone2 = uma_zsecond_create(zone);
uma_zone_set_max(zone2, limit2);

Kegs no longer have a uk_maxpages field, but zones have uz_items. When
set, it may be rounded up to the minimum possible CPU bucket cache size.
For small limits the bucket cache can also be reconfigured to be smaller.
The uz_items counter is updated whenever items transition from the keg to
a bucket cache or directly to a consumer. If a zone has uz_max_items set
and it is reached, then we are going to sleep.

o Since the new limits don't play well with multi-keg zones, remove them.
The idea of multi-keg zones was introduced exactly 10 years ago, and never
had any practical usage. In discussion with Jeff we came to a wild
agreement that if we ever want to reintroduce the idea of a smart allocator
that would be able to choose between two (or more) totally different
backing stores, that choice should be made one level higher than UMA,
e.g. in malloc(9) or in mget() or whatever, and the choice should be
controlled by the caller.

o Sleeping code is improved to count the number of sleepers and wake them
one by one, to avoid the thundering herd problem.

o Flag UMA_ZONE_NOBUCKETCACHE is removed; instead the uma_zone_set_maxcache()
KPI is added. Having no bucket cache basically means setting maxcache to 0.

o Now with many fields added and many removed (no multi-keg zones!) make
sure that struct uma_zone is perfectly aligned.

Reviewed by: markj, jeff
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D17773
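
A minimal usage sketch of the new KPI (its original return value was later
removed, per the 22-Nov-2019 entry above):

/* No bucket cache at all, as UMA_ZONE_NOBUCKETCACHE used to request. */
uma_zone_set_maxcache(zone, 0);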


# 0b2e3aea 28-Nov-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Fix yet another edge case in uma_startup_count(). If the zone size fits
into several pages but leaves no space for struct uma_slab at the end, we
miscalculate the number of pages by one. Mimic the keg_large_init() math
exactly here to cover that problem.

Reported by: gallatin


# 3d5e3df7 28-Nov-2018 Gleb Smirnoff <glebius@FreeBSD.org>

For non-offpage zones the slab is placed at the end of the page. The keg's
uk_pgoff is calculated to guarantee that struct uma_slab is placed at
pointer-size alignment. The calculation of the real struct uma_slab size is
done in keg_ctor() and yet again in keg_large_init(), to check whether we
need an extra page. This calculation can actually be performed at compile
time.

- Add SIZEOF_UMA_SLAB macro to calculate size of struct uma_slab placed at
an end of a page with alignment requirement.
- Use SIZEOF_UMA_SLAB in keg_ctor() and in keg_large_init(). This is not
a functional change.
- Use SIZEOF_UMA_SLAB in UMA_SLAB_SPACE definition and in keg_small_init().
This is a potential bugfix, but in reality I don't think there are any
systems affected, since compiler aligns struct uma_slab anyway.


# 0f9b7bf3 13-Nov-2018 Mark Johnston <markj@FreeBSD.org>

Add accounting to per-domain UMA full bucket caches.

In particular, track the current size of the cache and maintain an
estimate of its working set size. This will be used to decide how
much to shrink various caches when the kernel attempts to reclaim
pages. As a secondary effect, it makes statistics aggregation (done
by, e.g., vmstat -z) cheaper since sysctl_vm_zone_stats() no longer
needs to iterate over lists of cached buckets.

Discussed with: alc, glebius, jeff
Tested by: pho (previous version)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D16666


# 9978bd99 30-Oct-2018 Mark Johnston <markj@FreeBSD.org>

Add malloc_domainset(9) and _domainset variants to other allocator KPIs.

Remove malloc_domain(9) and most other _domain KPIs added in r327900.
The new functions allow the caller to specify a general NUMA domain
selection policy, rather than specifically requesting an allocation from
a specific domain. The latter policy tends to interact poorly with
M_WAITOK, resulting in situations where a caller is blocked indefinitely
because the specified domain is depleted. Most existing consumers of
the _domain KPIs are converted to instead use a DOMAINSET_PREF() policy,
in which we fall back to other domains to satisfy the allocation
request.

This change also defines a set of DOMAINSET_FIXED() policies, which
only permit allocations from the specified domain.

Discussed with: gallatin, jeff
Reported and tested by: pho (previous version)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17418
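
For illustration, a sketch of the replacement KPI with a preferred-domain
policy (M_DEVBUF and the surrounding variable names are arbitrary here):

/* Prefer 'domain', but fall back to other domains rather than block
 * indefinitely on a depleted one. */
buf = malloc_domainset(size, M_DEVBUF, DOMAINSET_PREF(domain), M_WAITOK);
free(buf, M_DEVBUF);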


# 920239ef 30-Oct-2018 Mark Johnston <markj@FreeBSD.org>

Fix some problems that manifest when NUMA domain 0 is empty.

- In uma_prealloc(), we need to check for an empty domain before the
first allocation attempt, not after. Fix this by switching
uma_prealloc() to use a vm_domainset iterator, which addresses the
secondary issue of using a signed domain identifier in round-robin
iteration.
- Don't automatically create a page daemon for domain 0.
- In domainset_empty_vm(), recompute ds_cnt and ds_order after
excluding empty domains; otherwise we may frequently specify an empty
domain when calling in to the page allocator, wasting CPU time.
Convert DOMAINSET_PREF() policies for empty domains to round-robin.
- When freeing bootstrap pages, don't count them towards the per-domain
total page counts for now: some vm_phys segments are created before
the SRAT is parsed and are thus always identified as being in domain 0
even when they are not. Then, when bootstrap pages are freed, they
are added to a domain that we had previously thought was empty. Until
this is corrected, we simply exclude them from the per-domain page
count.

Reported and tested by: Rajesh Kumar <rajfbsd@gmail.com>
Reviewed by: gallatin
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17704


# 194a979e 24-Oct-2018 Mark Johnston <markj@FreeBSD.org>

Use a vm_domainset iterator in keg_fetch_slab().

Previously, it used a hand-rolled round-robin iterator. This meant that
the minskip logic in r338507 didn't apply to UMA allocations, and also
meant that we would call vm_wait() for individual domains rather than
permitting an allocation from any domain with sufficient free pages.

Discussed with: jeff
Tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17420


# 81c0d72c 22-Oct-2018 Gleb Smirnoff <glebius@FreeBSD.org>

If we lost a race or were migrated during bucket allocation for the per-CPU
cache, then we put the new bucket on the generic bucket cache. However, the
code didn't honor the UMA_ZONE_NOBUCKETCACHE flag, so potentially we could
start a cache on a zone that clearly forbids it. Fix this.

Reviewed by: markj


# 30c5525b 01-Oct-2018 Andrew Gallatin <gallatin@FreeBSD.org>

Allow empty NUMA memory domains to support Threadripper2

The AMD Threadripper 2990WX is basically a slightly crippled Epyc.
Rather than having 4 memory controllers, one per NUMA domain, it has
only 2 memory controllers enabled. This means that only 2 of the
4 NUMA domains can be populated with physical memory, and the
others are empty.

Add support to FreeBSD for empty NUMA domains by:

- creating empty memory domains when parsing the SRAT table,
rather than failing to parse the table
- not running the pageout daemon threads in empty domains
- adding defensive code to UMA to avoid allocating from empty domains
- adding defensive code to cpuset to avoid binding to an empty domain
Thanks to Jeff for suggesting this strategy.

Reviewed by: alc, markj
Approved by: re (gjb@)
Differential Revision: https://reviews.freebsd.org/D1683


# 26fe2217 18-Sep-2018 Mark Johnston <markj@FreeBSD.org>

Only update the domain cursor once in keg_fetch_slab().

We drop the keg lock when we go to actually allocate the slab, allowing
other threads to advance the cursor. This can cause us to exit the
round-robin loop before having attempted allocations from all domains,
resulting in a hang during a subsequent blocking allocation attempt from
a depleted domain.

Reported and tested by: Jan Bramkamp <crest@bultmann.eu>
Reviewed by: alc, cem
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17209


# 19fa89e9 25-Aug-2018 Mark Murray <markm@FreeBSD.org>

Remove the Yarrow PRNG algorithm option in accordance with due notice
given in random(4).

This includes updating of the relevant man pages, and no-longer-used
harvesting parameters.

Ensure that the pseudo-unit-test still does something useful, now also
with the "other" algorithm instead of Yarrow.

PR: 230870
Reviewed by: cem
Approved by: so(delphij,gtetlow)
Approved by: re(marius)
Differential Revision: https://reviews.freebsd.org/D16898


# 49bfa624 25-Aug-2018 Alan Cox <alc@FreeBSD.org>

Eliminate the arena parameter to kmem_free(). Implicitly this corrects an
error in the function hypercall_memfree(), where the wrong arena was being
passed to kmem_free().

Introduce a per-page flag, VPO_KMEM_EXEC, to mark physical pages that are
mapped in kmem with execute permissions. Use this flag to determine which
arena the kmem virtual addresses are returned to.

Eliminate UMA_SLAB_KRWX. The introduction of VPO_KMEM_EXEC makes it
redundant.

Update the nearby comment for UMA_SLAB_KERNEL.

Reviewed by: kib, markj
Discussed with: jeff
Approved by: re (marius)
Differential Revision: https://reviews.freebsd.org/D16845


# 83a90bff 21-Aug-2018 Alan Cox <alc@FreeBSD.org>

Eliminate kmem_malloc()'s unused arena parameter. (The arena parameter
became unused in FreeBSD 12.x as a side-effect of the NUMA-related
changes.)

Reviewed by: kib, markj
Discussed with: jeff, re@
Differential Revision: https://reviews.freebsd.org/D16825


# 067fd858 18-Aug-2018 Alan Cox <alc@FreeBSD.org>

Eliminate the arena parameter to kmem_malloc_domain(). It is redundant.
The domain and flags parameters suffice. In fact, the related functions
kmem_alloc_{attr,contig}_domain() don't have an arena parameter.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D16713


# efb6d4a4 12-Jul-2018 Mateusz Guzik <mjg@FreeBSD.org>

uma: whack main zone counter update in the slow path, freeing side

See r333052.


# 013072f0 09-Jul-2018 Mark Johnston <markj@FreeBSD.org>

Fix pre-SI_SUB_CPU initialization of per-CPU counters.

r336020 introduced pcpu_page_alloc(), replacing page_alloc() as the
backend allocator for PCPU UMA zones. Unlike page_alloc(), it does
not honour malloc(9) flags such as M_ZERO or M_NODUMP, so fix that.

r336020 also changed counter(9) to initialize each counter using a
CPU_FOREACH() loop instead of an SMP rendezvous. Before SI_SUB_CPU,
smp_rendezvous() will only execute the callback on the current CPU
(i.e., CPU 0), so only one counter gets zeroed. The rest are zeroed
by virtue of the fact that UMA gratuitously zeroes slabs when importing
them into a zone.

Prior to SI_SUB_CPU, all_cpus is clear, so with r336020 we weren't
zeroing vm_cnt counters during boot: the CPU_FOREACH() loop had no
effect, and pcpu_page_alloc() didn't honour M_ZERO. Fix this by
iterating over the full range of CPU IDs when zeroing counters,
ignoring whether the corresponding bits in all_cpus are set.

Reported and tested by: pho (previous version)
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D16190
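
A sketch of the idea behind the fix, with zpcpu_get_cpu() assumed here as
the per-CPU slot accessor:

/* Before SI_SUB_CPU all_cpus is still clear, so CPU_FOREACH() visits
 * nothing; walk every possible CPU ID instead. */
for (cpu = 0; cpu <= mp_maxid; cpu++)
	*zpcpu_get_cpu(c, cpu) = 0;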


# a03af342 07-Jul-2018 Sean Bruno <sbruno@FreeBSD.org>

Wrap the declaration and assignment of "stripe" with #ifdef NUMA, as not
all targets are NUMA-aware.

Found with gcc.

Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16113


# ab3059a8 05-Jul-2018 Matt Macy <mmacy@FreeBSD.org>

Back pcpu zone with domain correct pages

- Change pcpu zone consumers to use a stride size of PAGE_SIZE.
(defined as UMA_PCPU_ALLOC_SIZE to make future identification easier)

- Allocate page from the correct domain for a given cpu.

- Don't initialize pc_domain to non-zero value if NUMA is not defined
There are some misconceptions surrounding this field. It is the
_VM_ NUMA domain and should only ever correspond to valid domain
values as understood by the VM.

The former slab size of sizeof(struct pcpu) was somewhat arbitrary.
The new value is PAGE_SIZE because that's the smallest granularity
which the VM can allocate a slab for a given domain. If you have
fewer than PAGE_SIZE/8 counters on your system there will be some
memory wasted, but this is obviously something where you want the
cache line to be coming from the correct domain.

Reviewed by: jeff
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15933


# c5b7751f 22-Jun-2018 Ian Lepore <ian@FreeBSD.org>

Eliminate a spurious panic on non-SMP systems (occurred on shutdown/reboot).


# b4799947 21-Jun-2018 Ruslan Bukin <br@FreeBSD.org>

Fix uma_zalloc_pcpu_arg() operation in case of !SMP build.

Reviewed by: mjg
Sponsored by: DARPA, AFRL


# 0766f278 13-Jun-2018 Jonathan T. Looney <jtl@FreeBSD.org>

Make UMA and malloc(9) return non-executable memory in most cases.

Most kernel memory that is allocated after boot does not need to be
executable. There are a few exceptions. For example, kernel modules
do need executable memory, but they don't use UMA or malloc(9). The
BPF JIT compiler also needs executable memory and did use malloc(9)
until r317072.

(Note that a side effect of r316767 was that the "small allocation"
path in UMA on amd64 already returned non-executable memory. This
meant that some calls to malloc(9) or the UMA zone(9) allocator could
return executable memory, while others could return non-executable
memory. This change makes the behavior consistent.)

This change makes malloc(9) return non-executable memory unless the new
M_EXEC flag is specified. After this change, the UMA zone(9) allocator
will always return non-executable memory, and a KASSERT will catch
attempts to use the M_EXEC flag to allocate executable memory using
uma_zalloc() or its variants.

Allocations that do need executable memory have various choices. They
may use the M_EXEC flag to malloc(9), or they may use a different VM
interface to obtain executable pages.

Now that malloc(9) again allows executable allocations, this change also
reverts most of r317072.

PR: 228927
Reviewed by: alc, kib, markj, jhb (previous version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D15691
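
For illustration, the opt-in for the rare consumer that really does need
executable memory (M_TEMP chosen arbitrarily):

/* Default malloc(9) memory is now non-executable; ask explicitly. */
p = malloc(len, M_TEMP, M_WAITOK | M_EXEC);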


# 4e180881 08-Jun-2018 Mateusz Guzik <mjg@FreeBSD.org>

uma: implement provisional api for per-cpu zones

Per-cpu zone allocations are very rarely done compared to regular zones.
The intent is to avoid pessimizing the latter case with per-cpu specific
code.

In particular, contrary to the claim in r334824, M_ZERO is sometimes used
for such zones. But the zeroing method is completely different, and
branching on it in the fast path for regular zones is a waste of time.
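
For illustration, a sketch of the provisional per-cpu API, pairing a
UMA_ZONE_PCPU zone with the uma_zalloc_pcpu() and zpcpu_get() accessors:

z = uma_zcreate("per-cpu example", sizeof(uint64_t), NULL, NULL,
    NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_PCPU);
base = uma_zalloc_pcpu(z, M_WAITOK);	/* one slot per CPU */
counter = zpcpu_get(base);		/* the current CPU's slot */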


# b8af2820 07-Jun-2018 Mateusz Guzik <mjg@FreeBSD.org>

uma: fix up r334824

Turns out there is code which ends up passing M_ZERO to counters.
Since counters zero unconditionally on their own, just drop the
flag in that place.


# ea99223e 07-Jun-2018 Mateusz Guzik <mjg@FreeBSD.org>

uma: remove M_ZERO support for pcpu zones

Nothing in the tree uses it and pcpu zones have a fundamentally different use
case than the regular zones - they are not supposed to be allocated and freed
all the time.

This reduces pollution in the allocation fast path.


# c5deaf04 07-Jun-2018 Gleb Smirnoff <glebius@FreeBSD.org>

UMA memory debugging enabled with INVARIANTS consists of two things:
trashing freed memory and checking that allocated memory is properly
trashed, and keeping a bitset of freed items. Trashing/checking
creates a lot of CPU cache poisoning, while keeping the debugging bitsets
consistent creates a lot of contention on the UMA zone lock(s). The
performance difference between an INVARIANTS kernel and a normal one is
mostly attributable to UMA debugging, rather than to all the KASSERT
checks in the kernel.

Add a loader tunable vm.debug.divisor that allows either turning off UMA
debugging completely, or turning it on only for a fraction of allocations,
while still running all KASSERTs in the kernel. That allows running
INVARIANTS kernels in production environments without reducing load by
orders of magnitude, while still doing useful extra checks.

The default value is 1, meaning debug every allocation. A value of 0
disables UMA debugging completely. Values above 1 enable debugging only
for every N-th item. It isn't possible to strictly follow the number, but
the amount of debugging is still reduced by roughly a fraction of (N-1)/N;
e.g. with N = 10, about 90% of allocations skip the debug checks.

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D15199


# e825ab8d 26-Apr-2018 Mateusz Guzik <mjg@FreeBSD.org>

uma: whack main zone counter update in the slow path

Cached counters are typically zero at this point so it performs
avoidable atomics. Everything reading them also reads the cached
ones, thus there is really no point.

Reviewed by: jeff


# 7e28037a 24-Apr-2018 Mark Johnston <markj@FreeBSD.org>

Add a UMA zone flag to disable the use of buckets.

This allows the creation of zones which don't do any caching in front of
the keg. If the zone is a cache zone, this means that UMA will not
attempt any memory allocations when allocating an item from the backend.
This is intended for use after a panic by netdump, but likely has other
applications.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D15184
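
A sketch, assuming the flag is spelled UMA_ZONE_NOBUCKET (the zone name is
hypothetical); allocations then always go straight to the keg or, for a
cache zone, to the import function:

z = uma_zcreate("panic-safe", size, NULL, NULL, NULL, NULL,
    UMA_ALIGN_PTR, UMA_ZONE_NOBUCKET);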


# b92b26ad 01-Apr-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Use UMA_SLAB_SPACE macro. No functional change here.


# 96a10340 01-Apr-2018 Gleb Smirnoff <glebius@FreeBSD.org>

In uma_startup_count() handle the special case when a zone would fit into
a single slab, but with the alignment adjustment it won't. Again, when
there is only one item in a slab, alignment can be ignored. See the
previous revision of this file for more info.

PR: 227116


# 1ca6ed45 01-Apr-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Handle a special case when a slab can fit only one allocation,
and the zone has a large alignment. With alignment taken into
account, uk_rsize will be greater than the space in a slab. However,
since we have only one item per slab, it is always naturally
aligned.

Code that will panic before this change with 4k page:

z = uma_zcreate("test", 3984, NULL, NULL, NULL, NULL, 31, 0);
uma_zalloc(z, M_WAITOK);

A practical scenario to hit the panic is a machine with 56 CPUs
and 2 NUMA domains, which yields a zone size of 3984.

PR: 227116
MFC after: 2 weeks


# e8bb2dc7 31-Mar-2018 Jeff Roberson <jeff@FreeBSD.org>

Add the flag ZONE_NOBUCKETCACHE. This flag instructs UMA not to keep
a cache of fully populated buckets. This will be used in a follow-on
commit.

The flag idea was originally from markj.

Reviewed by: markj, kib
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon


# 63b5d112 24-Mar-2018 Konstantin Belousov <kib@FreeBSD.org>

For the vm_zone_stats() sysctl handler, do not drain the sbuf by calling
copyout(9) while owning the zone lock.

Even though the sysctl buffer is wired when there is an old value,
spurious faults might still occur.

Note that we still own the uma_rwlock there, but this lock does not
participate in sensitive lock orders.

Reported and tested by: pho (as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# f7d35785 08-Feb-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Fix boot_pages exhaustion on machines with many domains and cores, where
the size of the UMA zone allocation is greater than the page size. In this
case the zone of zones cannot use UMA_MD_SMALL_ALLOC, and we need to
postpone switching this zone off of startup_alloc() until the VM is fully
launched.

o Always supply the number of VM zones to uma_startup_count(). On machines
with UMA_MD_SMALL_ALLOC ignore it completely, unless zsize goes over
a page. In the latter case count the VM zones toward the number of
allocations from the zone of zones.
o Rewrite startup_alloc() so that it will immediately switch any zone
that is already capable of running a real allocator away from itself.
In the worst-case scenario we may leak a single page here. See the comment
in uma_startup_count().
o Hardcode the call to uma_startup2() into vm_mem_init(). Otherwise some
extra SYSINITs, e.g. vm_page_init(), may sneak in before.
o While here, remove uma_boot_pages_mtx. With recent changes to boot
pages calculation, we are guaranteed to use all of the boot_pages
in the early single-threaded stage.

Reported & tested by: mav


# 5073a083 07-Feb-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Fix three miscalculations in amount of boot pages:

o Most of the startup zones have struct uma_slab embedded into the slab,
so provide the macro UMA_SLAB_SPACE and use it instead of UMA_SLAB_SIZE
when calculating how many pages a certain kind of allocation would
require. Some zones are offpage, so we might have a positive inaccuracy.
o The keg for the zone of zones is allocated "dynamically", so we
need +1 when calculating the number of pages for kegs. [1]
o The zones of zones and zones of kegs have an arbitrary alignment of 32,
and this also needs to be accounted for. [2]

While here, spread more comments and improve diagnostic messages.

Reported by: pho [1], jtl [2]


# d2be4a1e 06-Feb-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Use correct arithmetic to calculate how many pages we need for kegs
and hashes. There is no functional change with current sizes.


# e2068d0b 06-Feb-2018 Jeff Roberson <jeff@FreeBSD.org>

Use per-domain locks for vm page queue free. Move paging control from
global to per-domain state. Protect reservations with the free lock
from the domain that they belong to. Refactor to make vm domains more
of a first class object.

Reviewed by: markj, kib, gallatin
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14000


# 1616767d 06-Feb-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Improve DIAGNOSTIC printf. Report using a boot page every time regardless
of booted status.


# f4bef67c 05-Feb-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Followup on r302393 by cperciva, improving calculation of boot pages required
for UMA startup.

o Introduce another stage of UMA startup, which is entered after
vm_page_startup() finishes. After this stage we don't yet enable buckets,
but we can ask VM for pages. Rename stages to meaningful names while here.
New list of stages: BOOT_COLD, BOOT_STRAPPED, BOOT_PAGEALLOC, BOOT_BUCKETS,
BOOT_RUNNING.
Enabling page alloc earlier allows us to dramatically reduce the number of
boot pages required. More importantly, the number of zones becomes
consistent across different machines, as no MD allocations are done before
the BOOT_PAGEALLOC stage. Now only UMA-internal zones actually need to use
startup_alloc(); however, that may change, so vm_page_startup() provides
its need for early zones as an argument.
o Introduce the uma_startup_count() function, to avoid code duplication.
The function calculates the sizes of the zones zone and the kegs zone, and
how many pages UMA will need to bootstrap.
It counts not only zone structures, but also kegs, slabs and hashes.
o Hide the uma_startup_foo() declarations from the public file.
o Provide several DIAGNOSTIC printfs on boot_pages usage.
o Bugfix: when calculating the zone of zones size use (mp_maxid + 1)
instead of mp_ncpus. Use the resulting number not only in the size argument
to zone_ctor() but also as args.size.

Reviewed by: imp, gallatin (earlier version)
Differential Revision: https://reviews.freebsd.org/D14054


# b6715dab 13-Jan-2018 Jeff Roberson <jeff@FreeBSD.org>

Move VM_NUMA_ALLOC and DEVICE_NUMA under the single global config option NUMA.

Sponsored by: Netflix, Dell/EMC Isilon
Discussed with: jhb


# ab3185d1 12-Jan-2018 Jeff Roberson <jeff@FreeBSD.org>

Implement NUMA support in uma(9) and malloc(9). Allocations from specific
domains can be done by the _domain() API variants. UMA also supports a
first-touch policy via the NUMA zone flag.

The slab layer is now segregated by VM domains and is precise. It handles
iteration for round-robin directly. The per-cpu cache layer remains
a mix of domains according to where memory is allocated and freed. Well
behaved clients can achieve perfect locality with no performance penalty.

The direct domain allocation functions have to visit the slab layer and
so require per-zone locks which come at some expense.

Reviewed by: Attilio (a slightly older version)
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon


# ad5b0f5b 01-Jan-2018 Jeff Roberson <jeff@FreeBSD.org>

Fix arc after r326347 broke various memory limit queries. Use UMA features
rather than kmem arena size to determine available memory.

Initialize the UMA limit to LONG_MAX to avoid spurious wakeups on boot before
the real limit is set.

PR: 224330 (partial), 224080
Reviewed by: markj, avg
Sponsored by: Netflix / Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13494


# 200f8117 19-Dec-2017 Konstantin Belousov <kib@FreeBSD.org>

Perform all accesses to uma_reclaim_needed using atomic(9) KPI.

Reviewed by: alc, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D13534


# 952a29c0 07-Dec-2017 Mark Johnston <markj@FreeBSD.org>

Fix the UMA reclaim worker after r326347.

atomic_set_*() sets a bit in the target memory location, so
atomic_set_int(&uma_reclaim_needed, 0) does not do what it looks like
it does.

PR: 224080
Reviewed by: jeff, kib
Differential Revision: https://reviews.freebsd.org/D13412


# 2e47807c 28-Nov-2017 Jeff Roberson <jeff@FreeBSD.org>

Eliminate kmem_arena and kmem_object in preparation for further NUMA commits.

The arena argument to kmem_*() is now only used in an assert. A follow-up
commit will remove the argument altogether before we freeze the API for the
next release.

This replaces the hard limit on kmem size with a soft limit imposed by UMA.
When the soft limit is exceeded we periodically wake up the UMA reclaim
thread to attempt to shrink KVA. On 32-bit architectures this should behave
much more gracefully as we exhaust KVA. On 64-bit the limits are likely
never hit.

Reviewed by: markj, kib (some objections)
Discussed with: alc
Tested by: pho
Sponsored by: Netflix / Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13187


# fe267a55 27-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: general adoption of SPDX licensing ID tags.

Mainly focus on files that use the BSD 2-Clause license; however, the tool
I was using misidentified many licenses, so this was mostly a manual, and
error-prone, task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well-known
open-source licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.

No functional change intended.


# 772c8b67 08-Nov-2017 Konstantin Belousov <kib@FreeBSD.org>

Fix operator priority.

Sponsored by: The FreeBSD Foundation


# 8d6fbbb8 07-Nov-2017 Jeff Roberson <jeff@FreeBSD.org>

Replace many instances of VM_WAIT with blocking page allocation flags
similar to the kernel memory allocator.

This simplifies NUMA allocation because the domain will be known at wait
time and races between failure and sleeping are eliminated. This also
reduces boilerplate code and simplifies callers.

A wait primitive is supplied for uma zones for similar reasons. This
eliminates some non-specific VM_WAIT calls in favor of more explicit
sleeps that may be satisfied without new pages.

Reviewed by: alc, kib, markj
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon


# 2934eb8a 13-Sep-2017 Mark Johnston <markj@FreeBSD.org>

Fix a logic error in the item size calculation for internal UMA zones.

Kegs for internal zones always keep the slab header in the slab itself.
Therefore, when determining the allocation size, we need to take the
slab header size into account.

Reported and tested by: ae, rakuco
Reviewed by: avg
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D12342


# fe933c1d 06-Sep-2017 Mateusz Guzik <mjg@FreeBSD.org>

Start annotating global _padalign locks with __exclusive_cache_line

While these locks are guaranteed not to share their respective cache lines,
their current placement leaves unnecessary holes in the lines which precede
them.

For instance, the annotation of vm_page_queue_free_mtx allows two
neighbouring cache lines (previously separated by the lock) to be collapsed
into one.

The annotation is only effective on architectures which have it implemented
in their linker script (currently only amd64). Thus locks are not converted
to their non-padaligned variants, so as not to affect the rest.

MFC after: 1 week


# 77e19437 08-Jun-2017 Gleb Smirnoff <glebius@FreeBSD.org>

When we are in UMA_STARTUP use startup_alloc() for any zone, not only for
internal zones. This allows creating new zones at early stages of boot,
without the need to mark them as internal to UMA, which isn't always true.

Reviewed by: alc


# 1431a748 01-Jun-2017 Gleb Smirnoff <glebius@FreeBSD.org>

As old prophecy says, some day UMA_DEBUG printfs shall be made CTRs.


# ac0a6fd0 01-Jun-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Simplify boot pages management in UMA.

It is simply a contiguous virtual memory pointer and a number of pages.
There is no need to build a linked list here; just increment the pointer
and decrement the counter. The only functional difference from the old
allocator is that before we gave pages from topmost down to lowest, and
now we give them in normal ascending order.

While here remove padalign from a mutex that is unused at runtime.

Reviewed by: alc


# a5a35578 04-Apr-2017 John Baldwin <jhb@FreeBSD.org>

Assert that the align parameter to uma_zcreate() is valid.

Reviewed by: kib
MFC after: 1 week
Sponsored by: DARPA / AFRL
Differential Revision: https://reviews.freebsd.org/D10100


# 57223e99 11-Mar-2017 Andriy Gapon <avg@FreeBSD.org>

uma: fix pages <-> items conversions in several places

Those places were not taking uk_ppera into account.
At present one allocation is always used by one slab, so uk_ppera must
be used to convert between pages and slabs.
uk_ipers is used to convert between slabs and items.

MFC after: 1 month (if ever)


# a55ebb7c 11-Mar-2017 Andriy Gapon <avg@FreeBSD.org>

uma: eliminate uk_slabsize field

The field was not used beyond the initial keg setup stage anyway.

MFC after: 1 month (if ever)


# 9b43bc27 25-Feb-2017 Andriy Gapon <avg@FreeBSD.org>

call vm_lowmem hook in uma_reclaim_worker

A comment near kmem_reclaim() implies that we already did that.
Calling the hook is useful, because some handlers, e.g. ARC,
might be able to release significant amounts of KVA.

Now that we have more than one place where the vm_lowmem hook is called,
use this change as an opportunity to introduce flags that describe
the reason for calling the hook. No handler makes use of the flags yet.

Reviewed by: markj, kib
MFC after: 1 week
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D9764
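
For illustration, a sketch of a handler hooked up via EVENTHANDLER(9); the
handler name is hypothetical and the flags parameter is the one introduced
here:

static void
foo_lowmem(void *arg __unused, int flags)
{
	/* Release caches/KVA; 'flags' describes why the hook fired. */
}

/* Registered at init time: */
EVENTHANDLER_REGISTER(vm_lowmem, foo_lowmem, NULL, EVENTHANDLER_PRI_FIRST);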


# b5345ef1 02-Jan-2017 Justin Hibbits <jhibbits@FreeBSD.org>

Print flags in hex instead of decimal.

Hex is easier to grok for flags, and consistent with other prints.


# 829be516 20-Oct-2016 Mark Johnston <markj@FreeBSD.org>

Simplify keg_drain() a bit by using LIST_FOREACH_SAFE.

MFC after: 1 week


# afa5d703 19-Jul-2016 Mark Johnston <markj@FreeBSD.org>

Release the second critical section in uma_zfree_arg() slightly earlier.

It is only needed when removing a full bucket from the per-CPU cache. The
bucket cache (uz_buckets) is protected by the zone mutex and thus the
critical section can be released before inserting into that list.

MFC after: 1 week


# 96c85efb 06-Jul-2016 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

Replace a number of conflations of mp_ncpus and mp_maxid with either
mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in
the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try,
for example, to run tasks on CPUs that did not exist or to allocate too
few buffers on systems with sparse CPU IDs in which there are holes in the
range and mp_maxid > mp_ncpus. Such circumstances generally occur on
systems with SMT, but on which SMT is disabled. This patch restores system
operation at least on POWER8 systems configured in this way.

There are a number of other places in the kernel with potential problems
in these situations, but where sparse CPU IDs are not currently known
to occur, mostly in the ARM machine-dependent code. These will be fixed
in a follow-up commit after the stable/11 branch.

PR: kern/210106
Reviewed by: jhb
Approved by: re (glebius)


# bc9d08e1 01-Jun-2016 Mark Johnston <markj@FreeBSD.org>

Fix memguard(9) in kernels with INVARIANTS enabled.

With r284861, UMA zones use the trash ctor and dtor by default. This is
incompatible with memguard, which frees the backing page when the item
is freed. Modify the UMA debug functions to be no-ops if the item was
allocated from memguard. This also fixes constructors such as
mb_ctor_pack(), which invokes the trash ctor in addition to performing
some initialization.

Reviewed by: glebius
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D6562


# 763df3ec 02-May-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/vm: minor spelling fixes in comments.

No functional change.


# cfcae3f8 29-Feb-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Remove UMA_ZONE_REFCNT feature, now unused.

Blessed by: jeff


# e60b2fcb 03-Feb-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Redo r292484. Embed task(9) into zone, so that uz_maxaction is called
in a context that can sleep, allowing consumers of the KPI to run their
drain routines without any extra measures.

Discussed with: jtl


# 9542ea7b 03-Feb-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Move uma_dbg_alloc() and uma_dbg_free() into uma_core.c, which allows
to make uma_dbg.h not depend on uma_int.h, which allows to uninclude
uma_int.h from the mbuf(9) allocator.


# 54503a13 19-Dec-2015 Jonathan T. Looney <jtl@FreeBSD.org>

Add a safety net to reclaim mbufs when one of the mbuf zones become
exhausted.

It is possible for a bug in the code (or, theoretically, even unusual
network conditions) to exhaust all possible mbufs or mbuf clusters.
When this occurs, things can grind to a halt fairly quickly. However,
we currently do not call mb_reclaim() unless the entire system is
experiencing a low-memory condition.

While it is best to try to prevent exhaustion of one of the mbuf zones,
it would also be useful to have a mechanism to attempt to recover from
these situations by freeing "expendable" mbufs.

This patch makes two changes:

a) The patch adds a generic API to the UMA zone allocator to set a
function that should be called when an allocation fails because the
zone limit has been reached. Because of the way this function can be
called, it really should do minimal work.

b) The patch uses this API to try to free mbufs when an allocation
fails from one of the mbuf zones because the zone limit has been
reached. The function schedules a callout to run mb_reclaim().

Differential Revision: https://reviews.freebsd.org/D3864
Reviewed by: gnn
Comments by: rrs, glebius
MFC after: 2 weeks
Sponsored by: Juniper Networks
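
For illustration, a sketch of the new API wired to an mbuf-style callback;
the callback name is hypothetical and its exact signature is assumed:

static void
mbuf_limit_reached(uma_zone_t zone)
{
	/* Minimal work only: schedule a callout that runs mb_reclaim(). */
}

/* At zone setup time: */
uma_zone_set_maxaction(zone, mbuf_limit_reached);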


# d9e2e68d 11-Dec-2015 Mark Johnston <markj@FreeBSD.org>

Don't make assertions about td_critnest when the scheduler is stopped.

A panicking thread always executes with a critical section held, so any
attempt to allocate or free memory while dumping will otherwise cause a
second panic. This can occur, for example, if xpt_polled_action() completes
non-dump I/O that was pending at the time of the panic. The fact that this
can occur is itself a bug, but asserting in this case does little but
reduce the reliability of kernel dumps.

Suggested by: kib
Reported by: pho


# 1067a2ba 19-Nov-2015 Jonathan T. Looney <jtl@FreeBSD.org>

Consistently enforce the restriction against calling malloc/free when in a
critical section.

uma_zalloc_arg()/uma_zfree_arg() may acquire a sleepable lock on the
zone. The malloc() family of functions may call uma_zalloc_arg() or
uma_zfree_arg().

The malloc(9) man page currently claims that free() will never sleep.
It also implies that the malloc() family of functions will not sleep
when called with M_NOWAIT. However, it is more correct to say that
these functions will not sleep indefinitely. Indeed, they may acquire
a sleepable lock. However, a developer may overlook this restriction
because the WITNESS check that catches attempts to call the malloc()
family of functions within a critical section is inconsistently
applied.

This change updates the language of the malloc(9) man page to clarify
the restriction against calling the malloc() family of functions
while in a critical section or holding a spin lock. It also adds
KASSERTs at appropriate points to make the enforcement of this
restriction more consistent.

PR: 204633
Differential Revision: https://reviews.freebsd.org/D4197
Reviewed by: markj
Approved by: gnn (mentor)
Sponsored by: Juniper Networks


# 087a6132 26-Sep-2015 Alan Cox <alc@FreeBSD.org>

Exploit r288122 to address a cosmetic issue. Since the pages allocated
by noobj_alloc() don't belong to a vm object, they can't be paged out.
Since they can't be paged out, they are never enqueued in a paging queue.
Nonetheless, passing PQ_INACTIVE to vm_page_unwire() creates the appearance
that these pages are being enqueued in the inactive queue. As of r288122,
we can avoid giving this false impression by passing PQ_NONE.
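
The change reduces to picking the right queue argument:

/* These pages have no object and are never paged out. */
vm_page_unwire(m, PQ_NONE);	/* previously PQ_INACTIVE */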

Submitted by: kmacy
Differential Revision: https://reviews.freebsd.org/D1674


# 19c591bf 02-Sep-2015 Mateusz Guzik <mjg@FreeBSD.org>

Don't trash memory from UMA_ZONE_NOFREE zones.

Objects obtained from such zones are supposed to retain type stability,
which was violated by aforementioned trashing.

This is a follow-up to r284861.

Discussed with: kib


# e866d8f0 21-Aug-2015 Mark Murray <markm@FreeBSD.org>

Make the UMA harvesting go away completely if not wanted. Default to "not wanted".
Provide and document the RANDOM_ENABLE_UMA option.

Change RANDOM_FAST to RANDOM_UMA to clarify the harvesting.

Remove RANDOM_DEBUG option, replace with SDT probes. These will be of
use to folks measuring the harvesting effect when deciding whether to
use RANDOM_ENABLE_UMA.

Requested by: scottl and others.
Approved by: so (/dev/random blanket)
Differential Revision: https://reviews.freebsd.org/D3197


# 9ba30bcb 10-Aug-2015 Zbigniew Bodek <zbb@FreeBSD.org>

Avoid sign extension of value passed to kva_alloc from uma_zone_reserve_kva

Fixes "panic: vm_radix_reserve_kva: unable to reserve KVA" caused by sign
extension of the "pages * UMA_SLAB_SIZE" value passed to kva_alloc(), which
takes unsigned long argument.

In the erroneous case that triggered this bug, the number of pages
to allocate in uma_zone_reserve_kva() was 0x8ebe6, which gave a
total number of bytes to allocate equal to 0x8ebe6000 (int).
This was then sign extended in kva_alloc() to 0xffffffff8ebe6000
(unsigned long).
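
A sketch of the bug and the fix (the cast placement is illustrative):

/*
 * Broken: the product is computed as a 32-bit int and then
 * sign-extended, so 0x8ebe6 * 4096 = 0x8ebe6000 becomes
 * 0xffffffff8ebe6000 when widened to unsigned long.
 */
kva = kva_alloc(pages * UMA_SLAB_SIZE);

/* Fixed: widen before multiplying. */
kva = kva_alloc((vm_size_t)pages * UMA_SLAB_SIZE);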

Reviewed by: alc, kib
Submitted by: Zbigniew Bodek <zbb@semihalf.com>
Obtained from: Semihalf
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D3346


# d1b06863 30-Jun-2015 Mark Murray <markm@FreeBSD.org>

Huge cleanup of random(4) code.

* GENERAL
- Update copyright.
- Make kernel options for RANDOM_YARROW and RANDOM_DUMMY. Set
neither to ON, which means we want Fortuna.
- If there is no 'device random' in the kernel, there will be NO
random(4) device in the kernel, and the KERN_ARND sysctl will
return nothing. With RANDOM_DUMMY there will be a random(4) that
always blocks.
- Repair kern.arandom (KERN_ARND sysctl). The old version went
through arc4random(9) and was a bit weird.
- Adjust arc4random stirring a bit - the existing code looks a little
suspect.
- Fix the nasty pre- and post-read overloading by providing explicit
functions to do these tasks.
- Redo read_random(9) so as to duplicate random(4)'s read internals.
This makes it a first-class citizen rather than a hack.
- Move stuff out of locked regions when it does not need to be
there.
- Trim RANDOM_DEBUG printfs. Some are excess to requirement, some
behind boot verbose.
- Use SYSINIT to sequence the startup.
- Fix init/deinit sysctl stuff.
- Make relevant sysctls also tunables.
- Add different harvesting "styles" to allow for different requirements
(direct, queue, fast).
- Add harvesting of FFS atime events. This needs to be checked to ensure
it does not weigh down the FS code.
- Add harvesting of slab allocator events. This needs to be checked to
ensure it does not weigh down the allocator code.
- Fix the random(9) manpage.
- Loadable modules are not present for now. These will be re-engineered
when the dust settles.
- Use macros for locks.
- Fix comments.

* src/share/man/...
- Update the man pages.

* src/etc/...
- The startup/shutdown work is done in D2924.

* src/UPDATING
- Add UPDATING announcement.

* src/sys/dev/random/build.sh
- Add copyright.
- Add libz for unit tests.

* src/sys/dev/random/dummy.c
- Remove; no longer needed. Functionality incorporated into randomdev.*.

* live_entropy_sources.c live_entropy_sources.h
- Remove; content moved.
- move content to randomdev.[ch] and optimise.

* src/sys/dev/random/random_adaptors.c src/sys/dev/random/random_adaptors.h
- Remove; plugability is no longer used. Compile-time algorithm
selection is the way to go.

* src/sys/dev/random/random_harvestq.c src/sys/dev/random/random_harvestq.h
- Add early (re)boot-time randomness caching.

* src/sys/dev/random/randomdev_soft.c src/sys/dev/random/randomdev_soft.h
- Remove; no longer needed.

* src/sys/dev/random/uint128.h
- Provide a fake uint128_t; if a real one ever arrives, we can use
that instead. All that is needed here is N=0, N++, N==0, and some
localised trickery is used to manufacture a 128-bit 0ULLL.

* src/sys/dev/random/unit_test.c src/sys/dev/random/unit_test.h
- Improve unit tests; previously the testing human needed clairvoyance;
now the test will do a basic check of compressibility. Clairvoyant
talent is still a good idea.
- This is still a long way off a proper unit test.

* src/sys/dev/random/fortuna.c src/sys/dev/random/fortuna.h
- Improve messy union to just uint128_t.
- Remove unneeded 'static struct fortuna_start_cache'.
- Tighten up arithmetic.
- Provide a method to allow external junk to be introduced; harden
it against blatant abuse by compressing/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explicit
functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])

* src/sys/dev/random/yarrow.c src/sys/dev/random/yarrow.h
- Improve messy union to just uint128_t.
- Remove unneeded 'static struct start_cache'.
- Tighten up arithmetic.
- Provide a method to allow external junk to be introduced; harden
it against blatant abuse by compressing/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explicit
functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])
- Fix some magic numbers elsewhere used as FAST and SLOW.

Differential Revision: https://reviews.freebsd.org/D2025
Reviewed by: vsevolod,delphij,rwatson,trasz,jmg
Approved by: so (delphij)


# afc6dc36 25-Jun-2015 John-Mark Gurney <jmg@FreeBSD.org>

If INVARIANTS is specified, add ctor/dtor to junk memory if they are
unspecified...

Submitted by: Suresh Gumpula at Netapp
Differential Revision: https://reviews.freebsd.org/D2725


# fd90e2ed 22-May-2015 Jung-uk Kim <jkim@FreeBSD.org>

CALLOUT_MPSAFE has lost its meaning since r141428, i.e., for more than ten
years on head. However, it continues to be misused as the mpsafe argument
for callout_init(9). Deprecate the flag and clean up callout_init() calls
to make them more consistent.
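
The cleanup is mechanical; a before/after sketch using UMA's own timer
callout:

callout_init(&uma_callout, CALLOUT_MPSAFE);	/* before */
callout_init(&uma_callout, 1);			/* after */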

Differential Revision: https://reviews.freebsd.org/D2613
Reviewed by: jhb
MFC after: 2 weeks


# 44ec2b63 09-May-2015 Konstantin Belousov <kib@FreeBSD.org>

The vmem callback to reclaim kmem arena address space on low or
fragmented conditions currently just wakes up the pagedaemon. The
kmem arena is significantly smaller than the total available physical
memory, which means that there are loads where kmem arena space could
be exhausted while plenty of pages are still available. The woken-up
pagedaemon sees vm_pages_needed != 0, finds that vm_paging_needed() is
false, clears the pass and goes back to sleep, calling neither
uma_reclaim() nor the lowmem handler.

To handle low kmem arena conditions, create additional pagedaemon
thread which calls uma_reclaim() directly. The thread sleeps on the
dedicated channel and kmem_reclaim() wakes the thread in addition to
the pagedaemon.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# d74e6a1d 20-Apr-2015 Alan Cox <alc@FreeBSD.org>

Eliminate an unused variable.

MFC after: 1 week


# 51cfb0be 12-Apr-2015 Dmitry Chagin <dchagin@FreeBSD.org>

Rework r281162. Indeed, the flexible array member is preferable here.

Suggested by: Justin T. Gibbs

MFC after: 3 days


# 16be9f54 10-Apr-2015 Gleb Smirnoff <glebius@FreeBSD.org>

A UMA zone limit can be lowered, so remove the protection against
lowering it from sysctl_handle_uma_zone_max().

Sponsored by: Nginx, Inc.


# 6723fdfe 06-Apr-2015 Dmitry Chagin <dchagin@FreeBSD.org>

Properly calculate the "UMA Zones" per-CPU cache size. Avoid allocating
an extra struct uma_cache, since struct uma_zone already has one.

PR: 199169
Submitted by: luke.tw gmail com
MFC after: 1 week


# 1d2c0c46 05-Apr-2015 Dmitry Chagin <dchagin@FreeBSD.org>

Fix a wrong KASSERT message in UMA.

PR: 199172
Submitted by: luke.tw gmail com
MFC after: 1 week


# f2c2231e 31-Mar-2015 Ryan Stone <rstone@FreeBSD.org>

Fix integer truncation bug in malloc(9)

A couple of internal functions used by malloc(9) and uma truncated
a size_t down to an int. This could cause any number of issues
(e.g. indefinite sleeps, memory corruption) if any kernel
subsystem tried to allocate 2GB or more through malloc. zfs would
attempt such an allocation when run on a system with 2TB or more
of RAM.

Note to self: When this is MFCed, sparc64 needs the same fix.
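
A minimal illustration of the truncation, using the reported size:

size_t size = 2UL * 1024 * 1024 * 1024;	/* 2GB zfs request */
int bytes = size;	/* 0x80000000 truncates to INT_MIN on LP64 */
/*
 * Subsequent comparisons and rounding arithmetic on 'bytes' now
 * misbehave; the fix is to carry size_t through the internal functions.
 */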

Differential revision: https://reviews.freebsd.org/D2106
Reviewed by: kib
Reported by: Michael Fuckner <michael@fuckner.net>
Tested by: Michael Fuckner <michael@fuckner.net>
MFC after: 2 weeks


# 1eafc078 14-Mar-2015 Ian Lepore <ian@FreeBSD.org>

Set the SBUF_INCLUDENUL flag in sbuf_new_for_sysctl() so that sysctl
strings returned to userland include the nulterm byte.

Some uses of sbuf_new_for_sysctl() write binary data rather than strings;
clear the SBUF_INCLUDENUL flag after calling sbuf_new_for_sysctl() in
those cases. (Note that the sbuf code still automatically adds a nulterm
byte in sbuf_finish(), but since it's not included in the length it won't
get copied to userland along with the binary data.)

Remove explicit adding of a nulterm byte in a couple places now that it
gets done automatically by the sbuf drain code.
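
For a binary-data handler the pattern looks roughly like this sketch:

struct sbuf sbuf;

sbuf_new_for_sysctl(&sbuf, NULL, 128, req);
sbuf_clear_flags(&sbuf, SBUF_INCLUDENUL);	/* binary: no nulterm */
/* ... sbuf_bcat() the binary records ... */
error = sbuf_finish(&sbuf);
sbuf_delete(&sbuf);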

PR: 195668


# 67c44fa3 31-Dec-2014 Alan Cox <alc@FreeBSD.org>

Eliminate a stale debug message. The per-CPU cache locks were replaced
by critical sections in r145686.

PR: 193254
Submitted by: luke.tw@gmail.com
MFC after: 3 days


# 95c4bf75 30-Nov-2014 Konstantin Belousov <kib@FreeBSD.org>

Provide mutual exclusion between zone allocation/destruction and
uma_reclaim(). Reclamation code must not see half-constructed or
half-destructed zones. Do this by bracketing uma_zcreate() and
uma_zdestroy() with a shared-locked sx, and take the sx exclusively in
uma_reclaim().

Usually zones are not created/destroyed during the system operation,
but tmpfs mounts do cause zone operations and exposed the bug.

Another solution could be to expose a new keg on the uma_kegs list
only after the corresponding zone is fully constructed, with similar
treatment for destruction. But that would probably require riskier
code rearrangement as well.
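
The bracketing looks roughly like the following (the lock name is
illustrative):

/* uma_zcreate() / uma_zdestroy(): may run concurrently. */
sx_slock(&uma_drain_lock);
/* ... construct or destroy the zone ... */
sx_sunlock(&uma_drain_lock);

/* uma_reclaim(): excludes creation and destruction entirely. */
sx_xlock(&uma_drain_lock);
/* ... only fully constructed zones are visible here ... */
sx_xunlock(&uma_drain_lock);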

Reported and tested by: pho
Discussed with: avg
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 10cb2424 30-Oct-2014 Mark Murray <markm@FreeBSD.org>

This is the much-discussed major upgrade to the random(4) device, known to you all as /dev/random.

This code has had an extensive rewrite and a good series of reviews, both by the author and other parties. This means a lot of code has been simplified. Pluggable structures for high-rate entropy generators are available, and it is most definitely not the case that /dev/random can be driven by only a hardware source any more. This has been designed out of the device. Hardware sources are stirred into the CSPRNG (Yarrow, Fortuna) like any other entropy source. Pluggable modules may be written by third parties for additional sources.

The harvesting structures and consequently the locking have been simplified. Entropy harvesting is done in a more general way (the documentation for this will follow). There is some GREAT entropy to be had in the UMA allocator, but it is disabled for now as messing with that is likely to annoy many people.

The venerable (but effective) Yarrow algorithm, which is no longer supported by its authors, now has an alternative, Fortuna. For now, Yarrow is retained as the default algorithm, but this may be changed using a kernel option. It is intended to make Fortuna the default algorithm for 11.0. Interested parties are encouraged to read ISBN 978-0-470-47424-2 "Cryptography Engineering" by Ferguson, Schneier and Kohno for Fortuna's gory details. Heck, read it anyway.

Many thanks to Arthur Mesh who did early grunt work, and who got caught in the crossfire rather more than he deserved to.

My thanks also to folks who helped me thresh this out on whiteboards and in the odd "Hallway track", or otherwise.

My Nomex pants are on. Let the feedback commence!

Reviewed by: trasz,des(partial),imp(partial?),rwatson(partial?)
Approved by: so(des)


# 111fbcd5 05-Oct-2014 Bryan Venteicher <bryanv@FreeBSD.org>

Change the UMA mutex into a rwlock

Acquire the lock in read mode when it is needed only to ensure the
stability of the keg list. The UMA lock may be held for a long time
(relatively speaking) in uma_reclaim() on machines with lots of
zones/kegs. If uma_timeout() were to fire during that period, subsequent
callouts on that CPU could be significantly delayed.

Reviewed by: jhb


# 6e5254e0 04-Oct-2014 Bryan Venteicher <bryanv@FreeBSD.org>

Remove stray uma_mtx lock/unlock in zone_drain_wait()

Callers of zone_drain_wait(M_WAITOK) do not need to hold the uma_mtx
(and were not holding it), but we would attempt to unlock and relock the
mutex if we had to sleep because the zone was already draining. The
M_NOWAIT callers may hold the uma_mtx, but we do not sleep in that case.

Reviewed by: jhb
MFC after: 3 days


# af3b2549 27-Jun-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Pull in r267961 and r267973 again. Fix for issues reported will follow.


# 37a107a4 27-Jun-2014 Glen Barber <gjb@FreeBSD.org>

Revert r267961, r267973:

These changes prevent sysctl(8) from returning proper output,
such as:

1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory


# 3da1cf1e 27-Jun-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types, both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function, allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating the children of a static/extern SYSCTL
node. This operation should probably be made into a factored-out
common macro, since some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading kludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, since malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after: 2 weeks
Sponsored by: Mellanox Technologies


# 3ae10f74 16-Jun-2014 Attilio Rao <attilio@FreeBSD.org>

- Modify vm_page_unwire() and vm_page_enqueue() to directly accept
the queue in which to enqueue pages that are going to be unwired.
- Add stronger checks to the enqueue/dequeue for the pagequeues when
adding and removing pages to them.

Of course, for unmanaged pages the queue parameter of vm_page_unwire()
will be ignored, just as the active parameter is today.
This makes adding new pagequeues quicker.

This change effectively modifies the KPI. __FreeBSD_version will,
however, be bumped only when the full cache of free pages is evicted.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho


# 1aa6c758 12-Jun-2014 Alexander Motin <mav@FreeBSD.org>

Introduce new "256 Bucket" zone to split requests and reduce congestion
on "128 Bucket" zone lock.

MFC after: 2 weeks
Sponsored by: iXsystems, Inc.


# 20d3ab87 12-Jun-2014 Alexander Motin <mav@FreeBSD.org>

When allocating a new bucket for a bucket zone, never take it from the
zone itself, since that will almost certainly fail. Take the next bigger
zone instead.

This situation should not happen with the original bucket zone
configuration: the "32 Bucket" zone uses "64 Bucket" and vice versa. But
if the "64 Bucket" zone lock is congested, the zone may grow its bucket
size and start biting itself.

MFC after: 2 weeks
Sponsored by: iXsystems, Inc.


# 2367b4dd 14-Feb-2014 Dimitry Andric <dim@FreeBSD.org>

After r251709, avoid a clang 3.4 warning about an unused static const
variable (uma_max_ipers), when asserts are disabled.

Reviewed by: glebius
MFC after: 3 days


# 48343a2f 10-Feb-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Make M_ZERO flag work correctly on UMA_ZONE_PCPU zones.

Sponsored by: Nginx, Inc.


# 0a5a3ccb 07-Feb-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Provide macros that make it easy to export uma(9) zone limits and
current usage via sysctl(9):

SYSCTL_UMA_MAX()
SYSCTL_ADD_UMA_MAX()
SYSCTL_UMA_CUR()
SYSCTL_ADD_UMA_CUR()
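
A hypothetical consumer; the zone, the oid names, and the exact argument
order are assumptions following the usual SYSCTL_* conventions:

SYSCTL_UMA_MAX(_kern_example, OID_AUTO, zone_max, CTLFLAG_RW,
    example_zone, "Limit on items in the example zone");
SYSCTL_UMA_CUR(_kern_example, OID_AUTO, zone_cur, CTLFLAG_RD,
    example_zone, "Current items allocated from the example zone");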

Sponsored by: Nginx, Inc.


# a3845534 29-Nov-2013 Craig Rodrigues <rodrigc@FreeBSD.org>

In keg_dtor(), print out the keg name in the "Freed UMA keg was not empty"
message printed to the console. This makes it easier to track down
the source of certain memory leaks.

Suggested by: adrian


# 03175483 28-Nov-2013 Alexander Motin <mav@FreeBSD.org>

- Add bucket size column to `show uma` DDB command.
- Add a `show umacache` command to show similar stats for cache-only UMA zones.


# cec48e00 27-Nov-2013 Alexander Motin <mav@FreeBSD.org>

Make UMA not blindly force offpage slab header allocation for large
(> PAGE_SIZE) zones. If the zone size is not a multiple of PAGE_SIZE,
there may be enough space for the header in the last page, so we may
avoid extra header memory allocation and hash table update/lookup.

ZFS creates a bunch of odd-sized UMA zones (5120, 6144, 7168, 10240,
14336). This change makes good use of at least some of the otherwise
lost memory there.

Reviewed by: avg


# f7104ccd 27-Nov-2013 Alexander Motin <mav@FreeBSD.org>

Don't count bucket allocation failures for UMA zones as their own
failures. There are good reasons for these to happen, such as recursion
prevention, and they are not fatal since buckets are just an optimization
mechanism. Real bucket allocation failures are counted by the bucket
zones themselves anyway, and we don't need double accounting there.


# e8a720fe 27-Nov-2013 Alexander Motin <mav@FreeBSD.org>

Fix a bug introduced in r252226, where the udata argument passed to
bucket_alloc() was used without first making sure that it was really
intended for us.

On some of my systems this bug made the user argument passed by ZFS code
to uma_zalloc_arg() unexpectedly block UMA per-CPU caches for those
zones.


# 8a8d9d14 23-Nov-2013 Alexander Motin <mav@FreeBSD.org>

When purging per-CPU UMA caches, do not return empty buckets to the
global full-bucket cache, so as not to trigger an assertion if an
allocation happens before that global cache gets purged.


# a2de44ab 19-Nov-2013 Alexander Motin <mav@FreeBSD.org>

Implement mechanism to safely but slowly purge UMA per-CPU caches.

This is a last resort for very low memory conditions, in case other
measures to free memory were ineffective. Sequentially cycle through all
CPUs and extract per-CPU cache buckets into the zone cache, from where
they can be freed.


# 4d104ba0 19-Nov-2013 Alexander Motin <mav@FreeBSD.org>

Grow UMA zone bucket size also on lock congestion during item free.

Lock congestion is the same whether it happens on alloc or free, so
handle it equally. Now that we have back pressure, there is no problem
with growing buckets a bit faster. Growth is in any case much slower
than in 9.x.


# f3932e90 19-Nov-2013 Alexander Motin <mav@FreeBSD.org>

Add two new UMA bucket zones to store 3 and 9 items per bucket.

These new buckets make bucket size self-tuning softer and more precise.
Without them there are buckets for 1, 5, 13, 29, ... items. While at
bigger sizes a difference of about 2x is fine, at the smallest sizes it
is 5x and 2.6x respectively. The new buckets make that line look like
1, 3, 5, 9, 13, 29, reducing the jumps between steps, making the
algorithm work more smoothly, and allocating and freeing memory in
better-fitting chunks. Otherwise there is quite a big gap between
allocating 128K and 5x128K of RAM at once.


# ace66b56 19-Nov-2013 Alexander Motin <mav@FreeBSD.org>

Implement soft pressure on UMA cache bucket sizes.

Every time the system detects a low-memory condition, decrease the
bucket size of each zone by one item. As a result, higher memory
pressure will push toward smaller bucket sizes, and so smaller per-CPU
caches, and so more efficient memory use.

Before this change there was no force opposing bucket growth resulting
from practically inevitable zone lock conflicts, and after some run time
the per-CPU caches could consume enough RAM to kill the system.


# 1645995b 31-Aug-2013 Kirk McKusick <mckusick@FreeBSD.org>

Fix bug introduced in rewrite of keg_free_slab in -r251894.
The consequence of the bug is that fini calls are not done
when a slab is freed by a call-back from the page daemon.
It went unnoticed for two months because fini is little used.

I spotted the bug while reading the code to learn how it works
so I could write it up for the next edition of the Design and
Implementation of FreeBSD book.

No MFC needed as this code exists only in HEAD.

Reviewed by: kib, jeff
Tested by: pho


# c325e866 10-Aug-2013 Konstantin Belousov <kib@FreeBSD.org>

Different consumers of the struct vm_page abuse pageq member to keep
additional information, when the page is guaranteed to not belong to a
paging queue. Usually, this results in a lot of type casts which make
reasoning about the code correctness harder.

Sometimes m->object is used instead of pageq, which could cause real
and confusing bugs if non-NULL m->object is leaked. See r141955 and
r253140 for examples.

Change the pageq member into a union containing explicitly-typed
members. Use them instead of type-punning or abusing m->object in x86
pmaps, uma and vm_page_alloc_contig().

Requested and reviewed by: alc
Sponsored by: The FreeBSD Foundation


# 5df87b21 07-Aug-2013 Jeff Roberson <jeff@FreeBSD.org>

Replace kernel virtual address space allocation with vmem. This provides
transparent layering and better fragmentation.

- Normalize functions that allocate memory to use kmem_*
- Those that allocate address space are named kva_*
- Those that operate on maps are named kmap_*
- Implement recursive allocation handling for kmem_arena in vmem.

Reviewed by: alc
Tested by: pho
Sponsored by: EMC / Isilon Storage Division


# e28a647d 23-Jul-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Revert r249590 and, in case mp_ncpus isn't initialized, use MAXCPU. This
allows us to init the counter zone at an early stage of boot.

Reviewed by: kib
Tested by: Lytochkin Boris <lytboris gmail.com>


# a1dff920 28-Jun-2013 Davide Italiano <davide@FreeBSD.org>

Remove a spurious keg lock acquisition.


# 6fd34d6f 25-Jun-2013 Jeff Roberson <jeff@FreeBSD.org>

- Resolve bucket recursion issues by passing a cookie with zone flags
through bucket_alloc() to uma_zalloc_arg() and uma_zfree_arg().
- Make some smaller buckets for large zones to further reduce memory
waste.
- Implement uma_zone_reserve(). This holds aside a number of items only
for callers who specify M_USE_RESERVE. buckets will never be filled
from reserve allocations.

Sponsored by: EMC / Isilon Storage Division


# af526374 20-Jun-2013 Jeff Roberson <jeff@FreeBSD.org>

- Add a per-zone lock for zones without kegs.
- Be more explicit about zone vs keg locking. This functionally changes
almost nothing.
- Add a size parameter to uma_zcache_create() so we can size the buckets.
- Pass the zone to bucket_alloc() so it can modify allocation flags
as appropriate.
- Fix a bug in zone_alloc_bucket() where I missed an address of operator
in a failure case. (Found by pho)

Sponsored by: EMC / Isilon Storage Division


# 8aaf680e 18-Jun-2013 Jeff Roberson <jeff@FreeBSD.org>

- Persist the caller's flags in the bucket allocation flags so we don't
lose a M_NOVM when we recurse into a bucket allocation.

Sponsored by: EMC / Isilon Storage Division


# fc03d22b 17-Jun-2013 Jeff Roberson <jeff@FreeBSD.org>

Refine UMA bucket allocation to reduce space consumption and improve
performance.

- Always free to the alloc bucket if there is space. This gives LIFO
allocation order to improve hot-cache performance. This also allows
for zones with a single bucket per-cpu rather than a pair if the entire
working set fits in one bucket.
- Enable per-cpu caches of buckets. To prevent recursive bucket
allocation one bucket zone still has per-cpu caches disabled.
- Pick the initial bucket size based on a table driven maximum size
per-bucket rather than the number of items per-page. This gives
more sane initial sizes.
- Only grow the bucket size when we face contention on the zone lock, this
causes bucket sizes to grow more slowly.
- Adjust the number of items per-bucket to account for the header space.
This packs the buckets more efficiently per-page while making them
not quite powers of two.
- Eliminate the per-zone free bucket list. Always return buckets back
to the bucket zone. This ensures that as zones grow into larger
bucket sizes they eventually discard the smaller sizes. It persists
fewer buckets in the system. The locking is slightly trickier.
- Only switch buckets in zalloc, not zfree, this eliminates pathological
cases where we ping-pong between two buckets.
- Ensure that the thread that fills a new bucket gets to allocate from
it to give a better upper bound on allocation time.

Sponsored by: EMC / Isilon Storage Division


# 0095a784 16-Jun-2013 Jeff Roberson <jeff@FreeBSD.org>

- Add a new UMA API: uma_zcache_create(). This makes a zone without any
backing memory that is only a container for per-cpu caches of arbitrary
pointer items. These zones have no kegs.
- Convert the regular keg based allocator to use the new import/release
functions.
- Move some stats to be atomics since they would require excessive zone
locking/unlocking with the new import/release paradigm. Make
zone_free_item simpler now that callers can manage more stats.
- Check for these cache-only zones in the public APIs and debugging
code by checking zone_first_keg() against NULL.

Sponsored by: EMC / Isilon Storage Division


# ef72505e 13-Jun-2013 Jeff Roberson <jeff@FreeBSD.org>

- Convert the slab free item list from a linked array of indices to a
bitmap using sys/bitset. This is much simpler, has lower space
overhead and is cheaper in most cases.
- Use a second bitmap for invariants asserts and improve the quality of
the asserts as well as the number of erroneous conditions that we will
catch.
- Drastically simplify sizing code. Special case refcnt zones since they
will be going away.
- Update stale comments.

Sponsored by: EMC / Isilon Storage Division


# 08a3102c 22-Apr-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Panic if a UMA_ZONE_PCPU zone is created at an early stage of boot, when
mp_ncpus isn't yet initialized. Otherwise we will panic at the first
allocation later.

Sponsored by: Nginx, Inc.


# 85dcf349 09-Apr-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Convert UMA code to C99 uintXX_t types.


# 025071f2 08-Apr-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Fix KASSERTs: maximum number of items per slab is 256.


# ad97af7e 08-Apr-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Merge from projects/counters: UMA_ZONE_PCPU zones.

These zones have slab size == sizeof(struct pcpu), but request from VM
enough pages to fit (uk_slabsize * mp_ncpus). An item allocated from such
a zone has a separate twin for each CPU in the system, and these twins
are at a distance of sizeof(struct pcpu) from each other. This magic
distance value will allow us to make some optimizations later.

To address a CPU's private item, simple arithmetic should be used:

item = (type *)((char *)base + sizeof(struct pcpu) * curcpu)

This arithmetic is available as the zpcpu_get() macro in pcpu.h.
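
For example (the zone is illustrative; the caller should prevent
migration, e.g. with critical_enter(), while using the pointer):

uint64_t *base, *mine;

base = uma_zalloc(foo_pcpu_zone, M_WAITOK);
mine = zpcpu_get(base);		/* this CPU's private twin */
(*mine)++;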

To introduce non-page-size slabs, a new field, uk_slabsize, has been
added to uma_keg. This shifted some frequently used fields of uma_keg to
the fourth cache line on amd64. To mitigate this pessimization, the
uma_keg fields were rearranged a bit, and the least frequently used
uk_name and uk_link moved down to the fourth cache line. All other
frequently dereferenced fields fit into the first three cache lines.

Sponsored by: Nginx, Inc.


# 89f6b863 08-Mar-2013 Attilio Rao <attilio@FreeBSD.org>

Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical, but a few notes are warranted:
* The KPI changes as follow:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
At this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI is heavily broken by this commit. Third-party ports must be
updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho


# a4915c21 26-Feb-2013 Attilio Rao <attilio@FreeBSD.org>

Merge from vmc-playground branch:
Replace the sub-optimal uma_zone_set_obj() primitive with the more
modern uma_zone_reserve_kva(). The new primitive reserves beforehand
the necessary KVA space to satisfy the zone allocations and allocates
pages with ALLOC_NOOBJ. More specifically:
- uma_zone_reserve_kva() does not need an object to back the
allocator.
- uma_zone_reserve_kva() can satisfy M_WAITOK requests, in order to
serve zones which need to do uma_prealloc() too.
- When possible, uma_zone_reserve_kva() directly uses the direct mapping
via uma_small_alloc() rather than relying on the KVA / offset
combination.

The removal of the object attribute allows 2 further changes:
1) _vm_object_allocate() becomes static within vm_object.c
2) VM_OBJECT_LOCK_INIT() is removed. This function is replaced by
direct calls to mtx_init() as there is no need to export it anymore
and the calls aren't either homogeneous anymore: there are now small
differences between arguments passed to mtx_init().

Sponsored by: EMC / Isilon storage division
Reviewed by: alc (who also offered almost all the comments)
Tested by: pho, jhb, davide


# 3caae6ca 29-Jan-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Fix typo in debug printf.


# 2f891cd5 07-Dec-2012 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Implemented uma_zone_set_warning(9) function that sets a warning, which
will be printed once the given zone becomes full and cannot allocate an
item. The warning will not be printed more often than every five minutes.

All UMA warnings can be globally turned off by setting sysctl/tunable
vm.zone_warnings to 0.
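
Typical usage (sketch, mirroring the mbuf cluster zone):

uma_zone_set_warning(zone_clust, "kern.ipc.nmbclusters limit reached");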

Discussed on: arch
Obtained from: WHEEL Systems
MFC after: 2 weeks


# bb196eb4 26-Oct-2012 Matthew D Fleming <mdf@FreeBSD.org>

Const-ify the zone name argument to uma_zcreate(9).

MFC after: 3 days


# 0b80c1e4 21-Oct-2012 Eitan Adler <eadler@FreeBSD.org>

Print flags as hex instead of an integer.

PR: kern/168210
Submitted by: linimon
Reviewed by: alc
Approved by: cperciva
MFC after: 3 days


# 2864dbbf 18-Sep-2012 Gleb Smirnoff <glebius@FreeBSD.org>

If caller specifies UMA_ZONE_OFFPAGE explicitly, then do not waste memory
in an allocation for a slab.

Reviewed by: jeff


# 42321809 26-Aug-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Fix function name in keg_cachespread_init() assert.


# c288b548 07-Jul-2012 Eitan Adler <eadler@FreeBSD.org>

Add missing sleep stat increase

PR: kern/168211
Submitted by: linimon
Reviewed by: alc
Approved by: cperciva
MFC after: 3 days


# 687c94aa 02-Jul-2012 John Baldwin <jhb@FreeBSD.org>

Honor db_pager_quit in 'show uma' and 'show malloc'.

MFC after: 1 month


# 251386b4 23-May-2012 Maksim Yevmenkin <emax@FreeBSD.org>

Tweak the condition for disabling allocation from per-CPU buckets in
low-memory situations. I've observed a situation where per-CPU
allocations were disabled while there were enough free cached pages.
Basically, cnt.v_free_count was sitting stably at a value lower
than cnt.v_free_min, and that caused a massive performance drop.

Reviewed by: alc
MFC after: 1 week


# 263811f7 27-Jan-2012 Kip Macy <kmacy@FreeBSD.org>

Exclude kmem_alloc'ed ARC data buffers from kernel minidumps on amd64.
Excluding other allocations, including UMA, now entails the addition of
a single flag to kmem_alloc or uma zone create.

Reviewed by: alc, avg
MFC after: 2 weeks


# 8d689e04 12-Oct-2011 Gleb Smirnoff <glebius@FreeBSD.org>

Make memguard(9) capable of guarding uma(9) allocations.


# 8cd02d00 22-May-2011 Alan Cox <alc@FreeBSD.org>

Correct an error in r222163. Unless UMA_MD_SMALL_ALLOC is defined,
startup_alloc() must be used until uma_startup2() is called.

Reported by: jh


# 342f1793 21-May-2011 Alan Cox <alc@FreeBSD.org>

1. Prior to r214782, UMA did not support multipage allocations before
uma_startup2() was called. Thus, setting the variable "booted" to true in
uma_startup() was ok on machines with UMA_MD_SMALL_ALLOC defined, because
any allocations made after uma_startup() but before uma_startup2() could be
satisfied by uma_small_alloc(). Now, however, some multipage allocations
are necessary before uma_startup2() just to allocate zone structures on
machines with a large number of processors. Thus, a Boolean can no longer
effectively describe the state of the UMA allocator. Instead, make "booted"
have three values to describe how far initialization has progressed. This
allows multipage allocations to continue using startup_alloc() until
uma_startup2(), but single-page allocations may begin using
uma_small_alloc() after uma_startup().

2. With the aforementioned change, only a modest increase in boot pages is
necessary to boot UMA on a large number of processors.

3. Retire UMA_MD_SMALL_ALLOC_NEEDS_VM. It has only been used between
r182028 and r204128.

Reviewed by: attilio [1], nwhitehorn [3]
Tested by: sbruno


# df1bc9de 20-May-2011 Alan Cox <alc@FreeBSD.org>

Eliminate a redundant #include. ("vm/vm_param.h" already includes
"machine/vmparam.h".)


# e4cd31dd 21-Mar-2011 Jeff Roberson <jeff@FreeBSD.org>

- Merge changes to the base system to support OFED. These include
a wider arg2 for sysctl, updates to vlan code, IFT_INFINIBAND,
and other miscellaneous small features.


# 00f0e671 26-Jan-2011 Matthew D Fleming <mdf@FreeBSD.org>

Explicitly wire the user buffer rather than doing it implicitly in
sbuf_new_for_sysctl(9). This allows using an sbuf with a SYSCTL_OUT
drain for extremely large amounts of data where the caller knows that
appropriate references are held, and sleeping is not an issue.

Inspired by: rwatson


# e9a069d8 04-Nov-2010 John Baldwin <jhb@FreeBSD.org>

Update startup_alloc() to support multi-page allocations and allow internal
zones whose objects are larger than a page to use startup_alloc(). This
allows allocation of zone objects during early boot on machines with a large
number of CPUs since the resulting zone objects are larger than a page.

Submitted by: trema
Reviewed by: attilio
MFC after: 1 week


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 20ed0cb0 19-Oct-2010 Matthew D Fleming <mdf@FreeBSD.org>

uma_zfree(zone, NULL) should do nothing, to match free(9).

Noticed by: Ron Steinke <rsteinke at isilon dot com>
MFC after: 3 days


# 1c6cae97 15-Oct-2010 Lawrence Stewart <lstewart@FreeBSD.org>

Change uma_zone_set_max to return the effective value of "nitems" after
rounding. The same value can also be obtained with uma_zone_get_max, but this
change avoids a caller having to make two back-to-back calls.
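
For example:

/*
 * Ask for 1000 items; the return value is the effective limit after
 * rounding up to fill whole slabs, possibly larger than 1000.
 */
effective = uma_zone_set_max(zone, 1000);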

Sponsored by: FreeBSD Foundation
Reviewed by: gnn, jhb


# c4ae7908 15-Oct-2010 Lawrence Stewart <lstewart@FreeBSD.org>

- Simplify implementation of uma_zone_get_max.
- Add uma_zone_get_cur which returns the current approximate occupancy of
a zone. This is useful for providing stats via sysctl amongst other things.

Sponsored by: FreeBSD Foundation
Reviewed by: gnn, jhb
MFC after: 2 weeks


# 4e657159 16-Sep-2010 Matthew D Fleming <mdf@FreeBSD.org>

Re-add r212370 now that the LOR in powerpc64 has been resolved:

Add a drain function for struct sysctl_req, and use it for a variety
of handlers, some of which had to do awkward things to get a large
enough SBUF_FIXEDLEN buffer.

Note that some sysctl handlers were explicitly outputting a trailing
NUL byte. This behaviour was preserved, though it should not be
necessary.

Reviewed by: phk (original patch)


# 404a593e 13-Sep-2010 Matthew D Fleming <mdf@FreeBSD.org>

Revert r212370, as it causes a LOR on powerpc. powerpc does a few
unexpected things in copyout(9) and so wiring the user buffer is not
sufficient to perform a copyout(9) while holding a random mutex.

Requested by: nwhitehorn


# dd67e210 09-Sep-2010 Matthew D Fleming <mdf@FreeBSD.org>

Add a drain function for struct sysctl_req, and use it for a variety of
handlers, some of which had to do awkward things to get a large enough
FIXEDLEN buffer.

Note that some sysctl handlers were explicitly outputting a trailing NUL
byte. This behaviour was preserved, though it should not be necessary.

Reviewed by: phk


# e49471b0 16-Aug-2010 Andre Oppermann <andre@FreeBSD.org>

Add uma_zone_get_max() to obtain the effective limit after a call
to uma_zone_set_max().

The UMA zone limit is not exactly set to the value supplied but
rounded up to completely fill the backing store increment (a page
normally). This can lead to surprising situations where the number
of elements allocated from UMA is higher than the supplied limit
value. The new get function reads back the effective value so that
the supplied limit value can be adjusted to the real limit.

Reviewed by: jeffr
MFC after: 1 week


# bf965959 15-Jun-2010 Sean Bruno <sbruno@FreeBSD.org>

Add a new column to the output of vmstat -z to indicate the number
of times the system was forced to sleep when requesting a new allocation.

Expand the debugger hook, db_show_uma, to display these results as well.

This has proven to be very useful in out of memory situations when
it is not known why systems have become sluggish or fail in odd ways.

Reviewed by: rwatson alc
Approved by: scottl (mentor) peter
Obtained from: Yahoo Inc.


# 3aa6d94e 11-Jun-2010 John Baldwin <jhb@FreeBSD.org>

Update several places that iterate over CPUs to use CPU_FOREACH().


# 451033a4 03-May-2010 Alan Cox <alc@FreeBSD.org>

It makes more sense for the object-based backend allocator to use OBJT_PHYS
objects instead of OBJT_DEFAULT objects because we never reclaim or pageout
the allocated pages. Moreover, they are mapped with pmap_qenter(), which
creates unmanaged mappings.

Reviewed by: kib


# 2965a453 29-Apr-2010 Kip Macy <kmacy@FreeBSD.org>

On Alan's advice, rather than do a wholesale conversion on a single
architecture from page queue lock to a hashed array of page locks
(based on a patch by Jeff Roberson), I've implemented page lock
support in the MI code and have only moved vm_page's hold_count
out from under page queue mutex to page lock. This changes
pmap_extract_and_hold on all pmaps.

Supported by: Bitgravity Inc.

Discussed with: alc, jeffr, and kib


# e2b36efd 29-Jan-2010 Antoine Brodin <antoine@FreeBSD.org>

MFC r201145 to stable/8:
(S)LIST_HEAD_INITIALIZER takes a (S)LIST_HEAD as an argument.
Fix some wrong usages.
Note: this does not affect generated binaries as this argument is not used.

PR: 137213
Submitted by: Eygene Ryabinkin (initial version)


# 13e403fd 28-Dec-2009 Antoine Brodin <antoine@FreeBSD.org>

(S)LIST_HEAD_INITIALIZER takes a (S)LIST_HEAD as an argument.
Fix some wrong usages.
Note: this does not affect generated binaries as this argument is not used.

PR: 137213
Submitted by: Eygene Ryabinkin (initial version)
MFC after: 1 month


# aea6e893 18-Jun-2009 Alan Cox <alc@FreeBSD.org>

Add support for UMA_SLAB_KERNEL to page_free(). (While I'm here remove an
unnecessary newline character from the end of two panic messages.)


# e20a199f 25-Jan-2009 Jeff Roberson <jeff@FreeBSD.org>

- Make the keg abstraction more complete. Permit a zone to have multiple
backend kegs so it may source compatible memory from multiple backends.
This is useful for cases such as NUMA or different layouts for the same
memory type.
- Provide a new api for adding new backend kegs to secondary zones.
- Provide a new flag for adjusting the layout of zones to stagger
allocations better across cache lines.

Sponsored by: Nokia


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 2f2ea10a 22-Aug-2008 Antoine Brodin <antoine@FreeBSD.org>

Remove unused variable nosleepwithlocks.

PR: 126609
Submitted by: Mateusz Guzik
MFC after: 1 month
X-MFC: to stable/7 only, this variable is still used in stable/6


# f620b5bf 22-Aug-2008 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

Allow the MD UMA allocator to use VM routines like kmem_*(). Existing code requires the MD allocator to be available early in the boot process, before the VM is fully available. This defines a new VM define (UMA_MD_SMALL_ALLOC_NEEDS_VM) that allows an MD UMA small allocator to become available at the same time as the default UMA allocator.

Approved by: marcel (mentor)


# 7630c265 04-Apr-2008 Alan Cox <alc@FreeBSD.org>

Reintroduce UMA_SLAB_KMAP; however, change its spelling to
UMA_SLAB_KERNEL for consistency with its sibling UMA_SLAB_KMEM.
(UMA_SLAB_KMAP met its original demise in revision 1.30 of
vm/uma_core.c.) UMA_SLAB_KERNEL is now required by the jumbo frame
allocators. Without it, UMA cannot correctly return pages from the
jumbo frame zones to the VM system because it resets the pages' object
field to NULL instead of the kernel object. In more detail, the jumbo
frame zones are created with the option UMA_ZONE_REFCNT. This causes
UMA to overwrite the pages' object field with the address of the slab.
However, when UMA wants to release these pages, it doesn't know how to
restore the object field, so it sets it to NULL. This change teaches
UMA how to reset the object field to the kernel object.

Crashes reported by: kris
Fix tested by: kris
Fix discussed with: jeff
MFC after: 6 weeks


# 71eb44c7 11-Oct-2007 John Baldwin <jhb@FreeBSD.org>

Allow recursion on the 'zones' internal UMA zone.

Submitted by: thompsa
MFC after: 1 week
Approved by: re (kensmith)
Discussed with: jeff


# 2feb50bf 31-May-2007 Attilio Rao <attilio@FreeBSD.org>

Revert the VMCNT_* operations introduction.
Probably, a general approach is not the best solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)


# 222d0195 18-May-2007 Jeff Roberson <jeff@FreeBSD.org>

- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts. This can be used to abstract away pcpu details but also changes
to use atomics for all counters now. This means sched lock is no longer
responsible for protecting counts in the switch routines.

Contributed by: Attilio Rao <attilio@FreeBSD.org>


# 1e319f6d 11-Feb-2007 Robert Watson <rwatson@FreeBSD.org>

Add a uma_set_align() interface, which will be called at most once during
boot by MD code to indicate the detected alignment preference. Rather than
cache alignment being encoded in UMA consumers by defining a global
alignment value of (16 - 1) in UMA_ALIGN_CACHE, UMA_ALIGN_CACHE is now
a special value (-1) that causes UMA to look at registered alignment. If
no preferred alignment has been selected by MD code, a default alignment
of (16 - 1) will be used.

Currently, no hardware platforms specify alignment; architecture
maintainers will need to modify MD startup code to specify an alignment
if desired. This must occur before initialization of UMA so that all UMA
zones pick up the requested alignment.

Reviewed by: jeff, alc
Submitted by: attilio


# 6c125b8d 24-Jan-2007 Mohan Srinivasan <mohans@FreeBSD.org>

Fix for problems that occur when all mbuf clusters migrate to the mbuf packet
zone. Cluster allocations fail when this happens. Also processes that may have
blocked on cluster allocations will never be woken up. Thanks to rwatson for
an overview of the issue and pointers to the mbuma paper and his tool to dump
out UMA zones.

Reviewed by: andre@


# 77380291 24-Jan-2007 Mohan Srinivasan <mohans@FreeBSD.org>

Fix for a bug where only one process (of multiple) blocked on
maxpages on a zone is woken up, with the rest never being woken up as
a result of the ZFLAG_FULL flag being cleared. Wake up all such blocked
processes instead. This change introduces a thundering herd, but since
this should be relatively infrequent, optimizing this (by introducing
a count of blocked processes, for example) may be premature.

Reviewed by: ups@


# 635fd505 10-Jan-2007 Robert Watson <rwatson@FreeBSD.org>

Remove uma_zalloc_arg() hack, which coerced M_WAITOK to M_NOWAIT when
allocations were made using improper flags in interrupt context.
Replace with a simple WITNESS warning call. This restores the
invariant that M_WAITOK allocations will always succeed or die
horribly trying, which is relied on by many UMA consumers.

MFC after: 3 weeks
Discussed with: jhb


# 663b416f 05-Jan-2007 John Baldwin <jhb@FreeBSD.org>

- Add a new function uma_zone_exhausted() to see if a zone is full.
- Add a printf in swp_pager_meta_build() to warn if the swapzone becomes
exhausted so that there's at least a warning before a box that runs out
of swapzone space before running out of swap space deadlocks.
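
The check is a one-liner (the message text is approximate):

if (uma_zone_exhausted(swap_zone))
	printf("WARNING: swap zone exhausted, increase kern.maxswzone\n");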

MFC after: 1 week
Reviewed by: alc


# ae4e9636 25-Oct-2006 Robert Watson <rwatson@FreeBSD.org>

Better align output of "show uma" by moving from displaying the basic
counters of allocs/frees/use for each zone to the same statistics
shown by userspace "vmstat -z".

MFC after: 3 days


# a0d4b0ae 17-Jul-2006 Robert Watson <rwatson@FreeBSD.org>

Fix build of uma_core.c when DDB is not compiled into the kernel by
making uma_zone_sumstat() ifdef DDB, as it's only used with DDB now.

Submitted by: Wolfram Fenske <Wolfram.Fenske at Student.Uni-Magdeburg.DE>


# eabadd9e 16-Jul-2006 Robert Watson <rwatson@FreeBSD.org>

Remove sysctl_vm_zone() and vm.zone sysctl from 7.x. As of 6.x,
libmemstat(3) is used by vmstat (and friends) to produce more accurate
and more detailed statistics information in a machine-readable way,
and vmstat continues to provide the same text-based front-end.

This change should not be MFC'd.


# 4f538c74 21-May-2006 Robert Watson <rwatson@FreeBSD.org>

When allocating a bucket to hold a free'd item in UMA fails, don't
report this as an allocation failure for the item type. The failure
will be separately recorded with the bucket type. This may eliminate
high mbuf allocation failure counts under some circumstances, which
can be alarming in appearance, but not actually a problem in
practice.

MFC after: 2 weeks
Reported by: ps, Peter J. Blok <pblok at bsd4all dot org>,
OxY <oxy at field dot hu>,
Gabor MICSKO <gmicskoa at szintezis dot hu>


# 082dc776 11-Feb-2006 Robert Watson <rwatson@FreeBSD.org>

Skip per-cpu caches associated with absent CPUs when generating a
memory statistics record stream via sysctl.

MFC after: 3 days


# ffaf2c55 27-Jan-2006 John Baldwin <jhb@FreeBSD.org>

Add a new macro wrapper WITNESS_CHECK() around the witness_warn() function.
The difference between WITNESS_CHECK() and WITNESS_WARN() is that
WITNESS_CHECK() should be used in the places that the return value of
witness_warn() is checked, whereas WITNESS_WARN() should be used in places
where the return value is ignored. Specifically, in a kernel without
WITNESS enabled, WITNESS_WARN() evaluates to an empty string whereas
WITNESS_CHECK evaluates to 0. I also updated the one place that was
checking the return value of WITNESS_WARN() to use WITNESS_CHECK.
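
The distinction in use (sketch):

/* Return value consumed: must compile to 0 without WITNESS. */
if (WITNESS_CHECK(WARN_GIANTOK | WARN_SLEEPOK, NULL,
    "uma_zalloc_arg: zone \"%s\"", zone->uz_name) != 0) {
	/* e.g. refuse to honor M_WAITOK semantics here */
}

/* Return value ignored: WITNESS_WARN() suffices. */
WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL, "malloc");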


# ca49f12f 06-Jan-2006 John Baldwin <jhb@FreeBSD.org>

Reduce the scope of one #ifdef to avoid duplicating a SYSCTL_INT() macro
and trim another unneeded #ifdef (it was just around a macro that is
already conditionally defined).


# 64a266f9 20-Oct-2005 Robert Watson <rwatson@FreeBSD.org>

Change format string for u_int64_t to %ju from %llu, in order to use the
correct format string on 64-bit systems.

Pointed out by: pjd


# 48c5777e 20-Oct-2005 Robert Watson <rwatson@FreeBSD.org>

Add a "show uma" command to DDB, which prints out the current stats for
available UMA zones. Quite useful for post-mortem debugging of memory
leaks without a dump device configured on a panicked box.

MFC after: 2 weeks


# 3803b26b 08-Oct-2005 Dag-Erling Smørgrav <des@FreeBSD.org>

As alc pointed out to me, vm_page.c 1.305 was incomplete: uma_startup()
still uses the constant UMA_BOOT_PAGES. Change it to accept boot_pages
as an additional argument.

MFC after: 2 weeks


# f353d338 09-Sep-2005 Alan Cox <alc@FreeBSD.org>

Introduce a new lock for the purpose of synchronizing access to the
UMA boot pages.

Disable recursion on the general UMA lock now that startup_alloc() no
longer uses it.

Eliminate the variable uma_boot_free. It serves no purpose.

Note: This change eliminates a lock-order reversal between a system
map mutex and the UMA lock. See
http://sources.zabbadoz.net/freebsd/lor.html#109 for details.

MFC after: 3 days


# cbbb4a00 24-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

Rename UMA_MAX_NAME to UTH_MAX_NAME, since it's a maximum in the
monitoring API, which might or might not be the same as the internal
maximum (currently none).

Export flag information on UMA zones -- in particular, whether or
not this is a secondary zone, and so the keg free count should be
considered in that light.

MFC after: 1 day


# f4ff923b 20-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

Further UMA statistics related changes:

- Add a new uma_zfree_internal() flag, ZFREE_STATFREE, which causes it to
to update the zone's uz_frees statistic. Previously, the statistic was
updated unconditionally.

- Use the flag in situations where a "real" free occurs: i.e., one where
the caller is freeing an allocated item, to be differentiated from
situations where uma_zfree_internal() is used to tear down the item
during slab teardown in order to invoke its fini() method. Also use
the flag when UMA is freeing its internal objects.

- When exchanging a bucket with the zone from the per-CPU cache when
freeing an item, flush cache statistics back to the zone (since the
zone lock and critical section are both held) to match the allocation
case.

MFC after: 3 days


# ab3a57c0 16-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

Use mp_maxid in preference to MAXCPU when creating exports of UMA
per-CPU cache statistics. UMA sizes the cache array based on the
number of CPUs at boot (mp_maxid + 1), and iterating based on MAXCPU
could read off the end of the array (into the next zone).

Reported by: yongari
MFC after: 1 week


# 08ecce74 16-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

Improve canonicalization of copyrights. Order copyrights by order of
assertion (jeff, bmilekic, rwatson).

Suggested ages ago by: bde
MFC after: 1 week


# 2450bbb8 16-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

Move the unlocking of the zone mutex in sysctl_vm_zone_stats() so that
it covers the following of the uc_alloc/freebucket cache pointers.
Originally, I felt that the race wasn't helped by holding the mutex,
hence a comment in the code and not holding it across the cache access.
However, it does improve consistency, as while it doesn't prevent
bucket exchange, it does prevent bucket pointer invalidation. So a
race in gathering cache free space statistics still can occur, but not
one that follows an invalid bucket pointer, if the mutex is held.

Submitted by: yongari
MFC after: 1 week


# 2018f30c 15-Jul-2005 Mike Silbersack <silby@FreeBSD.org>

Increase the flags field for kegs from a 16 to a 32 bit value;
we have exhausted all 16 flags.


# 2019094a 15-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

Track UMA(9) allocation failures by zone, and export via sysctl.

Requested by: victor cruceru <victor dot cruceru at gmail dot com>
MFC after: 1 week


# 7a52a97e 14-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

Introduce a new sysctl, vm.zone_stats, which exports UMA(9) allocator
statistics via a binary structure stream:

- Add structure 'uma_stream_header', which defines a stream version,
definition of MAXCPUs used in the stream, and the number of zone
records in the stream.

- Add structure 'uma_type_header', which defines the name, alignment,
size, resource allocation limits, current pages allocated, preferred
bucket size, and central zone + keg statistics.

- Add structure 'uma_percpu_stat', which, for each per-CPU cache,
includes the number of allocations and frees, as well as the number
of free items in the cache.

- When the sysctl is queried, return a stream header, followed by a
series of type descriptions, each consisting of a type header
followed by a series of MAXCPUs uma_percpu_stat structures holding
per-CPU allocation information. Typical values of MAXCPU will be
1 (UP compiled kernel) and 16 (SMP compiled kernel).

This query mechanism allows user space monitoring tools to extract
memory allocation statistics in a machine-readable form, and to do so
at a per-CPU granularity, allowing monitoring of allocation patterns
across CPUs in order to better understand the distribution of work and
memory flow over multiple CPUs.

While here, also export the number of UMA zones as a sysctl
vm.uma_count, in order to assist in sizing user space buffers to
receive the stream.

A follow-up commit of libmemstat(3), a library to monitor kernel memory
allocation, will occur in the next few days. This change directly
supports converting netstat(1)'s "-mb" mode to using UMA-sourced stats
rather than separately maintained mbuf allocator statistics.

MFC after: 1 week


# 773df9ab 14-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

In addition to tracking allocs in the zone, also track frees. Add
a zone free counter, as well as a cache free counter.

MFC after: 1 week


# 2c743d36 14-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

In an earlier world order, UMA would flush per-CPU statistics to the
zone whenever it was moving buckets between the zone and the cache,
or when coalescing statistics across the CPU. Remove flushing of
statistics to the zone when coalescing statistics as part of sysctl,
as we won't be running on the right CPU to write to the cache
statistics.

Add a missed gathering of statistics: when uma_zalloc_internal()
does a special case allocation of a single item, make sure to update
the zone statistics to represent this. Previously this case wasn't
accounted for in user-visible statistics.

MFC after: 1 week


# 5d1ae027 29-Apr-2005 Robert Watson <rwatson@FreeBSD.org>

Modify UMA to use critical sections to protect per-CPU caches, rather than
mutexes, which offers lower overhead on both UP and SMP. When allocating
from or freeing to the per-cpu cache, without INVARIANTS enabled, we now
no longer perform any mutex operations, which offers a 1%-3% performance
improvement in a variety of micro-benchmarks. We rely on critical
sections to prevent (a) preemption resulting in reentrant access to UMA on
a single CPU, and (b) migration of the thread during access. In the event
we need to go back to the zone for a new bucket, we release the critical
section to acquire the global zone mutex, and must re-acquire the critical
section and re-evaluate which cache we are accessing in case migration has
occurred, or circumstances have changed in the current cache.

Per-CPU cache statistics are now gathered lock-free by the sysctl, which
can result in small races in statistics reporting for caches.

Reviewed by: bmilekic, jeff (somewhat)
Tested by: rwatson, kris, gnn, scottl, mike at sentex dot net, others
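
A compressed sketch of the pattern this entry describes, loosely after
uma_zalloc_arg(); the cache and bucket field names follow UMA, but the
logic is simplified:

    critical_enter();
    cache = &zone->uz_cpu[curcpu];
    bucket = cache->uc_allocbucket;
    if (bucket != NULL && bucket->ub_cnt > 0) {
        item = bucket->ub_bucket[--bucket->ub_cnt];
        critical_exit();
        return (item);
    }
    /*
     * The per-CPU cache is empty.  Leave the critical section before
     * taking the zone mutex, then re-enter it and re-evaluate which
     * cache we are using: we may have migrated to another CPU.
     */
    critical_exit();
    ZONE_LOCK(zone);
    critical_enter();
    cache = &zone->uz_cpu[curcpu];
    /* ... refill the cache from the zone and retry ... */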


# b70458ae 23-Feb-2005 Alan Cox <alc@FreeBSD.org>

Revert the first part of revision 1.114 and modify the second part. On
architectures implementing uma_small_alloc() pages do not necessarily
belong to the kmem object.


# 8076cb52 16-Feb-2005 Bosko Milekic <bmilekic@FreeBSD.org>

Well, it seems that I prematurely removed the "All rights reserved"
statement from some files, so re-add it for the moment, until the
related legalese is sorted out. This change affects:

sys/kern/kern_mbuf.c
sys/vm/memguard.c
sys/vm/memguard.h
sys/vm/uma.h
sys/vm/uma_core.c
sys/vm/uma_dbg.c
sys/vm/uma_dbg.h
sys/vm/uma_int.h


# 500f29d0 16-Feb-2005 Bosko Milekic <bmilekic@FreeBSD.org>

Make UMA set the overloaded page->object back to kmem_object for
UMA_ZONE_REFCNT and UMA_ZONE_MALLOC zones, as the page(s) undoubtedly
came from kmem_map for those two. Previously it would set it back
to NULL for UMA_ZONE_REFCNT zones and although this was probably not
fatal, it added MORE code for no reason.


# c5c1b16e 10-Jan-2005 Bosko Milekic <bmilekic@FreeBSD.org>

While we want the recursion protection for the bucket zones so that
recursion from the VM is handled (and the calling code that allocates
buckets knows how to deal with it), we do not want to prevent allocation
from the slab header zones (slabzone and slabrefzone) if uk_recurse is
not zero for them. The reason is that it could lead to NULL being
returned for the slab header allocations even in the M_WAITOK
case, and the caller can't handle that (this is also explained in a
comment with this commit).

The problem analysis is documented in our mailing lists:
http://docs.freebsd.org/cgi/getmsg.cgi?fetch=153445+0+archive/2004/freebsd-current/20041231.freebsd-current

(see entire thread for proper context).

Crash dump data provided by: Peter Holm <peter@holm.cc>


# 1e183df2 10-Jan-2005 Stefan Farfeleder <stefanf@FreeBSD.org>

ISO C requires at least one element in an initialiser list.


# 60727d8b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# 7b871205 25-Dec-2004 Bosko Milekic <bmilekic@FreeBSD.org>

Add my copyright and update Jeff's copyright on UMA source files,
as per his request.

Discussed with: Jeffrey Roberson


# dc2c7965 06-Nov-2004 Robert Watson <rwatson@FreeBSD.org>

Abstract the logic to look up the uma_bucket_zone given a desired
number of entries into bucket_zone_lookup(), which helps make more
clear the logic of consumers of bucket zones.

Annotate the behavior of bucket_init() with a comment indicating
how the various data structures, including the bucket lookup tables,
are initialized.
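
A sketch of what such a lookup can look like; the table contents and
bucket sizes are illustrative:

    struct uma_bucket_zone {
        uma_zone_t  ubz_zone;
        const char  *ubz_name;
        int         ubz_entries;    /* bucket capacity */
    };

    static struct uma_bucket_zone bucket_zones[] = {
        { NULL, "16 Bucket", 16 },
        { NULL, "32 Bucket", 32 },
        { NULL, "128 Bucket", 128 },
        { NULL, NULL, 0 }
    };

    static struct uma_bucket_zone *
    bucket_zone_lookup(int entries)
    {
        struct uma_bucket_zone *ubz;

        /* Pick the smallest bucket zone that fits the request. */
        for (ubz = bucket_zones; ubz->ubz_entries != 0; ubz++)
            if (ubz->ubz_entries >= entries)
                return (ubz);
        return (ubz - 1);   /* fall back to the largest */
    }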


# f9d27e75 06-Nov-2004 Robert Watson <rwatson@FreeBSD.org>

Annotate what bucket_size[] array does; staticize since it's used only
in uma_core.c.


# a5a262c6 27-Oct-2004 Bosko Milekic <bmilekic@FreeBSD.org>

Fix an INVARIANTS-only bug introduced in Revision 1.104:

If INVARIANTS is defined, and in the rare case that we have
allocated some objects from the slab and at least one initializer
on at least one of those objects failed, and we need to fail the
allocation and push the uninitialized items back into the slab
caches -- in that scenario, we would fail to [re]set the
bucket cache's ub_bucket item references to NULL, which would
eventually trigger a KASSERT.


# 55fc8c11 09-Oct-2004 Brian Feldman <green@FreeBSD.org>

In the previous revision, I did not intend to change the default value
of "nosleepwithlocks."

Submitted by: ru


# ab14a3f7 08-Oct-2004 Brian Feldman <green@FreeBSD.org>

Fix critical stability problems that can cause UMA mbuf cluster
state management corruption, mbuf leaks, general mbuf corruption,
and at least on i386 a first level splash damage radius that
encompasses up to about half a megabyte of the memory after
an mbuf cluster's allocation slab. In short, this has caused
instability nightmares anywhere the right kind of network traffic
is present.

When the polymorphic refcount slabs were added to UMA, the new types
were not used pervasively. In particular, the slab management
structure was turned into one for refcounts, and one for non-refcounts
(supposed to be mostly like the old slab management structure),
but the latter was almost always used throughout. In general, every
access to zones with UMA_ZONE_REFCNT turned on corrupted the
"next free" slab offset and the refcount with each other and
with other allocations (on i386, 2 mbuf clusters per 4096 byte slab).

Fix things so that the right type is used to access refcounted zones
where it was not before. There are, it seems, additional errors of
gross overestimation of padding that would cause large kegs
(née zones) to be allocated when small ones would do. Unless I have
analyzed this incorrectly, it is not directly harmful.


# 3659f747 06-Aug-2004 Robert Watson <rwatson@FreeBSD.org>

Generate KTR trace records for uma_zalloc_arg() and uma_zfree_arg().
This doesn't trace every event of interest in UMA, but provides
enough basic information to explain lock traces and sleep patterns.


# b23f72e9 01-Aug-2004 Brian Feldman <green@FreeBSD.org>

* Add a "how" argument to uma_zone constructors and initialization functions
so that they know whether the allocation is supposed to be able to sleep
or not.
* Allow uma_zone constructors and initialization functions to return either
success or error. Almost all of the ones in the tree currently return
success unconditionally, but mbuf is a notable exception: the packet
zone constructor wants to be able to fail if it cannot suballocate an
mbuf cluster, and the mbuf allocators want to be able to fail in general
in a MAC kernel if the MAC mbuf initializer fails. This fixes the
panics people are seeing when they run out of memory for mbuf clusters.
* Allow debug.nosleepwithlocks on WITNESS to be disabled, without changing
the default.

Both bmilekic and jeff have reviewed the changes made to make failable
zone allocations work.
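
A sketch of a failable constructor under the new interface, modeled on
the packet-zone case described above; the function name and details
are illustrative:

    static int
    mb_ctor_pack(void *mem, int size, void *arg, int how)
    {
        struct mbuf *m = mem;

        /* Suballocate the cluster; with M_NOWAIT this may fail. */
        m->m_ext.ext_buf = uma_zalloc(zone_clust, how);
        if (m->m_ext.ext_buf == NULL)
            return (ENOMEM);
        /* ... remaining mbuf/cluster initialization ... */
        return (0);
    }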


# 244f4554 29-Jul-2004 Bosko Milekic <bmilekic@FreeBSD.org>

Rework the way slab header storage space is calculated in UMA.

- zone_large_init() stays pretty much the same.
- zone_small_init() will try to stash the slab header in the slab page
being allocated if the amount of calculated wasted space is less
than UMA_MAX_WASTE (for both the UMA_ZONE_REFCNT case and regular
case). If the amount of wasted space is >= UMA_MAX_WASTE, then
UMA_ZONE_OFFPAGE will be set and the slab header will be allocated
separately for better use of space.
- uma_startup() calculates the maximum ipers required in offpage slabs
(so that the offpage slab header zone(s) can be sized accordingly).
The algorithm used to calculate this replaces the old calculation
(which only happened to work coincidentally). We now iterate over
possible object sizes, starting from the smallest one, until we
determine that wastedspace calculated in zone_small_init() might
end up being greater than UMA_MAX_WASTE, at which point we use the
found object size to compute the maximum possible ipers. The
reason this works is because:
- wastedspace versus objectsize is a see-saw function with
local minima all equal to zero and local maxima growing
directly proportional to objectsize. This implies that
for objects up to or equal a certain objectsize, the see-saw
remains entirely below UMA_MAX_WASTE, so for those objectsizes
it is impossible to ever go OFFPAGE for slab headers.
- ipers (items-per-slab) versus objectsize is an inversely
proportional function which falls off very quickly (very large
for small objectsizes).
- To determine the maximum ipers we'll ever need from OFFPAGE
slab headers we first find the largest objectsize for which
we are guaranteed not to go offpage, and use it to compute
ipers (as though we were offpage). Since the only objectsizes
allowed to go offpage are bigger than the found objectsize,
and since ipers vs objectsize is inversely proportional (and
monotonically decreasing), then we are guaranteed that the
ipers computed is always >= what we will ever need in offpage
slab headers.
- Define UMA_FRITM_SZ and UMA_FRITMREF_SZ to be the actual (possibly
padded) size of each freelist index so that offset calculations are
fixed.

This might fix weird data corruption problems and certainly allows
ARM to now boot to at least single-user (via simulator).

Tested on i386 UP by me.
Tested on sparc64 SMP by fenner.
Tested on ARM simulator to single-user by cognet.
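
In code form, the per-zone quantities discussed above reduce to
roughly the following sketch; the real zone_small_init() also handles
the UMA_ZONE_REFCNT layout and alignment rounding:

    /* Space left for items when the slab header stays in the page. */
    memused = UMA_SLAB_SIZE - sizeof(struct uma_slab);
    ipers = memused / (objsize + UMA_FRITM_SZ);     /* items per slab */
    wastedspace = memused - ipers * (objsize + UMA_FRITM_SZ);
    if (wastedspace >= UMA_MAX_WASTE) {
        /* Keep the whole page for items; header goes off-page. */
        zone->uz_flags |= UMA_ZFLAG_OFFPAGE;
        ipers = UMA_SLAB_SIZE / objsize;
    }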


# 5285558a 22-Jul-2004 Alan Cox <alc@FreeBSD.org>

- Change uma_zone_set_obj() to call kmem_alloc_nofault() instead of
kmem_alloc_pageable(). The difference between these is that an errant
memory access to the zone will be detected sooner with
kmem_alloc_nofault().

The following changes serve to eliminate the following lock-order
reversal reported by witness:

1st 0xc1a3c084 vm object (vm object) @ vm/swap_pager.c:1311
2nd 0xc07acb00 swap_pager swhash (swap_pager swhash) @ vm/swap_pager.c:1797
3rd 0xc1804bdc vm object (vm object) @ vm/uma_core.c:931

There is no potential deadlock in this case. However, witness is unable
to recognize this because vm objects used by UMA have the same type as
ordinary vm objects. To remedy this, we make the following changes:

- Add a mutex type argument to VM_OBJECT_LOCK_INIT().
- Use the mutex type argument to assign distinct types to special
vm objects such as the kernel object, kmem object, and UMA objects.
- Define a static swap zone object for use by UMA. (Only static
objects are assigned a special mutex type.)


# 0c3c862e 19-Jul-2004 Brian Feldman <green@FreeBSD.org>

Since breakage of malloc(9)/uma_zalloc(9) is totally non-optional in
GENERIC/for WITNESS users, make sure the sysctl to disable the behavior
is read-only and always enabled.


# 0d0837ee 04-Jul-2004 Bosko Milekic <bmilekic@FreeBSD.org>

Introduce debug.nosleepwithlocks sysctl, 0 by default. If set to 1
and WITNESS is not built, then force all M_WAITOK allocations to
M_NOWAIT behavior (transparently). This is to be used temporarily
if weird deadlocks are reported because we still have code paths
that perform M_WAITOK allocations with lock(s) held, which can
lead to deadlock. If WITNESS is compiled, then the sysctl is ignored
and we ask witness to tell us whether we have locks held, converting
to M_NOWAIT behavior only if it tells us that we do.

Note this removes the previous mbuf.h inclusion as well (only needed
by last revision), and cleans up unneeded [artificial] comparisons
to just the mbuf zones. The problem described above has nothing to
do with previous mbuf wait behavior; it is a general problem.


# 7a708c36 04-Jul-2004 Brian Feldman <green@FreeBSD.org>

Reextend the M_WAITOK-disabling-hack to all three of the mbuf-related
zones, and do it by direct comparison of uma_zone_t instead of strcmp.

The mbuf subsystem used to provide M_TRYWAIT/M_DONTWAIT semantics, but
this is mostly no longer the case. M_WAITOK has taken over the spot
M_TRYWAIT used to have, and for mbuf things, still may return NULL if
the code path is incorrectly holding a mutex going into mbuf allocation
functions.

The M_WAITOK/M_NOWAIT semantics are absolute; though it may deadlock
the system to try to malloc or uma_zalloc something with a mutex held
and M_WAITOK specified, it is absolutely required to not return NULL
and will result in instability and/or security breaches otherwise.
There is still room to add the WITNESS_WARN() to all cases so that
we are notified of the possibility of deadlocks, but it cannot change
the value of the "badness" variable and allow allocation to actually
fail except for the specialized cases which used to be M_TRYWAIT.


# cf107c1d 03-Jul-2004 Brian Feldman <green@FreeBSD.org>

Limit mbuma damage. Suddenly ALL allocations with M_WAITOK are subject
to failing -- that is, allocations via malloc(M_WAITOK) that are required
to never fail -- if WITNESS is not defined. While everyone should be
running WITNESS, in any case, zone "Mbuf" allocations are really the only
ones that should be screwed with by this hack.

This hack is crashing people, and would continue to do so with or without
WITNESS. Things shouldn't be allocating with M_WAITOK with locks held,
but it's not okay just to always remove M_WAITOK when !WITNESS.

Reported by: Bernd Walter <ticso@cicely5.cicely.de>


# cc822cb5 23-Jun-2004 Bosko Milekic <bmilekic@FreeBSD.org>

Make uma_mtx MTX_RECURSE. Here's why:

The general UMA lock is a recursion-allowed lock because
there is a code path where, while we're still configured
to use startup_alloc() for backend page allocations, we
may end up in uma_reclaim() which calls zone_foreach(zone_drain),
which grabs uma_mtx, only to later call into startup_alloc()
because while freeing we needed to allocate a bucket. Since
startup_alloc() also takes uma_mtx, we need to be able to
recurse on it.

This exact explanation also added as comment above mtx_init().

Trace showing recursion reported by: Peter Holm <peter-at-holm.cc>
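
The change itself amounts to one flag at initialization time; a sketch
(the lock name string is illustrative):

    /*
     * uma_mtx may be reentered: zone_foreach(zone_drain) holds it, and
     * freeing may need a bucket, which can fall back to startup_alloc(),
     * which takes uma_mtx again.
     */
    mtx_init(&uma_mtx, "UMA lock", NULL, MTX_DEF | MTX_RECURSE);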


# 7fd87882 09-Jun-2004 Bosko Milekic <bmilekic@FreeBSD.org>

Backout previous change, I think Julian has a better solution which
does not require type-stable refcnts here.


# e66468ea 09-Jun-2004 Bosko Milekic <bmilekic@FreeBSD.org>

Make the slabrefzone, the zone from which we allocated slabs with
internal reference counters, UMA_ZONE_NOFREE. This way, those slabs
(with their ref counts) will be effectively type-stable, then using
a trick like this on the refcount is no longer dangerous:

    MEXT_REM_REF(m);
    if (atomic_cmpset_int(m->m_ext.ref_cnt, 0, 1)) {
        if (m->m_ext.ext_type == EXT_PACKET) {
            uma_zfree(zone_pack, m);
            return;
        } else if (m->m_ext.ext_type == EXT_CLUSTER) {
            uma_zfree(zone_clust, m->m_ext.ext_buf);
            m->m_ext.ext_buf = NULL;
        } else {
            (*(m->m_ext.ext_free))(m->m_ext.ext_buf,
                m->m_ext.ext_args);
            if (m->m_ext.ext_type != EXT_EXTREF)
                free(m->m_ext.ref_cnt, M_MBUF);
        }
    }
    uma_zfree(zone_mbuf, m);

Previously, a second thread hitting the above cmpset might
actually read the refcnt AFTER it has already been freed. A very
rare occurrence. Now we'll know that it won't be freed, though.

Spotted by: julian, pjd


# 099a0e58 31-May-2004 Bosko Milekic <bmilekic@FreeBSD.org>

Bring in mbuma to replace mballoc.

mbuma is an Mbuf & Cluster allocator built on top of a number of
extensions to the UMA framework, all included herein.

Extensions to UMA worth noting:
- Better layering between slab <-> zone caches; introduce
Keg structure which splits off slab cache away from the
zone structure and allows multiple zones to be stacked
on top of a single Keg (single type of slab cache);
perhaps we should look into defining a subset API on
top of the Keg for special use by malloc(9),
for example.
- UMA_ZONE_REFCNT zones can now be added, and reference
counters automagically allocated for them within the end
of the associated slab structures. uma_find_refcnt()
does a kextract to fetch the slab struct reference from
the underlying page, and lookup the corresponding refcnt.

mbuma things worth noting:
- integrates mbuf & cluster allocations with extended UMA
and provides caches for commonly-allocated items; defines
several zones (two primary, one secondary) and two kegs.
- change up certain code paths that always used to do:
m_get() + m_clget() to instead just use m_getcl() and
try to take advantage of the newly defined secondary
Packet zone.
- netstat(1) and systat(1) quickly hacked up to do basic
stat reporting but additional stats work needs to be
done once some other details within UMA have been taken
care of and it becomes clearer to how stats will work
within the modified framework.

From the user perspective, one implication is that the
NMBCLUSTERS compile-time option is no longer used. The
maximum number of clusters is still capped off according
to maxusers, but it can be made unlimited by setting
the kern.ipc.nmbclusters boot-time tunable to zero.
Work should be done to write an appropriate sysctl
handler allowing dynamic tuning of kern.ipc.nmbclusters
at runtime.

Additional things worth noting/known issues (READ):
- One report of 'ips' (ServeRAID) driver acting really
slow in conjunction with mbuma. Need more data.
Latest report is that ips is equally sucking with
and without mbuma.
- Giant leak in NFS code sometimes occurs, can't
reproduce but currently analyzing; brueffer is
able to reproduce but THIS IS NOT an mbuma-specific
problem and currently occurs even WITHOUT mbuma.
- Issues in network locking: there is at least one
code path in the rip code where one or more locks
are acquired and we end up in m_prepend() with
M_WAITOK, which causes WITNESS to whine from within
UMA. Current temporary solution: force all UMA
allocations to be M_NOWAIT from within UMA for now
to avoid deadlocks unless WITNESS is defined and we
can determine with certainty that we're not holding
any locks when we're M_WAITOK.
- I've seen at least one weird socketbuffer empty-but-
mbuf-still-attached panic. I don't believe this
to be related to mbuma but please keep your eyes
open, turn on debugging, and capture crash dumps.

This change removes more code than it adds.

A paper is available detailing the change and considering
various performance issues, it was presented at BSDCan2004:
http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
Please read the paper for Future Work and implementation
details, as well as credits.

Testing and Debugging:
rwatson,
brueffer,
Ketrien I. Saihr-Kesenchedra,
...
Reviewed by: Lots of people (for different parts)


# 5d328ed4 09-Mar-2004 Alan Cox <alc@FreeBSD.org>

- Make the acquisition of Giant in vm_fault_unwire() conditional on the
pmap. For the kernel pmap, Giant is not required. In general, for
other pmaps, Giant is required by i386's pmap_pte() implementation.
Specifically, the use of PMAP2/PADDR2 is synchronized by Giant.
Note: In principle, updates to the kernel pmap's wired count could be
lost without Giant. However, in practice, we never use the kernel
pmap's wired count. This will be resolved when pmap locking appears.
- With the above change, cpu_thread_clean() and uma_large_free() need
not acquire Giant. (The first case is simply the revival of
i386/i386/vm_machdep.c's revision 1.226 by peter.)


# a3c07611 07-Mar-2004 Robert Watson <rwatson@FreeBSD.org>

Mark uma_callout as CALLOUT_MPSAFE, as uma_timeout can run MPSAFE.

Reviewed by: jeff


# aaa8bb16 31-Jan-2004 Jeff Roberson <jeff@FreeBSD.org>

- Fix a problem where we did not drain the cache of buckets in the zone
when uma_reclaim() was called. This was introduced when the zone
working-set algorithm was removed in favor of using the per cpu caches
as the working set.


# e726bc0e 30-Jan-2004 Dag-Erling Smørgrav <des@FreeBSD.org>

Mechanical whitespace cleanup.


# b6c71225 03-Dec-2003 John Baldwin <jhb@FreeBSD.org>

Fix all users of mp_maxid to use the same semantics, namely:

1) mp_maxid is a valid FreeBSD CPU ID in the range 0 .. MAXCPU - 1.
2) For all active CPUs in the system, PCPU_GET(cpuid) <= mp_maxid.

Approved by: re (scottl)
Tested on: i386, amd64, alpha
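
Under these semantics, a per-CPU walk looks like the following sketch,
with CPU_ABSENT() as the companion check for holes in the CPU ID space:

    int cpu;

    for (cpu = 0; cpu <= mp_maxid; cpu++) {
        if (CPU_ABSENT(cpu))
            continue;
        /* ... touch per-CPU state for "cpu" ... */
    }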


# e30b97c5 30-Nov-2003 Jeff Roberson <jeff@FreeBSD.org>

- Unbreak UP. mp_maxid is not defined on uni-processor machines, although
I believe it and the other MP variables should be. For now, just define it
here and wait for jhb to clean it up later.

Approved by: re (rwatson)


# 504d5de3 30-Nov-2003 Jeff Roberson <jeff@FreeBSD.org>

- Replace the local maxcpu with mp_maxid. Previously, if mp_maxid
was equal to MAXCPU, we would overrun the pcpu_mtx array because maxcpu
was calculated incorrectly.
- Add some more debugging code so that memory leaks at the time of
uma_zdestroy() are more easily diagnosed.

Approved by: re (rwatson)


# d1f42ac2 14-Nov-2003 Alan Cox <alc@FreeBSD.org>

- Remove use of Giant from uma_zone_set_obj().


# 009b6fcb 21-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- Fix MD_SMALL_ALLOC on architectures that support it. Define a new alloc
function, startup_alloc(), that is used for single page allocations prior
to the VM starting up. If it is used after the VM startups up, it
replaces the zone's allocf pointer with either page_alloc() or
uma_small_alloc() where appropriate.

Pointy hat to: me
Tested by: phk/amd64, me/x86
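
A sketch of the handoff described above; the "booted" test and the
allocf signature are simplified assumptions:

    static void *
    startup_alloc(uma_zone_t zone, int bytes, u_int8_t *pflag, int wait)
    {
        /*
         * Once the VM is running, install the permanent allocator
         * and satisfy this request through it.
         */
        if (booted) {
    #ifdef UMA_MD_SMALL_ALLOC
            zone->uz_allocf = uma_small_alloc;
    #else
            zone->uz_allocf = page_alloc;
    #endif
            return (zone->uz_allocf(zone, bytes, pflag, wait));
        }
        /* ... otherwise hand out a page from the boot-page pool ... */
        return (NULL);
    }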


# c43ab0b5 20-Sep-2003 Peter Wemm <peter@FreeBSD.org>

Bad Jeffr! No cookie!

Temporarily disable the UMA_MD_SMALL_ALLOC stuff since recent commits
break sparc64, amd64, ia64 and alpha. It appears only i386 and maybe
powerpc were not broken.


# 9643769a 19-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- Remove the working-set algorithm. Instead, use the per cpu buckets as the
working set cache. This has several advantages. Firstly, we never touch
the per cpu queues now in the timeout handler. This removes one more
reason for having per cpu locks. Secondly, it reduces the size of the zone
by 8 bytes, bringing it under 200 bytes for a single proc x86 box. This
tidies up other logic as well.
- The 'destroy' flag no longer needs to be passed to zone_drain() since it
always frees everything in the zone's slabs.
- cache_drain() is now only called from zone_dtor() and so it destroys by
default. It also does not need the destroy parameter now.


# 3e0cab95 19-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- Remove the cache colorization code. We can't use it due to all of the
broken consumers of the malloc interface who assume that the allocated
address will be an even multiple of the size.
- Remove disabled time delay code on uma_reclaim(). The comment there said
it all. It was not an effective strategy and it should not be left in
#if 0'd for all eternity.


# 64f051e9 19-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- There is an endless stream of style(9) errors in this file. Fix a few.
Also catch some spelling errors.


# 44eca34a 19-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- Don't inspect the zone in page_alloc(). It may be NULL.
- Don't cache more items than the zone would like in uma_zalloc_bucket().


# 45bf76f0 19-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- Move the logic for dealing with the uma_boot_pages cache into the
page_alloc() function from the slab_zalloc() function. This allows us
to unconditionally call uz_allocf().
- In page_alloc(), clean up the boot_pages logic some. Previously memory from
this cache that was not used by the time the system started was left in
the cache and never used. Typically this wasn't more than a few pages,
but now we will use this cache so long as memory is available.


# b60f5b79 19-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- Fix the silly flag situation in UMA. Remove redundant ZFLAG/ZONE flags
by accepting the user supplied flags directly. Previously this was not
done so that flags for the same field would not be defined in two
different files. Add comments in each header instructing future
developers on how not to shoot their feet.
- Fix a test for !OFFPAGE which should have been a test for HASH. This would
have caused a panic if we had ever destructed a malloc zone. This also
opens up the possibility that other zones could use the vsetobj() method
rather than a hash.


# 961647df 19-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- Don't abuse M_DEVBUF, define a tag for UMA hashes.


# b983089a 19-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- Eliminate a pair of unnecessary variables.


# cae33c14 19-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- Initialize a pool of bucket zones so that we waste less space on zones that
don't cache as many items.
- Introduce the bucket_alloc(), bucket_free() functions to wrap bucket
allocation. These functions select the appropriate bucket zone to
allocate from or free to.
- Rename ub_ptr to ub_cnt to reflect a change in its use. ub_cnt now reflects
the count of free items in the bucket. This gets rid of many unnatural
subtractions by 1 throughout the code.
- Add ub_entries which reflects the number of entries possibly held in a
bucket.


# 1c35e213 20-Aug-2003 Bosko Milekic <bmilekic@FreeBSD.org>

In sysctl_vm_zone, do not calculate per-cpu cache stats on
UMA_ZFLAG_INTERNAL zones at all. Apparently, Wilko's alpha
was crashing while entering multi-user because, I think, we
were calculating the garbage cachefree for pcpu caches that
essentially don't exist for at least the 'zones' zone and it so
happened that we were reading from an unmapped location.

Confirmed to fix crash: wilko
Helped debug: wilko, gallatin


# 20e8e865 11-Aug-2003 Bosko Milekic <bmilekic@FreeBSD.org>

- When deciding whether to init the zone with small_init or large_init,
compare the zone element size (+1 for the byte of linkage) against
UMA_SLAB_SIZE - sizeof(struct uma_slab), and not just UMA_SLAB_SIZE.
Add a KASSERT in zone_small_init to make sure that the computed
ipers (items per slab) for the zone is not zero, despite the addition
of the check, just to be sure (this part submitted by: silby)

- UMA_ZONE_VM used to imply BUCKETCACHE. Now it implies
CACHEONLY instead. CACHEONLY is like BUCKETCACHE in the
case of bucket allocations, but in addition to that also ensures that
we don't setup the zone with OFFPAGE slab headers allocated from the
slabzone. This means that we're not allowed to have a UMA_ZONE_VM
zone initialized for large items (zone_large_init) because it would
require the slab headers to be allocated from slabzone, and hence
kmem_map. Some of the zones init'd with UMA_ZONE_VM are so init'd
before kmem_map is suballoc'd from kernel_map, which is why this
change is necessary.


# b245ac95 03-Aug-2003 Alan Cox <alc@FreeBSD.org>

Revise obj_alloc(). Most notably, use the object's lock to prevent two
concurrent invocations from acquiring the same address(es). Also, in case
of an incomplete allocation, free any allocated pages.

In collaboration with: tegge


# 48bf8725 02-Aug-2003 Bosko Milekic <bmilekic@FreeBSD.org>

When INVARIANTS is on and we're in uma_zfree_arg(), we need to make
sure that uma_dbg_free() is called if we're about to call
uma_zfree_internal() but we're asking it to skip the dtor and
uma_dbg_free() call itself. So, if we're about to call
uma_zfree_internal() from uma_zfree_arg() and skip == 1, call
uma_dbg_free() ourselves.


# 174ab450 01-Aug-2003 Bosko Milekic <bmilekic@FreeBSD.org>

Only free the pcpu cache buckets if they are non-NULL.

Crashed this person's machine: harti
Pointy-hat to: me


# d56368d7 30-Jul-2003 Bosko Milekic <bmilekic@FreeBSD.org>

Plug a race and a leak in UMA.

1) The race has to do with zone destruction. From the zone destructor we
would lock the zone, set the working set size to 0, then unlock the zone,
drain it, and then free the structure. Within the window following the
working-set-size set to 0 and unlocking of the zone and the point where
in zone_drain we re-acquire the zone lock, the uma timer routine could
have fired off and changed the working set size to something non-zero,
thereby potentially preventing us from completely freeing slabs before
destroying the zone (and thus leaking them).

2) The leak has to do with zone destruction as well. When destroying a
zone we would take care to free all the buckets cached in the zone, but
although we would drain the pcpu cache buckets, we would not free them.
This resulted in leaking a couple of bucket structures (512 bytes each)
per cpu on SMP during zone destruction.

While I'm here, also silence GCC warnings by turning uma_slab_alloc()
from inline to real function. It's too big to be an inline.

Reviewed by: JeffR


# a40fdcb4 30-Jul-2003 Bosko Milekic <bmilekic@FreeBSD.org>

When generating the zone stats make sure to handle the master zone
("UMA Zone") carefully, because it does not have pcpu caches allocated
at all. In the UP case, we did not catch this because one pcpu cache
is always allocated with the zone, but for the MP case, we were getting
bogus stats for this zone.

Tested by: Lukas Ertl <le@univie.ac.at>


# 7b4bd98a 30-Jul-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Remove the disabling of buckets workaround.

Thanks to: jeffr


# f828e5be 29-Jul-2003 Jeff Roberson <jeff@FreeBSD.org>

- Get rid of the ill-conceived uz_cachefree member of uma_zone.
- In sysctl_vm_zone use the per cpu locks to read the current cache
statistics; this makes them more accurate while under heavy load.

Submitted by: tegge


# d11e0ba5 29-Jul-2003 Jeff Roberson <jeff@FreeBSD.org>

- Check to see if we need a slab prior to allocating one. Failure to do
so not only wastes memory but it can also cause a leak in zones that
will be destroyed later. The problem is that the slab allocation code
places newly created slabs on the partially allocated list because it
assumes that the caller will actually allocate some memory from it.
Failure to do so places an otherwise free slab on the partial slab list
where we won't find it later in zone_drain().

Continuously prodded to fix by: phk (Thanks)


# 0c32d97a 29-Jul-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Temporary workaround: Always disable buckets, there is a bug there
somewhere.

JeffR will look at this as soon as he has time.

OK'ed by: jeffr


# 234c7726 27-Jul-2003 Alan Cox <alc@FreeBSD.org>

None of the "alloc" functions used by UMA assume that Giant is held any
longer. (If they still need it, e.g., contigmalloc(), they acquire it
themselves.) Therefore, we need not acquire Giant in slab_zalloc().


# 0c1a133f 25-Jul-2003 Alan Cox <alc@FreeBSD.org>

Gulp ... call kmem_malloc() without Giant.


# 8522511b 18-Jul-2003 Hartmut Brandt <harti@FreeBSD.org>

When INVARIANTS is defined make sure that uma_zalloc_arg (and hence
uma_zalloc) is called with exactly one of either M_WAITOK or M_NOWAIT and
that it is called with neither M_TRYWAIT or M_DONTWAIT. Print a warning
if anything is wrong. Default to M_WAITOK if no flag is given. This is the
same test as in malloc(9).
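
A sketch of the check, mirroring malloc(9)'s; the message text is
illustrative:

    #ifdef INVARIANTS
    {
        int w = flags & (M_WAITOK | M_NOWAIT);

        if (w != M_WAITOK && w != M_NOWAIT) {
            printf("uma_zalloc_arg: pass exactly one of "
                "M_WAITOK or M_NOWAIT\n");
            if (w == 0)
                flags |= M_WAITOK;  /* documented default */
        }
    }
    #endif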


# d88797c2 25-Jun-2003 Bosko Milekic <bmilekic@FreeBSD.org>

Move the pcpu lock out of the uma_cache and instead have a single set
of pcpu locks. This makes uma_zone somewhat smaller (by (LOCKNAME_LEN *
sizeof(char) + sizeof(struct mtx) * maxcpu) bytes, to be exact).

No Objections from jeff.


# 5c133dfa 25-Jun-2003 Bosko Milekic <bmilekic@FreeBSD.org>

Make sure that the zone destructor doesn't get called twice in
certain free paths.


# 874651b1 11-Jun-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().


# c1f5a182 09-Jun-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Revert last commit, I have no idea what happened.


# 47f94c12 09-Jun-2003 Poul-Henning Kamp <phk@FreeBSD.org>

A white-space nit I noticed.


# 82774d80 28-Apr-2003 Alan Cox <alc@FreeBSD.org>

uma_zone_set_obj() must perform VM_OBJECT_LOCK_INIT() if the caller
provides storage for the vm_object.


# 5103186c 25-Apr-2003 Alan Cox <alc@FreeBSD.org>

Remove an XXX comment. It is no longer a problem.


# 410cfc45 18-Apr-2003 Alan Cox <alc@FreeBSD.org>

Lock the vm_object in obj_alloc().


# b37d8ead 18-Apr-2003 Andrew Gallatin <gallatin@FreeBSD.org>

Don't grab Giant in slab_zalloc() if M_NOWAIT is specified. This
should allow the use of INTR_MPSAFE network drivers.

Tested by: njl
Glanced at by: jeff


# 125ee0d1 26-Mar-2003 Tor Egge <tegge@FreeBSD.org>

Obtain Giant before calling kmem_alloc without M_NOWAIT and before calling
kmem_free if Giant isn't already held.


# 26306795 04-Mar-2003 John Baldwin <jhb@FreeBSD.org>

Replace calls to WITNESS_SLEEP() and witness_list() with equivalent calls
to WITNESS_WARN().


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 886eaaac 04-Feb-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Change a printf to also tell how many items were left in the zone.


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# ebc85edf 19-Jan-2003 Jeff Roberson <jeff@FreeBSD.org>

- M_WAITOK is 0 and not a real flag. Test for this properly.

Submitted by: tmm
Pointy hat to: jeff


# 9d5abbdd 01-Jan-2003 Jens Schweikhardt <schweikh@FreeBSD.org>

Correct typos, mostly s/ a / an / where appropriate. Some whitespace cleanup,
especially in troff files.


# 74c924b5 18-Nov-2002 Jeff Roberson <jeff@FreeBSD.org>

- Wakeup the correct address when a zone is no longer full.

Spotted by: jake


# f3da1873 16-Nov-2002 Jeff Roberson <jeff@FreeBSD.org>

- Don't forget the flags value when using boot pages.

Reported by: grehan


# 81f71eda 11-Nov-2002 Matt Jacob <mjacob@FreeBSD.org>

atomic_set_8 isn't MI. Instead, follow Jake's suggestions about
ZONE_LOCK.


# 48eea375 31-Oct-2002 Jeff Roberson <jeff@FreeBSD.org>

- Add support for machine dependant page allocation routines. MD code
may define UMA_MD_SMALL_ALLOC to make use of this feature.

Reviewed by: peter, jake


# bbee39c6 24-Oct-2002 Jeff Roberson <jeff@FreeBSD.org>

- Now that uma_zalloc_internal is not the fast path don't be so fussy about
extra function calls. Refactor uma_zalloc_internal into separate functions
for finding the most appropriate slab, filling buckets, allocating single
items, and pulling items off of slabs. This makes the code significantly
cleaner.
- This also fixes the "Returning an empty bucket." panic that a few people
have seen.

Tested On: alpha, x86


# bba739ab 24-Oct-2002 Jeff Roberson <jeff@FreeBSD.org>

- Move the destructor calls so that they are not called with the zone lock
held. This avoids a lock order reversal when destroying zones.
Unfortunately, this also means that the free checks are not done before
the destructor is called.

Reported by: phk


# 37c84183 28-Sep-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Be consistent about "static" functions: if the function is marked
static in its prototype, mark it static at the definition too.

Inspired by: FlexeLint warning #512


# f461cf22 19-Sep-2002 Jeff Roberson <jeff@FreeBSD.org>

- Use my freebsd email alias in the copyright.
- Remove redundant instances of my email alias in the file summary.


# 99571dc3 18-Sep-2002 Jeff Roberson <jeff@FreeBSD.org>

- Split UMA_ZFLAG_OFFPAGE into UMA_ZFLAG_OFFPAGE and UMA_ZFLAG_HASH.
- Remove all instances of the mallochash.
- Stash the slab pointer in the vm page's object pointer when allocating from
the kmem_obj.
- Use the overloaded object pointer to find slabs for malloced memory.
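
A sketch of the overloaded-pointer trick; the helper names here are
illustrative:

    /* On allocation: stash the slab pointer in the page's object field. */
    static __inline void
    vsetslab(vm_offset_t va, uma_slab_t slab)
    {
        vm_page_t p;

        p = PHYS_TO_VM_PAGE(pmap_kextract(va));
        p->object = (vm_object_t)slab;
    }

    /* On free: recover the slab backing a malloc'd address. */
    static __inline uma_slab_t
    vtoslab(vm_offset_t va)
    {
        vm_page_t p;

        p = PHYS_TO_VM_PAGE(pmap_kextract(va));
        return ((uma_slab_t)p->object);
    }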


# 55f7c614 21-Aug-2002 Archie Cobbs <archie@FreeBSD.org>

Don't use "NULL" when "0" is really meant.


# 17b9cc49 05-Jul-2002 Jeff Roberson <jeff@FreeBSD.org>

Fix a lock order reversal in uma_zdestroy. The uma_mtx needs to be held across
calls to zone_drain().

Noticed by: scottl


# f5118d6a 04-Jul-2002 Jeff Roberson <jeff@FreeBSD.org>

Remove unnecessary includes.


# e221e841 02-Jul-2002 Jeff Roberson <jeff@FreeBSD.org>

Actually use the fini callback.

Pointy hat to: me :-(
Noticed By: Julian


# 5c0e403b 25-Jun-2002 Jeff Roberson <jeff@FreeBSD.org>

Reduce the amount of code that runs with the zone lock held in slab_zalloc().
This allows us to run the zone initialization functions without any locks held.


# 3370c5bf 19-Jun-2002 Jeff Roberson <jeff@FreeBSD.org>

- Remove bogus use of kmem_alloc that was inherited from the old zone
allocator.
- Properly set M_ZERO when talking to the back end page allocators for
non malloc zones. This forces us to zero fill pages when they are first
brought into a cache.
- Properly handle M_ZERO in uma_zalloc_internal. This fixes a problem where
per cpu buckets weren't always getting zeroed.


# 4741dcbf 17-Jun-2002 Jeff Roberson <jeff@FreeBSD.org>

Honor the BUCKETCACHE flag on free as well.


# 18aa2de5 17-Jun-2002 Jeff Roberson <jeff@FreeBSD.org>

- Introduce the new M_NOVM option which tells uma to only check the currently
allocated slabs and bucket caches for free items. It will not go ask the vm
for pages. This differs from M_NOWAIT in that it not only doesn't block, it
doesn't even ask.

- Add a new zcreate option ZONE_VM, that sets the BUCKETCACHE zflag. This
tells uma that it should only allocate buckets out of the bucket cache, and
not from the VM. It does this by using the M_NOVM option to zalloc when
getting a new bucket. This is so that the VM doesn't recursively enter
itself while trying to allocate buckets for vm_map_entry zones. If there
are already allocated buckets when we get here we'll still use them but
otherwise we'll skip it.

- Use the ZONE_VM flag on vm map entries and pv entries on x86.
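
A sketch of how the bucket-refill path can honor this; the flag
plumbing and the uma_zalloc_internal() arguments are simplified:

    int bflags;

    bflags = M_NOWAIT;
    if (zone->uz_flags & UMA_ZFLAG_BUCKETCACHE)
        bflags |= M_NOVM;   /* cached slabs only; never call the VM */
    bucket = uma_zalloc_internal(bucketzone, NULL, bflags);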


# f97d6ce3 09-Jun-2002 Ian Dowse <iedowse@FreeBSD.org>

Correct the logic for determining whether the per-CPU locks need
to be destroyed. This fixes a problem where destroying a UMA zone
would fail to destroy all zone mutexes.

Reviewed by: jeff


# 494273be 03-Jun-2002 Jeff Roberson <jeff@FreeBSD.org>

Add a comment describing a resource leak that occurs during a failure case
in obj_alloc.


# 4c1cc01c 20-May-2002 John Baldwin <jhb@FreeBSD.org>

In uma_zalloc_arg(), if we are performing a M_WAITOK allocation, ensure
that td_intr_nesting_level is 0 (like malloc() does). Since malloc() calls
uma we can probably remove the check in malloc() for this now. Also,
perform an extra witness check in that case to make sure we don't hold
any locks when performing a M_WAITOK allocation.


# 713deb36 12-May-2002 Jeff Roberson <jeff@FreeBSD.org>

Don't call the uz free function while the zone lock is held. This can lead
to lock order reversals. uma_reclaim now builds a list of freeable slabs and
then unlocks the zones to do all of the frees.


# 0aef6126 12-May-2002 Jeff Roberson <jeff@FreeBSD.org>

Remove the hash_free() lock order reversal. This could have happened for
several reasons before. Fixing it involved restructuring the generic hash
code to require calling code to handle locking, unlocking, and freeing hashes
on error conditions.


# c7173f58 04-May-2002 Jeff Roberson <jeff@FreeBSD.org>

Use pages instead of uz_maxpages, which has not been initialized yet, when
creating the vm_object. This was broken after the code was rearranged to
grab giant itself.

Spotted by: alc


# b9ba8931 02-May-2002 Jeff Roberson <jeff@FreeBSD.org>

Move around the dbg code a bit so it's always under a lock. This stops a
weird potential race if we were preempted right as we were doing the dbg
checks.


# c3bdc05f 02-May-2002 Andrew R. Reiter <arr@FreeBSD.org>

- Changed the size element of uma_zctor_args to be size_t instead of int.
- Changed uma_zcreate to accept the size argument as a size_t instead of
int.

Approved by: jeff


# 5a34a9f0 02-May-2002 Jeff Roberson <jeff@FreeBSD.org>

malloc/free(9) no longer require Giant. Use the malloc_mtx to protect the
mallochash. Mallochash is going to go away as soon as I introduce the
kfree/kmalloc api and partially overhaul the malloc wrapper. This can't happen
until all users of the malloc api that expect memory to be aligned on the size
of the allocation are fixed.


# 639c9550 01-May-2002 Jeff Roberson <jeff@FreeBSD.org>

Remove the temporary alignment check in free().

Implement the following checks on freed memory in the bucket path:
- Slab membership
- Alignment
- Duplicate free

This previously was only done if we skipped the buckets. This code will slow
down INVARIANTS a bit, but it is SMP safe. The checks were moved out of the
normal path and into hooks supplied in uma_dbg.


# 2cc35ff9 29-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Move the implementation of M_ZERO into UMA so that it can be passed to
uma_zalloc and friends. Remove this functionality from the malloc wrapper.

Document this change in uma.h and adjust variable names in uma_core.
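
Callers can then request zeroed items directly from the zone, the same
way malloc(9) accepts the flag:

    item = uma_zalloc(zone, M_WAITOK | M_ZERO);  /* memory is zeroed */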


# 28bc4419 29-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Add a new zone flag UMA_ZONE_MTXCLASS. This puts the zone in its own
mutex class. Currently this is only used for kmapentzone because kmapents
are potentially allocated when freeing memory. This is not dangerous
though because no other allocations will be done while holding the
kmapentzone lock.


# d4d6aee5 25-Apr-2002 Andrew R. Reiter <arr@FreeBSD.org>

- Fix a round down bogon in uma_zone_set_max().

Submitted by: jeff@


# 5300d9dd 14-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Fix a witness warning when expanding a hash table. We were allocating the new
hash while holding the lock on a zone. Fix this by doing the allocation
separately from the actual hash expansion.

The lock is dropped before the allocation and reacquired before the expansion.
The expansion code checks to see if we lost the race and frees the new hash
if we do. We really never will lose this race because the hash expansion is
single threaded via the timeout mechanism.
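
A sketch of the drop-allocate-recheck pattern; the helper semantics
are simplified from the commit text:

    ZONE_UNLOCK(zone);
    hash_alloc(&newhash);       /* may sleep; zone lock not held */
    ZONE_LOCK(zone);
    if (hash_expand(&zone->uz_hash, &newhash)) {
        /* We won (the usual case); the zone now uses newhash. */
    } else {
        /* Lost the (theoretical) race; discard our copy. */
        ZONE_UNLOCK(zone);
        hash_free(&newhash);
        ZONE_LOCK(zone);
    }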


# 0da47b2f 13-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Protect the initial list traversal in sysctl_vm_zone() with the uma_mtx.


# af7f9b97 13-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Fix the calculation that determines uz_maxpages. It was off for large zones.
Fortunately we have no large zones with maximums specified yet, so it wasn't
breaking anything.

Implement blocking when a zone exceeds the maximum and M_WAITOK is specified.
Previously this just failed like the old zone allocator did. The old zone
allocator didn't support WAITOK/NOWAIT though so we should do what we
advertise.

While I was in there I cleaned up some more zalloc logic to further simplify
that code path and reduce redundant code. This was needed to make the blocking
work properly anyway.
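
A sketch of the blocking path; the flag and wait-message names are
illustrative:

    ZONE_LOCK(zone);
    while (zone->uz_maxpages != 0 && zone->uz_pages >= zone->uz_maxpages) {
        if ((flags & M_WAITOK) == 0) {
            ZONE_UNLOCK(zone);
            return (NULL);  /* M_NOWAIT keeps the old behavior */
        }
        zone->uz_flags |= UMA_ZFLAG_FULL;
        msleep(zone, &zone->uz_lock, PVM, "zonelimit", 0);
        /* The free path does wakeup(zone) when pages are returned. */
    }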


# bce97791 09-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Remember to unlock the zone if the fill count is too high.

Pointed out by: pete, jake, jhb


# 86bbae32 08-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Add a mechanism to disable buckets when the v_free_count drops below
v_free_min. This should help performance in memory starved situations.


# 605cbd6a 07-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Don't release the zone lock until after the dtor has been called. As far as I
can tell this could not have caused any problems yet because UMA is still
called with Giant.

Pointy hat to: jeff
Noticed by: jake


# 9c2cd7e5 07-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Implement uma_zdestroy(). Its prototype changed slightly. I decided that I
didn't like the wait argument and that if you were removing a zone it had
better be empty.

Also, I broke out part of hash_expand and made a separate hash_free() for use
in uma_zdestroy.


# a553d4b8 07-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Rework most of the bucket allocation and free code so that per cpu locks are
never held across blocking operations. Also, fix two other lock order
reversals that were exposed by jhb's witness change.

The free path previously had a bug that would cause it to skip the free bucket
list in some cases and go straight to allocating a new bucket. This has been
fixed as well.

These changes made the bucket handling code much cleaner and removed quite a
few lock operations. This should be marginally faster now.

It is now possible to call malloc w/o Giant and avoid any witness warnings.
This still isn't entirely safe though because malloc_type statistics are not
protected by any lock.


# d0b06acb 07-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

This fixes a bug where isitem never got set to 1 if a certain chain of events
relating to extremely low memory situations occurred. This was only ever seen on
the port build cluster, so many thanks to kris for helping me debug this.

Tested by: kris


# 6008862b 04-Apr-2002 John Baldwin <jhb@FreeBSD.org>

Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on: i386, alpha, sparc64
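
After this change the initializer carries a separate type string;
roughly as follows, with the type strings shown being illustrative:

    /* Most callers pass NULL: the name doubles as the type. */
    mtx_init(&sc->sc_mtx, "mydriver softc", NULL, MTX_DEF);

    /* UMA zone locks share one type so witness can group them. */
    mtx_init(&zone->uz_lock, zone->uz_name, "UMA zone", MTX_DEF);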


# 157d7b35 02-Apr-2002 Alfred Perlstein <alfred@FreeBSD.org>

fix comment typo, s/neccisary/necessary/g


# f4af24d5 24-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

Reset the cachefree statistics after draining the cache. This fixes a bug
where a sysctl within 20 seconds of a cache_drain could yield negative "USED"
counts.

Also, grab the uma_mtx while in the sysctl handler. This hadn't caused
problems yet because Giant is held all the time.

Reported by: kkenn


# 736ee590 19-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

Add uma_zone_set_max() to add enforced limits to non vm obj backed zones.


# 8355f576 19-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

This is the first part of the new kernel memory allocator. This replaces
malloc(9) and vm_zone with a slab like allocator.

Reviewed by: arch@