History log of /freebsd-current/sys/vm/vm_page.c
Revision Date Author Comments
# 10f2e94a 02-Jan-2024 Jason A. Harmening <jah@FreeBSD.org>

vm_page_reclaim_contig(): update comment to chase recent changes

Commit 2619c5ccfe ("Avoid waiting on physical allocations that can't
possibly be satisfied") changed the return value from bool to errno.
Adjust the function description to match reality.


# 2619c5cc 20-Nov-2023 Jason A. Harmening <jah@FreeBSD.org>

Avoid waiting on physical allocations that can't possibly be satisfied

- Change vm_page_reclaim_contig[_domain] to return an errno instead
of a boolean. 0 indicates a successful reclaim, ENOMEM indicates
lack of available memory to reclaim, with any other error (currently
only ERANGE) indicating that reclamation is impossible for the
specified address range. Change all callers to only follow
up with vm_page_wait* in the ENOMEM case.

- Introduce vm_domainset_iter_ignore(), which marks the specified
domain as unavailable for further use by the iterator. Use this
function to ignore domains that can't possibly satisfy a physical
allocation request. Since WAITOK allocations run the iterators
repeatedly, this avoids the possibility of infinitely spinning
in domain iteration if no available domain can satisfy the
allocation request.
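
A rough userspace sketch of the caller-side policy described in the
first item. The names reclaim_followup() and enum reclaim_action are
hypothetical and are not kernel interfaces; only the meaning of 0,
ENOMEM and "any other error" is taken from the text above.

```c
#include <errno.h>

/* Hypothetical follow-up actions after a reclaim attempt. */
enum reclaim_action {
	RECLAIM_RETRY_ALLOC,	/* 0: reclaim succeeded, retry the allocation */
	RECLAIM_WAIT,		/* ENOMEM: waiting for free pages may help */
	RECLAIM_GIVE_UP		/* anything else (e.g. ERANGE): cannot succeed */
};

/* Map the errno-style result of a reclaim attempt to a follow-up action. */
static enum reclaim_action
reclaim_followup(int error)
{
	switch (error) {
	case 0:
		return (RECLAIM_RETRY_ALLOC);
	case ENOMEM:
		return (RECLAIM_WAIT);
	default:
		return (RECLAIM_GIVE_UP);
	}
}
```

The point of the three-way split is that only the ENOMEM case is worth
sleeping on; sleeping on an impossible request would never terminate.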

PR: 274252
Reported by: kevans
Tested by: kevans
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D42706


# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixups to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# a55fbda8 12-Oct-2023 Zhenlei Huang <zlei@FreeBSD.org>

vm_page: Add corresponding sysctl knob for loader tunable

The loader tunable 'vm.pgcache_zone_max_pcpu' does not have corresponding
sysctl MIB entry. Add it so that it can be retrieved, and `sysctl -T`
will also report it correctly.

Reviewed by: markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D42138


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 9e817428 16-Jun-2023 Doug Moore <dougm@FreeBSD.org>

vm_phys: add binary segment search

Replace several sequential searches for a segment that contains a
physical address with a call to a function that does it by binary
search. In vm_page_reclaim_contig_domain_ext, find the first segment
to reclaim from, and reclaim from each subsequent appropriate segment.
Eliminate vm_phys_scan_contig.
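
For illustration only, a minimal userspace sketch of a binary search
for the segment containing an address. struct seg and seg_search() are
hypothetical stand-ins, not the kernel's types or function names, and
the sketch assumes the segment array is sorted and non-overlapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a physical segment covering [start, end). */
struct seg {
	uint64_t start;
	uint64_t end;
};

/*
 * Return the index of the segment containing pa, or -1 if none does.
 * Assumes segs[] is sorted by address and non-overlapping.
 */
static int
seg_search(const struct seg *segs, int nsegs, uint64_t pa)
{
	int lo = 0, hi = nsegs - 1;

	assert(segs != NULL || nsegs == 0);
	while (lo <= hi) {
		int mid = lo + (hi - lo) / 2;

		if (pa < segs[mid].start)
			hi = mid - 1;		/* pa lies below this segment */
		else if (pa >= segs[mid].end)
			lo = mid + 1;		/* pa lies above this segment */
		else
			return (mid);		/* start <= pa < end */
	}
	return (-1);
}
```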

Reviewed by: alc, markj
Differential Revision: https://reviews.freebsd.org/D40058


# 6062d9fa 05-Jun-2023 Mark Johnston <markj@FreeBSD.org>

vm_phys: Change the return type of vm_phys_unfree_page() to bool

This is in keeping with the trend of removing uses of boolean_t, and the
sole caller was implicitly converting it to a "bool".

No functional change intended.

Reviewed by: dougm, alc, imp, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D40401


# 8b0dafdb 08-May-2023 Andrew Gallatin <gallatin@FreeBSD.org>

vm: implement vm_page_reclaim_contig_domain_ext()

Implement vm_page_reclaim_contig_domain_ext() to reclaim multiple
contiguous regions at once. This is more efficient for users that
need to reclaim multiple contiguous regions.

This is needed because callers like ktls may need to reclaim many
contiguous regions, and each scan of physical memory can take
multiple seconds on a large memory machine (order of 100GB of
RAM). Rather than modifying the core algorithm, I extended
vm_page_reclaim_contig_domain() to take a "desired_runs" argument to
allow the caller to request that it reclaim more than just a single
run. There is no functional change intended for all existing
callers.

The first user for this interface is the ktls code
(https://reviews.freebsd.org/D39421). By reclaiming multiple runs,
ktls goes from consuming hours of CPU to refill its buffer zone to
just seconds or minutes.

Differential Revision: https://reviews.freebsd.org/D39739
Sponsored by: Netflix
Reviewed by: alc, jhb, markj


# 32494491 16-Dec-2022 Konstantin Belousov <kib@FreeBSD.org>

vm_page_grab_valid(): clear *mp in case of pager denying page allocation

Same as it is done in other error return cases. Callers depend on error
case returning NULL, e.g. vm_imgact_hold_page().

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37719


# 1cac76c9 14-Dec-2022 Andrew Gallatin <gallatin@FreeBSD.org>

vm: reduce lock contention when processing vm batchqueues

Rather than waiting until the batchqueue is full to acquire the lock &
process the queue, we now start trying to acquire the lock using trylocks
when the batchqueue is 1/2 full. This removes almost all contention on the
vm pagequeue mutex for our busy sendfile() based web workload.
It also greatly reduces the amount of time a network driver ithread
remains blocked on a mutex, and eliminates some packet drops under
heavy load.

So that the system does not lose the benefit of processing large
batchqueues, I've doubled the size of the batchqueues. This way, when
there is no contention, we process the same batch size as before.
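
A simplified userspace model of the policy described above, offered as
a sketch rather than the kernel code. struct batch, batch_submit() and
BATCH_SIZE are hypothetical; a C11 atomic_flag stands in for the page
queue mutex, and the "full" case spins where the kernel would block.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define BATCH_SIZE 16	/* hypothetical capacity, not the kernel's value */

struct batch {
	atomic_flag lock;	/* stand-in for the page queue mutex */
	int cnt;		/* entries currently queued in the batch */
	int processed;		/* entries drained so far */
};

/* Drain the batch; the caller must hold the lock. */
static void
batch_drain(struct batch *b)
{
	b->processed += b->cnt;
	b->cnt = 0;
}

/*
 * Queue one entry.  Below half full, just queue.  From half full on,
 * try the lock and drain if it is free.  When full, wait for the lock.
 * Returns true if the batch was drained.
 */
static bool
batch_submit(struct batch *b)
{
	assert(b->cnt < BATCH_SIZE);
	b->cnt++;
	if (b->cnt < BATCH_SIZE / 2)
		return (false);
	if (atomic_flag_test_and_set(&b->lock)) {
		/* Lock is busy: defer unless the batch is full. */
		if (b->cnt < BATCH_SIZE)
			return (false);
		while (atomic_flag_test_and_set(&b->lock))
			;	/* full: must wait for the lock */
	}
	batch_drain(b);
	atomic_flag_clear(&b->lock);
	return (true);
}
```

With the capacity doubled, an uncontended batch is drained at the
half-way mark, which corresponds to the old full-batch size.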

This has been run for several months on a busy Netflix server, as well
as on my personal desktop.

Reviewed by: markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D37305


# ec201ddd 20-Oct-2022 Konstantin Belousov <kib@FreeBSD.org>

vm_pager: add method to veto page allocation

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37097


# d537d1f1 20-Oct-2022 Konstantin Belousov <kib@FreeBSD.org>

vm_pager: add methods for page insertion and removal notifications

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37097


# cfbf1da0 09-Nov-2022 Anton Rang <rang@acm.org>

vm_page_unswappable: remove wrong assertion

markj says:

...the assertion is incorrect and should simply be removed.
It has been racy since we removed the use of the page hash
lock to synchronize wiring of pages.

PR: 267621
Reviewed by: markj, Anton Rang <rang@acm.org>
MFC after: 1 week
Sponsored by: Dell Inc.
Differential Revision: https://reviews.freebsd.org/D37320


# 934bfc12 17-Oct-2022 Konstantin Belousov <kib@FreeBSD.org>

Add vm_page_any_valid()

Use it and several other vm_page_*_valid() functions in more places.

Suggested and reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37024


# 2c9dc238 05-Oct-2022 Mark Johnston <markj@FreeBSD.org>

vm_page: Fix a logic error in the handling of PQ_ACTIVE operations

As an optimization, vm_page_activate() avoids requeuing a page that's
already in the active queue. A page's location in the active queue is
mostly unimportant.

When a page is unwired and placed back in the page queues,
vm_page_unwire() avoids moving pages out of PQ_ACTIVE to honour the
request, the idea being that they're likely mapped and so will simply
get bounced back into PQ_ACTIVE during a queue scan.

In both cases, if the page was logically in PQ_ACTIVE but had not yet
been physically enqueued (i.e., the page is in a per-CPU batch), we
would end up clearing PGA_REQUEUE from the page. Then, batch processing
would ignore the page, so it would end up unwired and not in any queues.
This can arise, for example, when a page is allocated and then
vm_page_activate() is called multiple times in quick succession. The
result is that the page is hidden from the page daemon, so while it will
be freed when its VM object is destroyed, it cannot be reclaimed under
memory pressure.

Fix the bug: when checking if a page is in PQ_ACTIVE, only perform the
optimization if the page is physically enqueued.
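
The corrected check can be sketched roughly as follows. This is a
userspace model, not the committed diff: struct astate,
can_skip_requeue() and the numeric flag values are hypothetical, and
only the names PQ_ACTIVE, PGA_REQUEUE and PGA_ENQUEUED come from the
kernel. It assumes PGA_ENQUEUED is the flag that marks a page as
physically enqueued.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical values; only the names mirror the kernel's. */
#define PQ_ACTIVE	1
#define PGA_ENQUEUED	0x01	/* page is physically on its queue */
#define PGA_REQUEUE	0x02	/* a batched (re)queue operation is pending */

/* Hypothetical snapshot of a page's queue state. */
struct astate {
	uint8_t queue;
	uint8_t flags;
};

/*
 * The "already active" shortcut is only safe when the page is
 * physically enqueued, not merely logically in PQ_ACTIVE.
 */
static bool
can_skip_requeue(struct astate as)
{
	return (as.queue == PQ_ACTIVE && (as.flags & PGA_ENQUEUED) != 0);
}
```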

PR: 256507
Fixes: f3f38e2580f1 ("Start implementing queue state updates using fcmpset loops.")
Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: E-CARD Ltd.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D36839


# 1b89d40f 17-Aug-2022 Mateusz Guzik <mjg@FreeBSD.org>

Revert "vm: use atomic fetchadd in vm_page_sunbusy"

This reverts commit f6ffed44a8eb5d1ab89a18e60fb056aab2105be7.

fetchadd will fail to clear the waiters flag, which can cause other
code to wait when it should not, with nothing to clear it.

Revert until I sort this out.

Reported by: markj


# f6ffed44 05-Aug-2022 Mateusz Guzik <mjg@FreeBSD.org>

vm: use atomic fetchadd in vm_page_sunbusy

This also fixes a bug where not-last unbusy failed to post a release
fence.

Reviewed by: markj (previous version), kib (previous version)
Differential Revision: https://reviews.freebsd.org/D36084


# c84c5e00 18-Jul-2022 Mitchell Horne <mhorne@FreeBSD.org>

ddb: annotate some commands with DB_CMD_MEMSAFE

This is not completely exhaustive, but covers a large majority of
commands in the tree.

Reviewed by: markj
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D35583


# 0cb2610e 16-Jul-2022 Mark Johnston <markj@FreeBSD.org>

vm: Remove handling for OBJT_DEFAULT objects

Now that OBJT_DEFAULT objects can't be instantiated, we can simplify
checks of the form object->type == OBJT_DEFAULT || (object->flags &
OBJ_SWAP) != 0. No functional change intended.

Reviewed by: alc, kib
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35788


# f77a88c8 03-Jun-2022 Gordon Bergling <gbe@FreeBSD.org>

vm_page: Fix a typo in a source code comment

- s/consistancy/consistency/

MFC after: 3 days


# 6fb7c42d 14-Apr-2022 Mark Johnston <markj@FreeBSD.org>

vm: Move the "vm_wait in early boot" assertion to the proper place

The assertion was added in commit 1771e987ca6a. After that, vm_wait()
and friends were refactored such that the actual sleep happens
elsewhere. Now the assertion condition is not checked when
vm_wait_doms() is called directly, and it is checked even if we are not
going to sleep (because vm_page_count_min_set(wdoms) is false).

Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34909


# b8ebd99a 13-Apr-2022 John Baldwin <jhb@FreeBSD.org>

vm: Use __diagused for variables only used in KASSERT().


# c25a30e2 05-Jan-2022 Konstantin Belousov <kib@FreeBSD.org>

Dump page tracking no longer needed on mips

Reviewed by: imp
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D33763


# c606ab59 30-Dec-2021 Doug Moore <dougm@FreeBSD.org>

vm_extern: use standard address checkers everywhere

Define simple functions for alignment and boundary checks and use them
everywhere instead of having slightly different implementations
scattered about. Define them in vm_extern.h and use them where
possible where vm_extern.h is included.
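
As an illustration of what such checks typically look like, here is a
self-contained sketch. The names addr_align_ok() and addr_bound_ok()
are hypothetical, since the commit message does not name the new
functions. Both assume the alignment and boundary are powers of 2 and
that size is nonzero.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if pa is a multiple of alignment (a power of 2). */
static inline bool
addr_align_ok(uint64_t pa, uint64_t alignment)
{
	return ((pa & (alignment - 1)) == 0);
}

/*
 * True if [pa, pa + size) does not cross a multiple of boundary
 * (a power of 2, or 0 for "no boundary constraint").
 */
static inline bool
addr_bound_ok(uint64_t pa, uint64_t size, uint64_t boundary)
{
	/* First and last byte must agree in all bits at or above boundary. */
	return (((pa ^ (pa + size - 1)) & -boundary) == 0);
}
```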

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D33685


# 4bae154f 24-Dec-2021 Doug Moore <dougm@FreeBSD.org>

vm_page: Move a comment

fb38b29b5609 (page_alloc_br) vm_page: Remove extra test, dup code from page alloc
should have moved a comment block when it moved the function call that followed it.

Move the comment block now.


# 0d5fac28 23-Dec-2021 Doug Moore <dougm@FreeBSD.org>

vm: alloc pages from reserv before breaking it

Function vm_reserv_reclaim_contig breaks a reservation with enough
free space to satisfy an allocation request and returns the free space
to the buddy allocator. Change the function to allocate the request
memory from the reservation before breaking it, and return that memory
to the caller. That avoids a second call to the buddy allocator and
guarantees successful allocation after breaking the reservation, where
that success is not currently guaranteed.

Reviewed by: alc, kib (previous version)
Differential Revision: https://reviews.freebsd.org/D33644


# 184c63db 24-Dec-2021 Doug Moore <dougm@FreeBSD.org>

Fix clerical error in page alloc

Fix a very recent change that introduced a page accounting error in
case of a reservation being broken.

Reviewed by: alc
Fixes: fb38b29b5609 (page_alloc_br) vm_page: Remove extra test, dup code from page alloc
Differential Revision: https://reviews.freebsd.org/D33645


# fb38b29b 23-Dec-2021 Doug Moore <dougm@FreeBSD.org>

vm_page: Remove extra test, dup code from page alloc

Extract code common to functions vm_page_alloc_contig_domain and
vm_page_alloc_noobj_contig_domain into a new function. Do so in a way
that eliminates a bound-to-fail reservation test after a reservation
is broken by a call from vm_page_alloc_contig_domain.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D33551


# 39a7396f 05-Dec-2021 Mark Johnston <markj@FreeBSD.org>

vm_page: Tighten the object lock assertion in vm_page_invalid()

A page must not become invalid while vm_fault_soft_fast() is attempting
to map unbusied pages for reading.

Note that all callers hold the object write lock already, and
vm_page_set_invalid() asserts the object write lock.

Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33250


# 87b64663 15-Nov-2021 Mark Johnston <markj@FreeBSD.org>

vm_page: Consolidate page busy sleep mechanisms

- Modify vm_page_busy_sleep() and vm_page_busy_sleep_unlocked() to take
a VM_ALLOC_* flag indicating whether to sleep on shared-busy, and fix
up callers.
- Modify vm_page_busy_sleep() to return a status indicating whether the
object lock was dropped, and fix up callers.
- Convert callers of vm_page_sleep_if_busy() to use vm_page_busy_sleep()
instead.
- Remove vm_page_sleep_if_(x)busy().

No functional change intended.

Obtained from: jeff (object_concurrency patches)
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D32947


# e4bdb685 11-Nov-2021 Mark Johnston <markj@FreeBSD.org>

vm_page: Handle VM_ALLOC_NORECLAIM in the contiguous page allocator

We added _NORECLAIM to request that kmem_alloc_contig_pages() not spend
time scanning physical memory for candidates to reclaim. In some
situations the scanning can induce large amounts of undesirable latency,
and it's less important that the request be satisfied than it is that we
not spend many milliseconds scanning.

The problem extends to vm_reserv_reclaim_contig(), which unlike
vm_reserv_reclaim() may have to scan the entire list of partially
populated reservations. Use VM_ALLOC_NORECLAIM to request that this
scan not be executed.[1]

As a side effect, this fixes a regression in 02fb0585e7b3 ("vm_page:
Drop handling of VM_ALLOC_NOOBJ in vm_page_alloc_contig_domain()")
where VM_ALLOC_CONTIG was not included in VPAC_FLAGS or VPANC_FLAGS even
though it is not masked by kmem_alloc_contig_pages().[2]

Reported by: gallatin [1], glebius [2]
Reviewed by: alc, glebius, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32899


# d7acbe48 21-Oct-2021 Mark Johnston <markj@FreeBSD.org>

vm_page: Break reservations to handle noobj allocations

vm_reserv_reclaim_*() will release pages to the default freepool, not
the direct freepool from which noobj allocations are drawn. But if both
pools are empty, the noobj allocator variants must break reservations to
make progress.

Reported by: cy
Reviewed by: kib (previous version)
Fixes: b498f71bc56a ("vm_page: Add a new page allocator interface for unnamed pages")
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32592


# 02fb0585 19-Oct-2021 Mark Johnston <markj@FreeBSD.org>

vm_page: Drop handling of VM_ALLOC_NOOBJ in vm_page_alloc_contig_domain()

As in vm_page_alloc_domain_after(), unconditionally preserve PG_ZERO.

Implement vm_page_alloc_noobj_contig_domain().

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32034


# c40cf9bc 19-Oct-2021 Mark Johnston <markj@FreeBSD.org>

vm_page: Stop handling VM_ALLOC_NOOBJ in vm_page_alloc_domain_after()

This makes the allocator simpler since it can assume object != NULL.
Also modify the function to unconditionally preserve PG_ZERO, so
VM_ALLOC_ZERO is effectively ignored (and still must be implemented by
the caller for now).

Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32033


# 84c39222 19-Oct-2021 Mark Johnston <markj@FreeBSD.org>

Convert consumers to vm_page_alloc_noobj_contig()

Remove now-unneeded page zeroing. No functional change intended.

Reviewed by: alc, hselasky, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32006


# 92db9f3b 19-Oct-2021 Mark Johnston <markj@FreeBSD.org>

Introduce vm_page_alloc_noobj_contig()

This is the same as vm_page_alloc_noobj(), but allocates physically
contiguous runs of memory. For now it is implemented in terms of
vm_page_alloc_contig(), with the difference that
vm_page_alloc_noobj_contig() implements VM_ALLOC_ZERO by zeroing the
page.

Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32005


# a4667e09 19-Oct-2021 Mark Johnston <markj@FreeBSD.org>

Convert vm_page_alloc() callers to use vm_page_alloc_noobj().

Remove page zeroing code from consumers and stop specifying
VM_ALLOC_NOOBJ. In a few places, also convert an allocation loop to
simply use VM_ALLOC_WAITOK.

Similarly, convert vm_page_alloc_domain() callers.

Note that callers are now responsible for assigning the pindex.

Reviewed by: alc, hselasky, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31986


# b498f71b 19-Oct-2021 Mark Johnston <markj@FreeBSD.org>

vm_page: Add a new page allocator interface for unnamed pages

The diff adds vm_page_alloc_noobj() and vm_page_alloc_noobj_domain().
These mostly correspond to vm_page_alloc() and vm_page_alloc_domain()
when no VM object is specified, with the exception that they handle
VM_ALLOC_ZERO by zeroing the page, rather than by preserving PG_ZERO.

This simplifies callers and will permit simplification of the
vm_page_alloc_domain() definition.

Since the new allocator variant is similar to vm_page_alloc_freelist(),
implement both of them using a common backend allocator function. No
functional change intended.

Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31985


# a23e6a10 19-Oct-2021 Mark Johnston <markj@FreeBSD.org>

vm_page: Move vm_page_alloc_check() to after page allocator definitions

This way all of the vm_page_alloc_*() allocator functions are grouped
together.

MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# bd3a6680 17-Sep-2021 Konstantin Belousov <kib@FreeBSD.org>

vm_page_startup: correct calculation of the starting page

Also avoid unneeded calculations when phys segment end is the phys_avail[]
start.

Submitted by: alc
Reviewed by: markj
MFC after: 1 week
Fixes: 181bfb42fd01bfa9f46
Differential revision: https://reviews.freebsd.org/D32009


# 181bfb42 14-Sep-2021 Konstantin Belousov <kib@FreeBSD.org>

vm_phys: do not ignore phys_avail[] segments that do not fit completely into vm_phys segments

If a phys_avail[] segment only intersects with some vm_phys segment, add
pages from it to the free list that belong to the given vm_phys_seg,
instead of dropping them.

The vm_phys segments are generally the result of subdivision of phys_avail
segments, for instance DMA32 or LOWMEM boundaries split them. On
amd64, after UEFI in-place kernel activation (copy_staging disable)
was enabled, we typically have a large phys_avail[] segment below 4G
which crosses LOWMEM (1M) boundary. With the current way of requiring
phys_avail[] fully fit into vm_phys_seg, this memory was ignored.

Reported by: madpilot
Reviewed by: markj
Discussed with: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31958


# eccb516d 22-Aug-2021 Bjoern A. Zeeb <bz@FreeBSD.org>

vm: use __func__ for the correct function name

In fee2a2fa39834d8d5eaa981298fce9d2ed31546d the KASSERTs in
vm_page_unwire_noq() changed from "vm_page_unwire" to "vm_page_unref".
While the former no longer was part of that function the latter does
not exist as a function and is highly confusing when hit when using
tools to look up the functions and not doing a full-text search.
Use %s __func__ for printing the function name, as that will do the
right thing as code moves around and functions get renamed.
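
A small userspace sketch of the idea. FORMAT_MSG() and
page_unwire_check() are hypothetical names and snprintf() stands in for
the kernel's assertion machinery. Because __func__ expands to the name
of the enclosing function, the message stays correct if the function is
renamed.

```c
#include <stdio.h>

/* Format a message prefixed with the name of the enclosing function. */
#define FORMAT_MSG(buf, len, what)					\
	snprintf((buf), (len), "%s: %s", __func__, (what))

/* Hypothetical checker: writes its diagnostic into buf and returns it. */
static const char *
page_unwire_check(char *buf, size_t len)
{
	FORMAT_MSG(buf, len, "page has no wirings");
	return (buf);
}
```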

Hit: while debugging a wired page leak with linuxkpi/iwlwifi
Sponsored by: The FreeBSD Foundation
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D31635


# 041b7317 10-Jul-2021 Konstantin Belousov <kib@FreeBSD.org>

Add pmap_vm_page_alloc_check()

which is the place to put MD asserts about allocated pages.

On amd64, verify that allocated page does not belong to the kernel
(text, data) or early allocated pages.

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31121


# 5b10e79e 17-Jun-2021 Konstantin Belousov <kib@FreeBSD.org>

Un-staticise vm_page_init_page()

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30785


# 4b8365d7 30-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

Add OBJT_SWAP_TMPFS pager

This is OBJT_SWAP pager, specialized for tmpfs. Right now, both swap pager
and generic vm code have to explicitly handle swap objects which are tmpfs
vnode v_object, in the special ways. Replace (almost) all such places with
proper methods.

Since VM still needs a notion of the 'swap object', regardless of its
use, add yet another type-classification flag OBJ_SWAP. Set it in
vm_object_allocate() where other type-class flags are set.

This change almost completely eliminates the knowledge of tmpfs from VM,
and opens a way to make OBJT_SWAP_TMPFS loadable from tmpfs.ko.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070


# 04019892 02-Mar-2021 Mark Johnston <markj@FreeBSD.org>

vm: Round up npages and alignment for contig reclamation

When searching for runs to reclaim, we need to ensure that the entire
run will be added to the buddy allocator as a single unit. Otherwise,
it will not be visible to vm_phys_alloc_contig() as it is currently
implemented. This is a problem for allocation requests that are not a
power of 2 in size, as with 9KB jumbo mbuf clusters.
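
The rounding itself can be sketched as below, assuming 4 KiB pages.
pages_for() and roundup_pow2() are hypothetical helper names used only
for this example. A 9KB request needs 3 pages, which rounds up to 4.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096u	/* assumed page size */

/* Number of pages needed to hold size bytes. */
static uint64_t
pages_for(uint64_t size)
{
	return ((size + PAGE_SIZE - 1) / PAGE_SIZE);
}

/* Round n up to the next power of 2; n must be at least 1. */
static uint64_t
roundup_pow2(uint64_t n)
{
	uint64_t p = 1;

	assert(n >= 1 && n <= (UINT64_C(1) << 63));
	while (p < n)
		p <<= 1;
	return (p);
}
```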

Reported by: alc
Reviewed by: alc
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28924


# 14b5a3c7 24-Feb-2021 Max Laier <mlaier@FreeBSD.org>

vm pqbatch: move unmanaged page assert under pagequeue lock

This KASSERT is overzealous because of the following race condition:
1) A managed page which is currently in PQ_LAUNDRY is freed.
vm_page_free_prep calls vm_page_dequeue_deferred()

The page state is:
PQ_LAUNDRY, PGA_DEQUEUE|PGA_ENQUEUED

2) The laundry worker comes around and picks up the page and calls
vm_pageout_defer(m, PQ_LAUNDRY, true) to check if page is still in the
queue. We do a vm_page_astate_load and get
PQ_LAUNDRY, PGA_DEQUEUE|PGA_ENQUEUED
as per above.

3) The laundry worker is pre-empted and another thread allocates our page
from the free pool. For example vm_page_alloc_domain_after calls
vm_page_dequeue() and sets VPO_UNMANAGED because we are allocating for
an OBJT_UNMANAGED object.

The page state is:
PQ_NONE, 0 - VPO_UNMANAGED

4) The laundry worker resumes, and processes vm_pageout_defer based on the
stale astate which leads to a call to vm_page_pqbatch_submit, which will
trip on the KASSERT.

Submitted by: mlaier
Reviewed by: markj, rlibby
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D28563


# 5c18744e 10-Feb-2021 Mark Johnston <markj@FreeBSD.org>

vm: Honour the "noreuse" flag to vm_page_unwire_managed()

This flag indicates that the page should be enqueued near the head of
the inactive queue, skipping the LRU queue. It is used when unwiring
pages from the buffer cache following direct I/O or after I/O when
POSIX_FADV_NOREUSE or _DONTNEED advice was specified, or when
sendfile(SF_NOCACHE) completes. For the direct I/O and sendfile cases
we only enqueue the page if we decide not to free it, typically because
it's mapped.

Pass "noreuse" through to vm_page_release_toq() so that we actually
honour the desired LRU policy for these scenarios.

Reported by: bdrewery
Reviewed by: alc, kib
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D28555


# 81846def 27-Dec-2020 Mark Johnston <markj@FreeBSD.org>

vm: Fix some bugs in the page busying code

In vm_page_busy_acquire(), load the object pointer using
atomic_load_ptr() as we do elsewhere. Per the comment, the object
identity must be consistent across sleeps.

In vm_page_grab_sleep(), pass the correct pindex to
_vm_page_busy_sleep(). The pindex is used to re-check the page's
identity before going to sleep. In particular, vm_page_grab_sleep() is
used in unlocked grab, so the object lock is not necessarily held when
verifying the page's identity, and the pindex may change if the page is
moved, or freed and re-allocated. I believe this can result in spurious
VM_PAGER_FAILs from vm_page_grab_valid_unlocked() or early termination
of vm_page_grab_pages_unlocked().

In vm_page_grab_pages(), pass the correct pindex to
vm_page_grab_sleep(). Otherwise I believe vm_page_grab_pages() will
effectively spin when attempting to busy a busy page after the first
index in the range.

Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27607


# 1fea4b25 19-Nov-2020 Mark Johnston <markj@FreeBSD.org>

Wrap a long line in vm_pqbatch_process_page()


# 9e3e7376 19-Nov-2020 Mark Johnston <markj@FreeBSD.org>

Micro-optimize vm_page_pqbatch_submit()

Avoid calling vm_page_domain() twice.

Discussed with: alc (in D27207)


# 431fb8ab 18-Nov-2020 Mark Johnston <markj@FreeBSD.org>

vm_phys: Try to clean up NUMA KPIs

It can be useful for code outside the VM system to look up the NUMA domain
of a page backing a virtual or physical address, specifically when
creating NUMA-aware data structures. We have _vm_phys_domain() for
this, but the leading underscore implies that it's an internal function,
and vm_phys.h has dependencies on a number of other headers.

Rename vm_phys_domain() to vm_page_domain(), and _vm_phys_domain() to
vm_phys_domain(). Make the latter an inline function.

Add _vm_phys.h and define struct vm_phys_seg there so that it's easier
to use in other headers. Include it from vm_page.h so that
vm_page_domain() can be defined there.

Include machine/vmparam.h from _vm_phys.h since it depends directly on
some constants defined there.

Reviewed by: alc
Reviewed by: dougm, kib (earlier versions)
Differential Revision: https://reviews.freebsd.org/D27207


# 6f3b523c 14-Oct-2020 Konstantin Belousov <kib@FreeBSD.org>

Avoid dump_avail[] redefinition.

Move dump_avail[] extern declaration and inlines into a new header
vm/vm_dumpset.h. This fixes default gcc build for mips.

Reviewed by: alc, scottph
Tested by: kevans (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26741


# c2c6fb90 09-Oct-2020 Bryan Drewery <bdrewery@FreeBSD.org>

Use unlocked page lookup for inmem() to avoid object lock contention

Reviewed By: kib, markj
Submitted by: mlaier
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D26653


# 00e66147 21-Sep-2020 D Scott Phillips <scottph@FreeBSD.org>

Sparsify the vm_page_dump bitmap

On Ampere Altra systems, the sparse population of RAM within the
physical address space causes the vm_page_dump bitmap to be much
larger than necessary, increasing the size from ~8 Mib to > 2 Gib
(and overflowing `int` for the size).

Changing the page dump bitmap also changes the minidump file
format, so changes are also necessary in libkvm.

Reviewed by: jhb
Approved by: scottl (implicit)
MFC after: 1 week
Sponsored by: Ampere Computing, Inc.
Differential Revision: https://reviews.freebsd.org/D26131


# ab041f71 21-Sep-2020 D Scott Phillips <scottph@FreeBSD.org>

Move vm_page_dump bitset array definition to MI code

These definitions were repeated by all architectures, with small
variations. Consolidate the common definitions in machine
independent code and use bitset(9) macros for manipulation. Many
opportunities for deduplication remain in the machine dependent
minidump logic. The only intended functional change is increasing
the bit index type to vm_pindex_t, allowing the indexing of pages
with address of 8 TiB and greater.
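
The 8 TiB figure is consistent with a 32-bit signed index and 4 KiB
pages. Both are assumptions made for this worked example, and the
helper names are hypothetical. 8 TiB is 2^43 bytes, so the first page
at that address has index 2^43 / 2^12 = 2^31, one past INT_MAX.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumed 4 KiB pages */

/* Bitmap index of the page containing physical address pa. */
static uint64_t
page_index(uint64_t pa)
{
	return (pa >> PAGE_SHIFT);
}

/* True if idx is representable as a non-negative int. */
static bool
index_fits_int(uint64_t idx)
{
	return (idx <= (uint64_t)INT_MAX);
}
```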

Reviewed by: kib, markj
Approved by: scottl (implicit)
MFC after: 1 week
Sponsored by: Ampere Computing, Inc.
Differential Revision: https://reviews.freebsd.org/D26129


# 89d2fb14 08-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

Add interruptible variant of vm_wait(9), vm_wait_intr(9).

Also add msleep flags argument to vm_wait_doms(9).

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24652


# a2d704d1 02-Sep-2020 Mark Johnston <markj@FreeBSD.org>

Avoid unnecessary object locking in vm_page_grab_pages_unlocked().

We were needlessly acquiring the object lock to call
vm_page_grab_pages() even when all of the requested pages were looked up
locklessly. Fix that, stop testing for count == 0 in
vm_page_grab_pages(), and add assertions to help catch this kind of
mistake.

Reported by: cem
Reviewed by: alc, cem, dougm, jeff
Differential Revision: https://reviews.freebsd.org/D26304


# 411096d0 25-Aug-2020 Mark Johnston <markj@FreeBSD.org>

Permit vm_page_wire() to be called on pages not belonging to an object.

For such pages ref_count is effectively a consumer-managed field, but
there is no harm in calling vm_page_wire() on them.
vm_page_unwire_noq() handles them as well. Relax the vm_page_wire()
assertions to permit this case which is triggered by some out-of-tree
code. [1]

Also guard a conditional assertion with INVARIANTS. Otherwise the
conditions are evaluated even though the result is unused. [2]

Reported by: bz, cem [1], kib [2]
Reviewed by: dougm, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26173


# b7883452 11-Aug-2020 Conrad Meyer <cem@FreeBSD.org>

Back out unrelated change

Reported by: kib, markj
X-MFC-With: r364129


# 0292c54b 11-Aug-2020 Conrad Meyer <cem@FreeBSD.org>

Add support for multithreading the inactive queue pageout within a domain.

In very high throughput workloads, the inactive scan can become overwhelmed
as you have many cores producing pages and a single core freeing. Since
Mark's introduction of batched pagequeue operations, we can now run multiple
inactive threads working on independent batches.

To avoid confusing the pid and other control algorithms, I (Jeff) do this in
a mpi-like fan out and collect model that is driven from the primary page
daemon. It decides whether the shortfall can be overcome with a single
thread and if not dispatches multiple threads and waits for their results.

The heuristic is based on timing the pageout activity and averaging a
pages-per-second variable which is exponentially decayed. This is visible in
sysctl and may be interesting for other purposes.
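
One common way to keep such an exponentially decayed average is
sketched below. The commit message does not give the actual formula or
weights, so DECAY_SHIFT, pps_sample() and pps_update() are purely
illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define DECAY_SHIFT 2	/* hypothetical: each new sample has weight 1/4 */

/* Pages per second for one timed pageout pass. */
static int64_t
pps_sample(int64_t pages, int64_t elapsed_us)
{
	assert(elapsed_us > 0);
	return (pages * 1000000 / elapsed_us);
}

/* Move the running average a fraction of the way toward the sample. */
static int64_t
pps_update(int64_t avg, int64_t sample)
{
	return (avg + (sample - avg) / (1 << DECAY_SHIFT));
}
```

Older samples lose weight geometrically, so the average follows recent
pageout throughput without reacting fully to a single outlier.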

I (Jeff) have verified that this does indeed double our paging throughput
when used with two threads. With four we tend to run into other contention
problems. For now I would like to commit this infrastructure with only a
single thread enabled.

The number of worker threads per domain can be controlled with the
'vm.pageout_threads_per_domain' tunable.

Submitted by: jeff (earlier version)
Discussed with: markj
Tested by: pho
Sponsored by: probably Netflix (based on contemporary commits)
Differential Revision: https://reviews.freebsd.org/D21629
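
Illustrative sketch, not part of the commit: the pages-per-second
heuristic described above can be pictured as an exponentially decayed
average. The helper name pps_update(), the microsecond time base, and the
1/4 weight given to each new sample are assumptions made for this example;
the commit message does not state the actual constants.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed weight: each new sample contributes 1/4 of the average. */
#define	PPS_DECAY_SHIFT	2

/*
 * Fold one timed pageout pass into a running pages-per-second average.
 * Hypothetical helper; names and constants are not from the kernel source.
 */
static uint64_t
pps_update(uint64_t avg, uint64_t pages, uint64_t elapsed_us)
{
	uint64_t sample;

	/* A pass that took no measurable time carries no rate information. */
	if (elapsed_us == 0)
		return (avg);
	/* Guard the multiplication below against overflow. */
	assert(pages <= UINT64_MAX / 1000000);
	sample = pages * 1000000 / elapsed_us;
	/* avg = avg * 3/4 + sample * 1/4, using shifts instead of division. */
	return (avg - (avg >> PPS_DECAY_SHIFT) + (sample >> PPS_DECAY_SHIFT));
}
```

Older samples lose weight geometrically, so a short burst shifts the
average without replacing it outright, which is the property a control
loop would want from such a figure.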


# efec381d 04-Aug-2020 Mark Johnston <markj@FreeBSD.org>

Remove most lingering references to the page lock in comments.

Finish updating comments to reflect new locking protocols introduced
over the past year. In particular, vm_page_lock is now effectively
unused.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25868


# 958d8f52 29-Jul-2020 Mark Johnston <markj@FreeBSD.org>

Remove the volatile qualifier from busy_lock.

Use atomic(9) to load the lock state. Some places were doing this
already, so it was inconsistent. In initialization code, the lock state
is still initialized with plain stores.

Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25861


# 782ebde5 27-Jul-2020 Mark Johnston <markj@FreeBSD.org>

vm_page_free_invalid(): Relax the xbusy assertion.

vm_page_assert_xbusied() asserts that the busying thread is the current
thread. For some uses of vm_page_free_invalid() (e.g., error handling
in vnode_pager_generic_getpages_done()), this condition might not hold.

Reported by: Jenkins via trasz
Reviewed by: chs, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25828


# 4dfa06e1 17-Jul-2020 Chuck Silvers <chs@FreeBSD.org>

Add a new function vm_page_free_invalid() for freeing invalid pages
that might be wired. If the page is wired then it cannot be freed now,
but the thread that eventually unwires it will free it at that point.

Reviewed by: markj, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D25430


# ffc568ba 09-Jul-2020 Scott Long <scottl@FreeBSD.org>

Revert r362998, r326999 while a better compatibility strategy is devised.


# b302c2e5 07-Jul-2020 Scott Long <scottl@FreeBSD.org>

Migrate the feature of excluding RAM pages to use "excludelist"
as its nomenclature.

MFC after: 1 week


# ee06cffc 26-Jun-2020 Konstantin Belousov <kib@FreeBSD.org>

vm_page_free_prep(): correct description of the required page and object state.

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D25482


# a9ea09e5 28-Apr-2020 Mark Johnston <markj@FreeBSD.org>

Re-check for wirings after busying the page in vm_page_release_locked().

A concurrent unlocked lookup can wire the page after
vm_page_release_locked() releases the last wiring, in which case
vm_page_release_locked() must not free the page. Once the xbusy lock is
acquired, that, the object lock and the fact that the page is unmapped
ensure that the wire count cannot increase, so re-check for new wirings
after the page is xbusied.

Update the comment above vm_page_wired() to reflect the new
synchronization rules.

Reported by: glebius
Reviewed by: alc, jeff, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D24592


# 70e68b19 20-Apr-2020 Mark Johnston <markj@FreeBSD.org>

Handle trashed queue pointers in vm_page_acquire_unlocked().

vm_page_acquire_unlocked() relies on type-stability of vm_page
structures and assumes that the listq linkage pointers always point to a
vm_page or are NULL. QUEUE_MACRO_DEBUG_TRASH breaks that assumption, so
add an explicit check for a trashed queue pointer before dereferencing.

Reported and tested by: pho
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D24472


# adc03881 30-Mar-2020 Bryan Drewery <bdrewery@FreeBSD.org>

Remove dead code leftover from r331018.

Sponsored by: Dell EMC


# a7c55b3e 27-Mar-2020 Konstantin Belousov <kib@FreeBSD.org>

ddb show pginfo: print the page's reference value in hex.

It is more useful this way after the VPRC_ flags were introduced.

Sponsored by: The FreeBSD Foundation


# d1105e94 11-Mar-2020 Jeff Roberson <jeff@FreeBSD.org>

Check for busy or wired in vm_page_relookup(). Some callers will only keep
a page wired and expect it to still be present.

Reported by: delphij@FreeBSD.org
Reviewed by: kib


# d869a17e 06-Mar-2020 Mark Johnston <markj@FreeBSD.org>

Use COUNTER_U64_DEFINE_EARLY() in places where it simplifies things.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23978


# 1ed42f6f 01-Mar-2020 Mark Johnston <markj@FreeBSD.org>

Avoid doubly wiring a newly allocated page in vm_page_grab_valid().

This fixes a regression from r358363.

Reported by: manu, jbeich
Tested by: jbeich


# 6be21eb7 28-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Provide a lock free alternative to resolve bogus pages. This is not likely
to be much of a perf win, just a nice code simplification.

Reviewed by: markj, kib
Differential Revision: https://reviews.freebsd.org/D23866


# 3f39f80a 28-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Support the NOCREAT flag for grab_valid_unlocked.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D23865


# c49be4f1 26-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Add unlocked grab* function variants that use lockless radix code to
look up pages. These variants will fall back to their locked counterparts
if the page is not present.

Discussed with: kib, markj
Differential Revision: https://reviews.freebsd.org/D23449


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is a non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT.

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718


# eaa17d42 22-Feb-2020 Ryan Libby <rlibby@FreeBSD.org>

sys/vm: quiet -Wwrite-strings

Discussed with: kib
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D23796


# 6c5f36ff 19-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Eliminate some unnecessary uses of UMA_ZONE_VM. Only zones involved in
virtual address or physical page allocation need to be marked with this
flag.

Reviewed by: markj
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D23712


# f212367b 16-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Refactor _vm_page_busy_sleep to reduce the delta between the various
sleep routines and introduce a variant that supports lockless sleep.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23612


# 23ed568c 14-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

vm: remove no longer needed atomic_load_ptr casts


# 06ef6052 13-Feb-2020 Mark Johnston <markj@FreeBSD.org>

Fix handling of WAITFAIL in vm_page_grab() and vm_page_grab_pages().

After sleeping through a memory shortage, we must return NULL rather
than retry.

Discussed with: jeff
Reported by: pho
Sponsored by: The FreeBSD Foundation


# ee9e43f8 04-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Add an explicit busy state for free pages. This improves behavior with
potential bugs that access freed pages as well as providing a path
towards lockless page lookup.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23444


# f0a273c0 01-Feb-2020 Mark Johnston <markj@FreeBSD.org>

Remove a couple of lingering usages of the page lock.

Update vm_page_scan_contig() and vm_page_reclaim_run() to stop using
vm_page_change_lock(). It has no use after r356157. Remove
vm_page_change_lock() now that it has no users.

Remove an unnecessary check for wirings in vm_page_scan_contig(), which
was previously checking twice. The check is racy until
vm_page_reclaim_run() ensures that the page is unmapped, so one check is
sufficient.

Reviewed by: jeff, kib (previous versions)
Tested by: pho (previous version)
Differential Revision: https://reviews.freebsd.org/D23279


# d6e13f3b 19-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Don't hold the object lock while calling getpages.

The vnode pager does not want the object lock held. Moving this out allows
further object lock scope reduction in callers. While here add some missing
paging in progress calls and an assert. The object handle is now protected
explicitly with pip.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23033


# a81c400e 15-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Simplify VM and UMA startup by eliminating boot pages. Instead use careful
ordering to allocate early pages in the same way boot pages were but only
as needed. After the KVA allocator has started up we allocate the KVA that
we consumed during boot. This also makes the boot pages freeable since they
have vm_page structures allocated with the rest of memory.

Parts of this patch were written and tested by markj.

Reviewed by: glebius, markj
Differential Revision: https://reviews.freebsd.org/D23102


# 9328cbc0 10-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Always multiply vm.pgcache_zone_max by the number of CPUs, and rename it
accordingly. The tunable controls the size of the per-CPU vm page cache.
Previously the value was split among all CPUs in the system, so
configuring the same value on machines with different CPU counts
yielded a different cache size for each CPU.

Reviewed by: markj
Obtained from: Netflix
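
Illustrative sketch, not part of the commit: the arithmetic behind the
change, reduced to two helpers. The function names are hypothetical, and
the sketch ignores per-domain and per-pool details of the real zones.

```c
#include <assert.h>
#include <stdint.h>

/* Old behaviour: one system-wide limit divided among the CPUs. */
static uint32_t
pgcache_per_cpu_old(uint32_t zone_max, uint32_t ncpus)
{
	assert(ncpus > 0);
	return (zone_max / ncpus);
}

/* New behaviour: a per-CPU limit scaled up to a system-wide total. */
static uint64_t
pgcache_total_new(uint32_t zone_max_pcpu, uint32_t ncpus)
{
	assert(ncpus > 0);
	return ((uint64_t)zone_max_pcpu * ncpus);
}
```

With the old scheme a setting of 1024 leaves 256 pages per CPU on a 4-CPU
machine but only 16 on a 64-CPU machine; with the new scheme the per-CPU
share is whatever the administrator configured, regardless of CPU count.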


# 79c9f942 05-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Fix uma boot pages calculations on NUMA machines that also don't have
MD_UMA_SMALL_ALLOC. This is unusual but not impossible. Fix the alignment
of zones while here. This was already correct because uz_cpu strongly
aligned the zone structure but the specified alignment did not match
reality and involved redundant defines.

Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D23046


# 758b2c02 29-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Restore a vm_page_wired() check in vm_page_mvqueue() after r356156.

We now set PGA_DEQUEUE on a managed page when it is wired after
allocation, and vm_page_mvqueue() ignores pages with this flag set,
ensuring that they do not end up in the page queues. However, this is
not sufficient for managed fictitious pages or pages managed by the
TTM. In particular, the TTM makes use of the plinks.q queue linkage
fields for its own purposes.

PR: 242961
Reported and tested by: Greg V <greg@unrelenting.technology>


# 9b888dd9 29-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Clear queue op flags in vm_page_mvqueue().

This fixes a regression in r356155, introduced at the last minute. In
particular, we must clear PGA_REQUEUE_HEAD before inserting into any
queue besides PQ_INACTIVE since that operation is implemented only for
PQ_INACTIVE.

Reported by: pho, Jenkins via lwhsu


# 727150ff 28-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Remove some unused functions.

The previous series of patches orphaned some vm_page functions, so
remove them.

Reviewed by: dougm, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22886


# 9f5632e6 28-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Remove page locking for queue operations.

With the previous reviews, the page lock is no longer required in order
to perform queue operations on a page. It is also no longer needed in
the page queue scans. This change effectively eliminates remaining uses
of the page lock and also the false sharing caused by multiple pages
sharing a page lock.

Reviewed by: jeff
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22885


# b7f30bff 28-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Generalize lazy dequeue logic for wired pages.

Some recent work aims to remove the use of the page lock for
synchronizing updates to page queue state. This change adds a mechanism
to preserve the existing behaviour of lazily dequeuing wired pages,
which was previously synchronized using the page lock.

Handle this by setting PGA_DEQUEUE when a managed page's wire count
transitions from 0 to 1. When the page daemon encounters a page with a
flag in PGA_QUEUE_OP_MASK set, it creates a batch queue entry for that
page, but in so doing it does not modify the page itself and thus racing
with a concurrent free of the page is harmless. The flag is advisory;
the page daemon still checks for wirings after acquiring the object and
page xbusy locks.

vm_page_unwire_managed() now clears PGA_DEQUEUE on a 1->0 transition.
It must do this before dropping the reference to avoid a use-after-free
but also handles races with concurrent wirings to ensure that
PGA_DEQUEUE is not left unset on a wired page.

Reviewed by: jeff
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22882
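
Illustrative sketch, not part of the commit: one way to picture the flag
handling on wire-count transitions, written with C11 atomics rather than
the kernel's atomic(9) primitives. The struct layout, the helper names,
and the flag value are assumptions; the real ref_count also carries
non-wiring references and flag bits that are omitted here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define	PGA_DEQUEUE	0x1u	/* illustrative value, not the kernel's */

struct page {
	atomic_uint	ref_count;	/* wirings only, in this sketch */
	atomic_uint	aflags;
};

static void
page_wire(struct page *m)
{
	/* On the 0 -> 1 transition, request a lazy dequeue. */
	if (atomic_fetch_add(&m->ref_count, 1) == 0)
		atomic_fetch_or(&m->aflags, PGA_DEQUEUE);
}

static void
page_unwire(struct page *m)
{
	unsigned int old;

	old = atomic_load(&m->ref_count);
	do {
		assert(old > 0);
		if (old > 1) {
			/* Still wired afterwards: make sure the flag is set. */
			atomic_fetch_or(&m->aflags, PGA_DEQUEUE);
		} else {
			/* 1 -> 0: clear the flag before dropping the reference. */
			atomic_fetch_and(&m->aflags, ~PGA_DEQUEUE);
		}
	} while (!atomic_compare_exchange_weak(&m->ref_count, &old, old - 1));
}

static bool
page_dequeue_requested(struct page *m)
{
	return ((atomic_load(&m->aflags) & PGA_DEQUEUE) != 0);
}
```

The retry path matters: if a concurrent wiring arrives after the flag was
cleared, the compare-exchange fails, the loop observes a count above one,
and the flag is set again before the reference is dropped.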


# f3f38e25 28-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Start implementing queue state updates using fcmpset loops.

This is in preparation for eliminating the use of the vm_page lock for
protecting queue state operations.

Introduce the vm_page_pqstate_commit_*() functions. These functions act
as helpers around vm_page_astate_fcmpset() and are specialized for
specific types of operations. vm_page_pqstate_commit() wraps these
functions.

Convert a number of routines to use these new helpers. Use
vm_page_release_toq() in vm_page_unwire() and vm_page_release() to
atomically release a wiring reference and release the page into a queue.
This has the side effect that vm_page_unwire() will leave the page in
the active queue if it is already present there.

Convert the page queue scans to use the new helpers. Simplify
vm_pageout_reinsert_inactive(), which requeues pages that were found to
be busy during an inactive queue scan, to avoid duplicating the work of
vm_pqbatch_process_page(). In particular, if PGA_REQUEUE or
PGA_REQUEUE_HEAD is set, let that be handled during batch processing.

Reviewed by: jeff
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22770
Differential Revision: https://reviews.freebsd.org/D22771
Differential Revision: https://reviews.freebsd.org/D22772
Differential Revision: https://reviews.freebsd.org/D22773
Differential Revision: https://reviews.freebsd.org/D22776


# 5541eb27 27-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Remove some stale comments from the page allocator.

Since r352110 the page lock is not required to wire pages in any
context.


# ff5ce8a7 26-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Fix a pair of bugs introduced in r356002. When we reclaim physical pages we
allocate them with VM_ALLOC_NOOBJ which means they are not busy. For now
move the busy assert for the new page in vm_page_replace into the public
api and out of the private api used by contig reclaim. Fix another issue
where we would leak busy if the page could not be removed from pmap.

Reported by: pho
Discussed with: markj


# 7e1b379e 24-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Don't unnecessarily relock the vm object after sleeps. This results in a
surprising amount of object contention on loop restarts in fault.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D22821


# 3cf3b4e6 21-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Make page busy state deterministic on free. Pages must be xbusy when
removed from objects including calls to free. Pages must not be xbusy
when freed and not on an object. Strengthen assertions to match these
expectations. In practice very little code had to change busy handling
to meet these rules but we can now make stronger guarantees to busy
holders and avoid conditionally dropping busy in free.

Refine vm_page_remove() and vm_page_replace() semantics now that we have
stronger guarantees about busy state. This removes redundant and
potentially problematic code that has proliferated.

Discussed with: markj
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D22822


# d07c5718 21-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Fix VPO_UNMANAGED handling in vm_page_reclaim_run() after r353540.

When allocating a replacement page we must clear VPO_UNMANAGED since we
only ever reclaim pages from managed objects. vm_page_replace() does
not handle this for us.

Sprinkle some assertions to help catch this sort of issue.

Reported by: pho
Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22868


# a8081778 14-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Add a deferred free mechanism for freeing swap space that does not require
an exclusive object lock.

Previously swap space was freed on a best effort basis when a page that
had valid swap was dirtied, thus invalidating the swap copy. This may be
done inconsistently and requires the object lock which is not always
convenient.

Instead, track when swap space is present. The first dirty is responsible
for deleting space or setting PGA_SWAP_FREE which will trigger background
scans to free the swap space.

Simplify the locking in vm_fault_dirty() now that we can reliably identify
the first dirty.

Discussed with: alc, kib, markj
Differential Revision: https://reviews.freebsd.org/D22654


# af009714 14-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Handle pagein clustering in vm_page_grab_valid() so that it can be used by
exec_map_first_page(). This will also enable pagein clustering for other
interested consumers (tmpfs, md, etc).

Discussed with: alc
Approved by: kib
Differential Revision: https://reviews.freebsd.org/D22731


# 5cff1f4d 10-Dec-2019 Mark Johnston <markj@FreeBSD.org>

Introduce vm_page_astate.

This is a 32-bit structure embedded in each vm_page, consisting mostly
of page queue state. The use of a structure makes it easy to store a
snapshot of a page's queue state in a stack variable and use cmpset
loops to update that state without requiring the page lock.

This change merely adds the structure and updates references to atomic
state fields. No functional change intended.

Reviewed by: alc, jeff, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22650
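
Illustrative sketch, not part of the commit: how a 32-bit state structure
lets a caller snapshot queue state on the stack and update it with a
compare-and-set loop. The field layout and helper names are assumptions
for this example, and C11 atomics stand in for the kernel's atomic(9)
primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumed layout: flags, queue index and activity count in one word. */
struct astate_fields {
	uint16_t	flags;
	uint8_t		queue;
	uint8_t		act_count;
};

union astate {
	struct astate_fields	f;
	uint32_t		word;
};

static_assert(sizeof(union astate) == sizeof(uint32_t),
    "astate must fit in a single 32-bit word");

struct page {
	_Atomic uint32_t	a;
};

/* Take a snapshot of the whole state with one atomic load. */
static union astate
page_astate_load(struct page *m)
{
	union astate s;

	s.word = atomic_load(&m->a);
	return (s);
}

/* Change the queue index without a lock, retrying if the word changed. */
static void
page_set_queue(struct page *m, uint8_t queue)
{
	union astate old, new;

	old = page_astate_load(m);
	do {
		new = old;		/* start from the latest snapshot */
		new.f.queue = queue;
	} while (!atomic_compare_exchange_weak(&m->a, &old.word, new.word));
}
```

Because every field lives in the same word, a failed compare-exchange
means some field changed, and the loop recomputes the new value from the
refreshed snapshot instead of overwriting a concurrent update.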


# cff8481d 07-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

It is safe to wire a page while the object is busy.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D22636


# 2306558c 07-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

It is now safe to rename a page that is still on a queue. Allowing this
is necessary for a forthcoming patch.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D22636


# fb1d575c 07-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Reduce duplication in grab functions by providing allocflags based inlines.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D22635


# caef3e12 06-Dec-2019 Justin Hibbits <jhibbits@FreeBSD.org>

powerpc/pmap: NUMA-ize vm_page_array on powerpc

Summary:
This matches r351198 from amd64. This only applies to AIM64 and Book-E.
On AIM64 it short-circuits with one domain, to behave similarly to the
existing code. Otherwise it will allocate 16MB huge pages to hold the page
array, across all NUMA domains. On the first domain it will shift the
page array base up, to "upper-align" the page array in that domain, so
as to reduce the number of pages from the next domain appearing in this
domain. After the first domain, subsequent domains will be allocated in
full 16MB pages, until the final domain, which can be short. This means
some inner domains may have pages accounted in earlier domains.

On Book-E the page array is set up at MMU bootstrap time so that it's
always mapped in TLB1, on both 32-bit and 64-bit. This reduces the TLB0
overhead for touching the vm_page_array, which reduces up to one TLB
miss per array access.

Since page_range (vm_page_startup()) is no longer used on Book-E but is on
32-bit AIM, mark the variable as potentially unused, rather than using a
nasty #if defined() list.

Reviewed by: luporl
Differential Revision: https://reviews.freebsd.org/D21449


# 9b78b1f4 02-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Use a precise bit count for the slab free items in UMA. This significantly
shrinks embedded slab structures.

Reviewed by: markj, rlibby (prior version)
Differential Revision: https://reviews.freebsd.org/D22584


# b631c36f 24-Nov-2019 Konstantin Belousov <kib@FreeBSD.org>

Record part of the owner struct thread pointer into busy_lock.

Record as many bits from curthread into busy_lock as fit. The low bits
of the struct thread * representation are zero due to struct and zone
alignment, and they leave space for the busy flags (except perhaps for
the statically allocated thread0). The upper bits are not very
interesting for assertions, and in most practical situations the
recorded value should allow the owner to be identified manually with
certainty.

Assert that unbusy is performed by the owner, except in a few places
where unbusy is done in an I/O completion handler. For this case, add
_unchecked variants of the asserts and unbusy primitives.

Reviewed by: markj (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D22298
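
Illustrative sketch, not part of the commit: packing part of an owner
pointer into a lock word whose low bits are reserved for flags. The
32-bit lock width, the three-bit flag mask, and the helper names are
assumptions for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed: the low three bits of the lock word hold busy flags. */
#define	BUSY_FLAG_MASK	0x7u

/* Pack as many bits of the owner pointer as fit above the flag bits. */
static uint32_t
busy_encode(const void *owner, uint32_t flags)
{
	/* Alignment guarantees the pointer's low bits are free for flags. */
	assert(((uintptr_t)owner & BUSY_FLAG_MASK) == 0);
	return (((uint32_t)(uintptr_t)owner & ~BUSY_FLAG_MASK) |
	    (flags & BUSY_FLAG_MASK));
}

/* Compare only the recorded bits; the truncated upper bits are ignored. */
static bool
busy_owned_by(uint32_t lock, const void *owner)
{
	return ((lock & ~BUSY_FLAG_MASK) ==
	    ((uint32_t)(uintptr_t)owner & ~BUSY_FLAG_MASK));
}
```

The check can in principle be fooled by two owners whose addresses differ
only in the discarded upper bits, which is why the recorded value suits
assertions and debugging rather than anything stronger.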


# bf0d60af 22-Nov-2019 Mark Johnston <markj@FreeBSD.org>

Update the checks in vm_page_zone_import().

- Remove the cnt == 1 check. UMA passes cnt == 1 when it has disabled
per-CPU caching. In this case we might as well just allocate a single
page and return it to the caller, since the caller is going to do
exactly that anyway if the UMA cache allocation attempt fails.
- Don't replenish caches if the domain is severely short on free pages.
With large buckets we may otherwise quickly exacerbate a situation
where the page daemon is failing to keep up.
- Don't replenish caches if the calling thread belongs to the page
daemon, which should avoid creating extra memory pressure when it is
trying to free memory. Virtually all such allocations will occur in
the context of laundering, where the laundry thread must allocate
slabs for various swap and I/O-related UMA zones.

Reviewed by: kib
Discussed with: alc, jeff
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22394


# 003cf08b 22-Nov-2019 Mark Johnston <markj@FreeBSD.org>

Revise the page cache size policy.

In r353734 the use of the page caches was limited to systems with a
relatively large amount of RAM per CPU. This was to mitigate some
issues reported with the system not able to keep up with memory pressure
in cases where it had been able to do so prior to the addition of the
direct free pool cache. This change re-enables those caches.

The change modifies uma_zone_set_maxcache(), which was introduced
specifically for the page cache zones. Rather than using it to limit
only the full bucket cache, have it also set uz_count_max to provide an
upper bound on the per-CPU cache size that is consistent with the number
of items requested. Remove its return value since it has no use.

Enable the page cache zones unconditionally, and limit them to 0.1% of
the domain's pages. The limit can be overridden by the
vm.pgcache_zone_max tunable as before.

Change the item size parameter passed to uma_zcache_create() to the
correct size, and stop setting UMA_ZONE_MAXBUCKET. This allows the page
cache buckets to be adaptively sized, like the rest of UMA's caches.
This also causes the initial bucket size to be small, so only systems
which benefit from large caches will get them.

Reviewed by: gallatin, jeff
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22393


# 09a65f9f 20-Nov-2019 Andrew Turner <andrew@FreeBSD.org>

As with r354905 use uint16_t to store aflags on the stack and as function
arguments as the aflags size in vm_page_t has increased.

Sponsored by: DARPA, AFRL


# ad216bc1 20-Nov-2019 Andrew Turner <andrew@FreeBSD.org>

Use atomic_load_16 to load aflags as it's a uint16_t after r354820.

Sponsored by: DARPA, AFRL


# 7f935055 19-Nov-2019 Jeff Roberson <jeff@FreeBSD.org>

Remove unnecessary object locking from the vnode pager. Recent changes to
busy/valid/dirty locking make these acquires redundant.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D22186


# 3a2ba997 18-Nov-2019 Mark Johnston <markj@FreeBSD.org>

Widen the vm_page aflags field to 16 bits.

We are now out of aflags bits, whereas the "flags" field only makes use
of five of its sixteen bits, so narrow "flags" to eight bits. I have no
intention of adding a new aflag in the near future, but would like to
combine the aflags, queue and act_count fields into a single atomically
updated word. This will allow vm_page_pqstate_cmpset() to become much
simpler and is a step towards eliminating the use of the page lock array
in updating per-page queue state.

The change modifies the layout of struct vm_page, so bump
__FreeBSD_version.

Reviewed by: alc, dougm, jeff, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22397


# c95f8ed8 07-Nov-2019 Mark Johnston <markj@FreeBSD.org>

Drop Giant before sleeping on a busy page.

Before the page busy code was converted to make direct use of
sleepqueues, this was handled by _sleep().

Reported by: glebius
Reviewed by: kib
Sponsored by: The FreeBSD Foundation


# be801aaa 06-Nov-2019 Mark Johnston <markj@FreeBSD.org>

Fix a race in release_page().

Since r354156 we may call release_page() without the page's object lock
held, specifically following the page copy during a CoW fault.
release_page() must therefore unbusy the page only after scheduling the
requeue, to avoid racing with a free of the page. Previously, the
object lock prevented this race from occurring.

Add some assertions that were helpful in tracking this down.

Reported by: pho, syzkaller
Tested by: pho
Reviewed by: alc, jeff, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22234


# afa7e88a 30-Oct-2019 Konstantin Belousov <kib@FreeBSD.org>

vm_page_wire_mapped: explain why failure does not affect correctness.

Reviewed by: markj (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D22196


# 67d0e293 29-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

Replace OBJ_MIGHTBEDIRTY with a system using atomics. Remove the TMPFS_DIRTY
flag and use the same system.

This enables further fault locking improvements by allowing more faults to
proceed with a shared lock.

Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22116


# 0dc59d76 24-Oct-2019 Andrew Gallatin <gallatin@FreeBSD.org>

Add a tunable to set the pgcache zone's maxcache

When it is set to 0 (the default), a heavy Netflix-style web workload
suffers from heavy lock contention on the vm page free queue called from
vm_page_zone_{import,release}() as the buckets are frequently drained.
When setting the maxcache, this contention goes away.

We should eventually try to autotune this, as well as make this
zone eligible for uma_reclaim().

Reviewed by: alc, markj
Not Objected to by: jeff
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D22112


# e6f1a580 23-Oct-2019 Mark Johnston <markj@FreeBSD.org>

Verify identity after checking for WAITFAIL in vm_page_busy_acquire().

A caller that does not guarantee that a page's identity won't change
while sleeping for a busy lock must specify either NOWAIT or WAITFAIL.

Reported by: syzkaller
Reviewed by: alc, kib
Discussed with: jeff
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22124


# d307bdcc 18-Oct-2019 Mark Johnston <markj@FreeBSD.org>

Further constrain the use of per-CPU caches for free pages.

In low memory conditions a significant number of pages may end up stuck
in the caches, and currently these caches cannot be reaped, leading to
spurious memory allocation failures and OOM kills. So:

- Take into account the fact that we may cache up to two full buckets
of pages per CPU, not just one.
- Increase the amount of RAM required per CPU to enable the caches.

This is a temporary measure until the page cache management policy is
improved.

PR: 241048
Reported and tested by: Kevin Oberman <rkoberman@gmail.com>
Reviewed by: alc, kib
Discussed with: jeff
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22040


# 01cef4ca 16-Oct-2019 Mark Johnston <markj@FreeBSD.org>

Remove page locking from pmap_mincore().

After r352110 the page lock no longer protects a page's identity, so
there is no purpose in locking the page in pmap_mincore(). Instead,
if vm.mincore_mapped is set to the non-default value of 0, re-lookup
the page after acquiring its object lock, which holds the page's
identity stable.

The change removes the last callers of vm_page_pa_tryrelock(), so
remove it.

Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21823


# fff5403f 14-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

(5/6) Move the VPO_NOSYNC to PGA_NOSYNC to eliminate the dependency on the
object lock in vm_page_set_validclean().

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21595


# 0012f373 14-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

(4/6) Protect page valid with the busy lock.

Atomics are used for page busy and valid state when the shared busy is
held. The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21594


# 205be21d 14-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

(3/6) Add a shared object busy synchronization mechanism that blocks new page
busy acquires while held.

This allows code that would need to acquire and release a very large number
of page busy locks to use the old mechanism where busy is only checked and
not held. This comes at the cost of false positives but never false
negatives which the single consumer, vm_fault_soft_fast(), handles.

Reviewed by: kib
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21592


# 8da1c098 14-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

(2/6) Don't release xbusy in vm_page_remove(), defer to vm_page_free_prep().

This persists busy state across operations like rename and replace.

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21549


# 63e97555 14-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

(1/6) Replace busy checks with acquires where it is trivial to do so.

This is the first in a series of patches that promotes the page busy field
to a first class lock that no longer requires the object lock for
consistency.

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21548


# 0ecc478b 14-Oct-2019 Leandro Lupori <luporl@FreeBSD.org>

[PPC64] Initial kernel minidump implementation

Based on POWER9BSD implementation, with all POWER9 specific code removed and
addition of new methods in PPC64 MMU interface, to isolate platform specific
code. Currently, the new methods are implemented on pseries and PowerNV
(D21643).

Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D21551


# 4090e217 07-Oct-2019 Mark Johnston <markj@FreeBSD.org>

Assert that the PGA_{WRITEABLE,EXECUTABLE} flags do not leak.

Reviewed by: alc, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D21783


# 7b1fbc42 07-Oct-2019 Mateusz Guzik <mjg@FreeBSD.org>

vm: stop trylocking page queues in vm_page_pqbatch_submit

About 11 minutes of poudriere -s -j 104 and probing on return value of
trylocks reveals that over 10% of attempts fail, which in turn means
there are more atomics performed than necessary.

Trylocking was there to try preventing migration, but it's not very likely
to happen if the lock is uncontested.

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21925


# 7cc833c5 27-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Fix a race in vm_page_swapqueue().

vm_page_swapqueue() atomically transitions a page between queues. To do
so, it must hold the page queue lock for the old queue. However, once
the queue index has been updated, the queue lock no longer protects the
page's queue state. Thus, we must speculatively remove the page from
the old queue before committing the queue state update, and roll back if
the update fails.

Reported and tested by: pho
Reviewed by: kib
Sponsored by: Intel, Netflix
Differential Revision: https://reviews.freebsd.org/D21791
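
The speculative removal and rollback described in the vm_page_swapqueue()
fix above can be sketched in userspace C11 as follows. This is only an
illustration, not the kernel code: struct page, struct pagequeue and
page_swapqueue() are hypothetical stand-ins, locking is not modeled, and the
caller is assumed to hold the old queue's lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct pagequeue {
	int len;			/* number of pages physically linked */
};

struct page {
	_Atomic uint8_t queue;		/* logical queue index */
	bool linked;			/* physically linked on a queue? */
};

/*
 * Move a page from queue index "oldidx" to "newidx".  The caller is
 * assumed to hold the lock of "oldq" (locking is not modeled here).
 * Returns true if the queue index update was committed.
 */
bool
page_swapqueue(struct page *m, struct pagequeue *oldq, uint8_t oldidx,
    uint8_t newidx)
{
	uint8_t expected = oldidx;

	/* Speculatively unlink while the old queue's lock still applies. */
	assert(m->linked);
	m->linked = false;
	oldq->len--;

	/* Commit: succeeds only if the index is still "oldidx". */
	if (atomic_compare_exchange_strong(&m->queue, &expected, newidx))
		return (true);

	/* Lost a race with a concurrent update: roll back the removal. */
	m->linked = true;
	oldq->len++;
	return (false);
}
```

The removal comes first because, once the index has been changed, the old
queue's lock no longer covers the page, so unlinking afterwards would be
unsafe.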


# 2b93f779 25-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Add some counters for per-VM page events.

For now, just count batched page queue state operations.
vm.stats.page.queue_ops counts the number of batch entries that
successfully completed, while queue_nops counts entries that had no
effect, which occurs when the queue operation had been completed before
the batch entry was processed.

Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: Intel, Netflix
Differential Revision: https://reviews.freebsd.org/D21782


# 923da43e7 16-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Fix a race in vm_page_dequeue_deferred_free() after r352110.

This function loaded the page's queue index before setting PGA_DEQUEUE.
In this window the page daemon may have deactivated the page, updating
its queue index. Make the operation atomic using vm_page_pqstate_cmpset();
the page daemon will not modify the page once it observes that PGA_DEQUEUE
is set.

Reported and tested by: pho
Reviewed by: alc, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21639


# 47aef898 16-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Fix a page leak in vm_page_reclaim_run().

After r352110 the attempt to remove mappings of the page being replaced
may fail if the page is wired. In this case we must free the replacement
page.

Reviewed by: alc, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21639


# e8bcf696 16-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Revert r352406, which contained changes I didn't intend to commit.


# 41fd4b94 16-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Fix a couple of nits in r352110.

- Remove a dead variable from the amd64 pmap_extract_and_hold().
- Fix grammar in the vm_page_wire man page.

Reported by: alc
Reviewed by: alc, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21639


# c7575748 10-Sep-2019 Jeff Roberson <jeff@FreeBSD.org>

Replace redundant code with a few new vm_page_grab facilities:
- VM_ALLOC_NOCREAT will grab without creating a page.
- vm_page_grab_valid() will grab and page in if necessary.
- vm_page_busy_acquire() automates some busy acquire loops.

Discussed with: alc, kib, markj
Tested by: pho (part of larger branch)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21546


# 4cdea4a8 10-Sep-2019 Jeff Roberson <jeff@FreeBSD.org>

Use the sleepq lock rather than the page lock to protect against wakeup
races with page busy state. The object lock is still used as an interlock
to ensure that the identity stays valid. Most callers should use
vm_page_sleep_if_busy() to handle the locking particulars.

Reviewed by: alc, kib, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21255


# fee2a2fa 09-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Change synchronization rules for vm_page reference counting.

There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator. In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficient as
well. These references are protected by the page lock, which must
therefore be acquired for many per-page operations. This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.

Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter. A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held. As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.

The vm_page_wire() and vm_page_unwire() KPIs are changed. The former
requires that either the object lock or the busy lock is held. The
latter no longer has a return value and may free the page if it releases
the last reference to that page. vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold(). It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler. vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state). In particular, synchronization details are no longer
leaked into the caller.

The change excises the page lock from several frequently executed code
paths. In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock. In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.

__FreeBSD_version is bumped. The DRM ports have been updated to
accommodate the KPI changes.

Reviewed by: jeff (earlier version)
Tested by: gallatin (earlier version), pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20486
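
A minimal userspace sketch of the counter layout described above, assuming
two flag bits share a 32-bit word with the wiring count. The names
REF_BLOCKED, REF_OBJREF, page_wire_mapped() and page_unwire() and the bit
positions are illustrative assumptions, not the kernel's definitions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits sharing the word with the wiring count. */
#define REF_BLOCKED	0x40000000u	/* new mapped references refused */
#define REF_OBJREF	0x80000000u	/* the owning object's reference */
#define REF_COUNT(c)	((c) & ~(REF_BLOCKED | REF_OBJREF))

struct page {
	_Atomic uint32_t ref_count;
};

/*
 * Take a wiring on behalf of a mapping.  Fails if the page is
 * concurrently being unmapped, i.e. the "blocked" bit is set.
 */
bool
page_wire_mapped(struct page *m)
{
	uint32_t old = atomic_load(&m->ref_count);

	do {
		if ((old & REF_BLOCKED) != 0)
			return (false);
	} while (!atomic_compare_exchange_weak(&m->ref_count, &old, old + 1));
	return (true);
}

/*
 * Drop one wiring.  Returns true if this released the last reference
 * of any kind, in which case the caller would free the page.
 */
bool
page_unwire(struct page *m)
{
	uint32_t old = atomic_fetch_sub(&m->ref_count, 1);

	assert(REF_COUNT(old) > 0);
	return (old == 1);
}
```

Because the flags live in the same word as the count, one compare-and-swap
both checks for "blocked" and takes the reference, so no external lock is
needed for that step.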


# 7cdeaf33 03-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Add preliminary support for atomic updates of per-page queue state.

Queue operations on a page use the page lock when updating the page to
reflect the desired queue state, and the page queue lock when physically
enqueuing or dequeuing a page. Multiple pages share a given page lock,
but queue state is per-page; this false sharing results in heavy lock
contention.

Take a small step towards the use of atomic_cmpset to synchronize
updates to per-page queue state by introducing vm_page_pqstate_cmpset()
and using it in the page daemon. In the longer term the plan is to stop
using the page lock to protect page identity and rely only on the object
and page busy locks. However, since the page daemon avoids acquiring
the object lock except when necessary, some synchronization with a
concurrent free of the page is required. vm_page_pqstate_cmpset() can
be used to ensure that queue state updates are successful only if the
page is not scheduled for a dequeue, which is sufficient for the page
daemon.

Add vm_page_swapqueue(), which moves a page from one queue to another
using vm_page_pqstate_cmpset(). Use it in the active queue scan, which
does not use the object lock. Modify vm_page_dequeue_deferred() to
use vm_page_pqstate_cmpset() as well.

Reviewed by: kib
Discussed with: jeff
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21257
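
The idea behind vm_page_pqstate_cmpset() can be illustrated with a small
userspace sketch, assuming the queue index and the queue state flags are
packed into one atomic word. The layout, the QF_* names and the signature of
pqstate_cmpset() below are assumptions made for the example and do not
mirror the kernel's actual definition.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: queue index in the high byte, flags in the low. */
#define QF_DEQUEUE	0x01u		/* a dequeue has been scheduled */
#define QF_REQUEUE	0x02u		/* a requeue has been scheduled */

#define PQ_PACK(q, f)	((uint16_t)(((uint16_t)(q) << 8) | (uint8_t)(f)))
#define PQ_QUEUE(s)	((uint8_t)((s) >> 8))
#define PQ_FLAGS(s)	((uint8_t)((s) & 0xff))

/*
 * Change the queue index from "oldq" to "newq" and set "setflags".
 * The update succeeds only if the page is still on "oldq" and no
 * dequeue is pending.
 */
bool
pqstate_cmpset(_Atomic uint16_t *state, uint8_t oldq, uint8_t newq,
    uint8_t setflags)
{
	uint16_t old, next;

	/* This sketch never schedules a dequeue itself. */
	assert((setflags & QF_DEQUEUE) == 0);

	old = atomic_load(state);
	do {
		if (PQ_QUEUE(old) != oldq ||
		    (PQ_FLAGS(old) & QF_DEQUEUE) != 0)
			return (false);
		next = PQ_PACK(newq, PQ_FLAGS(old) | setflags);
	} while (!atomic_compare_exchange_weak(state, &old, next));
	return (true);
}
```

A caller that loses the race simply observes a false return and leaves the
page alone, which is the behavior the commit message describes as
sufficient for the page daemon.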


# 9d75f0dc 03-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Map the vm_page array into KVA on amd64.

r351198 allows the kernel to use domain-local memory to back the vm_page
array (up to 2MB boundaries) and reserves a separate PML4 entry for that
purpose. One consequence of that change is that the vm_page array is no
longer present in minidumps, which only adds pages mapped above
VM_MIN_KERNEL_ADDRESS.

To avoid the friction caused by having kernel data structures mapped
below VM_MIN_KERNEL_ADDRESS, map the vm_page array starting at
VM_MIN_KERNEL_ADDRESS instead of using a dedicated PML4 entry.

Reviewed by: kib
Discussed with: jeff
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21491


# a70e17ee 26-Aug-2019 Mark Johnston <markj@FreeBSD.org>

Fix a few nits in vm_pqbatch_process_page().

- Don't bother masking off non-queue state flags when loading the
page's atomic state, since it is only required for one of the
function's assertions. Update the assertion instead.
- Remove an incorrect comment regarding synchronization with the
page daemon. The page daemon only ever checks for PGA_ENQUEUED
with the page queue lock held.
- When clearing requeue flags, only clear the flags that have been
acted upon.

Reviewed by: kib (previous version)
Discussed with: alc
Tested by: pho (part of a larger patch)
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21368


# f93670b7 23-Aug-2019 Mark Johnston <markj@FreeBSD.org>

Stop clearing page flags in vm_page_pqbatch_submit().

All existing callers guarantee that the page does not have a
pre-existing dequeue pending. Thus, if the page is dequeued before
pqbatch_submit() acquires the page queue lock, we do not need to do
anything since vm_page_dequeue_complete() takes care of clearing all
page queue state flags for us.

With this change, vm_page_pqbatch_submit() has the nice property that it
does not directly modify any fields in the page structure.

Reviewed by: alc, kib
Tested by: pho (part of a larger change)
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21372


# 386eba08 23-Aug-2019 Mark Johnston <markj@FreeBSD.org>

Make vm_pqbatch_submit_page() externally visible.

It will become useful for the page daemon to be able to directly create
a batch queue entry for a page, and without modifying the page
structure. Rename vm_pqbatch_submit_page() to vm_page_pqbatch_submit()
to keep the namespace consistent. No functional change intended.

Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21369


# 8b90607f 21-Aug-2019 Mark Johnston <markj@FreeBSD.org>

Simplify vm_page_dequeue() and fix an assertion.

- Add a vm_pagequeue_remove() function to physically remove a page
from its queue and update the queue length.
- Remove vm_page_pagequeue_lockptr() and let vm_page_pagequeue()
return NULL for dequeued pages.
- Avoid unnecessarily reloading the queue index if vm_page_dequeue()
loses a race with a concurrent queue operation.
- Correct an always-true assertion: vm_page_dequeue() may be called
from the page allocator with the page unlocked. The assertion
m->order == VM_NFREEORDER simply tests whether the page has been
removed from the vm_phys free lists; instead, check whether the
page belongs to an object.

Reviewed by: kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21341


# 3e5e1b51 18-Aug-2019 Jeff Roberson <jeff@FreeBSD.org>

Allocate amd64's page array using pages and page directory pages from the
NUMA domain that the pages describe. Patch original from gallatin.

Reviewed by: kib
Tested by: pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21252


# b7565d44 18-Aug-2019 Jeff Roberson <jeff@FreeBSD.org>

Encapsulate phys_avail manipulation in a set of simple routines. Add a
NUMA aware boot time memory allocator that will be used to allocate early
domain correct structures. Code partially submitted by gallatin.

Reviewed by: gallatin, kib
Tested by: pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21251


# 245139c6 16-Aug-2019 Konstantin Belousov <kib@FreeBSD.org>

Fix OOM handling of some corner cases.

In addition to pagedaemon initiating OOM, also do it from the
vm_fault() internals. Namely, if the thread waits for a free page to
satisfy page fault some preconfigured amount of time, trigger OOM.
These triggers are rate-limited, due to a usual case of several
threads of the same multi-threaded process to enter fault handler
simultaneously. The faults from pagedaemon threads participate in the
calculation of OOM rate, but are not under the limit.

Reviewed by: markj (previous version)
Tested by: pho
Discussed with: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D13671


# 98549e2d 29-Jul-2019 Mark Johnston <markj@FreeBSD.org>

Centralize the logic in vfs_vmio_unwire() and sendfile_free_page().

Both of these functions atomically unwire a page, optionally attempt
to free the page, and enqueue or requeue the page. Add functions
vm_page_release() and vm_page_release_locked() to perform the same task.
The latter must be called with the page's object lock held.

As a side effect of this refactoring, the buffer cache will no longer
attempt to free mapped pages when completing direct I/O. This is
consistent with the handling of pages by sendfile(SF_NOCACHE).

Reviewed by: alc, kib
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20986


# b16e57a6 20-Jul-2019 Mark Johnston <markj@FreeBSD.org>

Rename vm_page_{import,release}() to vm_page_zone_{import,release}().

I would like to use the name vm_page_release() for a different purpose,
and vm_page_{import,release}() are local to vm_page.c.

Reviewed by: kib
MFC after: 1 week


# eeacb3b0 08-Jul-2019 Mark Johnston <markj@FreeBSD.org>

Merge the vm_page hold and wire mechanisms.

The hold_count and wire_count fields of struct vm_page are separate
reference counters with similar semantics. The remaining essential
differences are that holds are not counted as a reference with respect
to LRU, and holds have an implicit free-on-last unhold semantic whereas
vm_page_unwire() callers must explicitly determine whether to free the
page once the last reference to the page is released.

This change removes the KPIs which directly manipulate hold_count.
Functions such as vm_fault_quick_hold_pages() now return wired pages
instead. Since r328977 the overhead of maintaining LRU for wired pages
is lower, and in many cases vm_fault_quick_hold_pages() callers would
swap holds for wirings on the returned pages anyway, so with this change
we remove a number of page lock acquisitions.

No functional change is intended. __FreeBSD_version is bumped.

Reviewed by: alc, kib
Discussed with: jeff
Discussed with: jhb, np (cxgbe)
Tested by: pho (previous version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19247


# 46736e30 08-Jul-2019 Mark Johnston <markj@FreeBSD.org>

Elide the vm_reserv_free_page() call when PG_PCPU_CACHE is set.

Pages with PG_PCPU_CACHE set cannot have been allocated from a
reservation, so as an optimization, skip the call to
vm_reserv_free_page() in this case. Otherwise, the access of
the corresponding reservation structure often results in a cache
miss.

Reviewed by: alc, kib
Discussed with: jeff
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20859


# d9a73522 08-Jul-2019 Mark Johnston <markj@FreeBSD.org>

Add a per-CPU page cache per VM free pool.

Some workloads benefit from having a per-CPU cache for
VM_FREEPOOL_DIRECT pages.

Reviewed by: dougm, kib
Discussed with: alc, jeff
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20858


# 9f74cdbf 02-Jul-2019 Mark Johnston <markj@FreeBSD.org>

Mark pages allocated from the per-CPU cache.

Only free pages to the cache when they were allocated from that cache.
This mitigates rapid fragmentation of physical memory seen during
poudriere's dependency calculation phase. In particular, pages
belonging to broken reservations are no longer freed to the per-CPU
cache, so they get a chance to coalesce with freed pages during the
break. Otherwise, the optimized CoW handler may create object
chains in which multiple objects contain pages from the same
reservation, and the order in which we do object termination means
that the reservation is broken before all of those pages are freed,
so some of them end up in the per-CPU cache and thus permanently
fragment physical memory.

The flag may also be useful for eliding calls to vm_reserv_free_page(),
thus avoiding memory accesses for data that is likely not present
in the CPU caches.

Reviewed by: alc
Discussed with: jeff
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20763
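
As a rough illustration of the marking scheme above: a flag recorded at
allocation time decides where the page goes when it is freed. The flag name
PGF_PCPU_CACHE and both helper functions are hypothetical and stand in for
the real allocator paths.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PGF_PCPU_CACHE	0x1u	/* hypothetical "came from the cache" bit */

struct page {
	unsigned int flags;
};

enum free_dest {
	FREE_TO_PCPU_CACHE,
	FREE_TO_PHYS		/* back to the physical allocator */
};

/* Record at allocation time whether the page came from the cache. */
void
page_mark_alloc(struct page *m, bool from_cache)
{
	assert(m != NULL);
	if (from_cache)
		m->flags |= PGF_PCPU_CACHE;
	else
		m->flags &= ~PGF_PCPU_CACHE;
}

/* Only pages that were allocated from the cache are returned to it. */
enum free_dest
page_free_dest(const struct page *m)
{
	assert(m != NULL);
	if ((m->flags & PGF_PCPU_CACHE) != 0)
		return (FREE_TO_PCPU_CACHE);
	return (FREE_TO_PHYS);
}
```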


# 0fd977b3 26-Jun-2019 Mark Johnston <markj@FreeBSD.org>

Add a return value to vm_page_remove().

Use it to indicate whether the page may be safely freed following
its removal from the object. Also change vm_page_remove() to assume
that the page's object pointer is non-NULL, and have callers perform
this check instead.

This is a step towards an implementation of an atomic reference counter
for each physical page structure.

Reviewed by: alc, dougm, kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20758


# ee1f1685 19-Jun-2019 Mark Johnston <markj@FreeBSD.org>

Group vm_page_activate()'s definition with other related functions.

No functional change intended.

MFC after: 3 days


# 2d274871 04-Jun-2019 Mark Johnston <markj@FreeBSD.org>

Remove an outdated header comment for vm_page.c.

The listed rules were incomplete and outdated. There is a much more
comprehensive comment in vm_page.h.

Reviewed by: alc, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D20503


# 2d5039db 02-Jun-2019 Alan Cox <alc@FreeBSD.org>

Retire vm_reserv_extend_{contig,page}(). These functions were introduced
as part of a false start toward fine-grained reservation locking. In the
end, they were not needed, so eliminate them.

Order the parameters to vm_reserv_alloc_{contig,page}() consistently with
the vm_page functions that call them.

Update the comments about the locking requirements for
vm_reserv_alloc_{contig,page}(). They no longer require a free page
queues lock.

Wrap several lines that became too long after the "req" and "domain"
parameters were added to vm_reserv_alloc_{contig,page}().

Reviewed by: kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D20492


# d842aa51 01-Jun-2019 Mark Johnston <markj@FreeBSD.org>

Add a vm_page_wired() predicate.

Use it instead of accessing the wire_count field directly. No
functional change intended.

Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20485


# b8590dae 31-May-2019 Doug Moore <dougm@FreeBSD.org>

The function vm_phys_free_contig invokes vm_phys_free_pages for every
power-of-two page block it frees, launching an unsuccessful search for
a buddy to pair up with each time. The only possible buddy-up mergers
are across the boundaries of the freed region, so change
vm_phys_free_contig simply to enqueue the freed interior blocks, via a
new function vm_phys_enqueue_contig, and then call vm_phys_free_pages
on the bounding blocks to create as big a cross-boundary block as
possible after buddy-merging.

The only callers of vm_phys_free_contig at the moment call it in
situations where merging blocks across the boundary is clearly
impossible, so just call vm_phys_enqueue_contig in those places and
avoid trying to buddy-up at all.

One beneficiary of this change is in breaking reservations. For the
case where memory is freed in breaking a reservation with only the
first and last pages allocated, the number of cycles consumed by the
operation drops about 11% with this change.

Suggested by: alc
Reviewed by: alc
Approved by: kib, markj (mentors)
Differential Revision: https://reviews.freebsd.org/D16901
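
To make the "power-of-two page block" decomposition concrete, the sketch
below splits a contiguous run of page frames into maximal, naturally aligned
power-of-two blocks. It is an illustration only: split_contig(), struct
block and the MAX_ORDER cap are assumptions for the example, not the
vm_phys implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ORDER	12	/* hypothetical cap: largest block is 2^12 pages */

struct block {
	uint64_t pfn;		/* first page frame of the block */
	int order;		/* block covers 2^order pages */
};

/*
 * Split [pfn, pfn + npages) into maximal naturally aligned blocks.
 * Writes at most "cap" blocks to "out" and returns how many were
 * written; the run is fully covered only if the result is below "cap"
 * or the last block ends exactly at pfn + npages.
 */
size_t
split_contig(uint64_t pfn, uint64_t npages, struct block *out, size_t cap)
{
	size_t n = 0;

	assert(out != NULL);
	while (npages > 0 && n < cap) {
		int order = MAX_ORDER;

		/* Shrink until the block is aligned at pfn and fits. */
		while (order > 0 &&
		    ((pfn & ((UINT64_C(1) << order) - 1)) != 0 ||
		    (UINT64_C(1) << order) > npages))
			order--;

		out[n].pfn = pfn;
		out[n].order = order;
		n++;
		pfn += UINT64_C(1) << order;
		npages -= UINT64_C(1) << order;
	}
	return (n);
}
```

For a run of 10 pages starting at frame 3, this yields blocks of order 0, 2,
2 and 0 starting at frames 3, 4, 8 and 12. In the terms of the commit
message, the first and last of these are the bounding blocks and the rest
are interior blocks.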


# 42447bb5 31-May-2019 Mark Johnston <markj@FreeBSD.org>

Remove a redundant vm_page_remove() call.

vm_page_free_prep() removes the page from its object. No functional
change intended.

Reviewed by: alc, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D20469


# 3b5b2029 05-Mar-2019 Mark Johnston <markj@FreeBSD.org>

Implement minidump support for RISC-V.

Submitted by: Mitchell Horne <mhorne063@gmail.com>
Differential Revision: https://reviews.freebsd.org/D18320


# 1e2b3e6f 03-Feb-2019 Mark Johnston <markj@FreeBSD.org>

Allow vm_page_free_prep() to dequeue pages without the page lock.

This is a step towards being able to free pages without the page
lock held. The approach is simply to add an implementation of
vm_page_dequeue_deferred() which does not assert that the page
lock is held. Formally, the page lock is required to set
PGA_DEQUEUE, but in the case of vm_page_free_prep() we get the
same mutual exclusion for free by virtue of the fact that no
other references to the page may exist.

No functional change intended.

Reviewed by: kib (previous version)
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19065


# d0488e69 03-Feb-2019 Mark Johnston <markj@FreeBSD.org>

Fix a race in vm_page_dequeue_deferred().

To detect the case where the page is already marked for a deferred
dequeue, we must read the "queue" and "aflags" fields in a
precise order. Otherwise, a race with a concurrent
vm_page_dequeue_complete() could leave the page with PGA_DEQUEUE
set despite it already having been dequeued. Fix the problem by
using vm_page_queue() to check the queue state, which correctly
handles the race.

Reviewed by: kib
Tested by: pho
MFC after: 3 days
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19039


# bb15d1c7 14-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

o Move zone limit from keg level up to zone level. This means that now
two zones sharing a keg may have different limits. Now this is going
to work:

zone = uma_zcreate();
uma_zone_set_max(zone, limit);
zone2 = uma_zsecond_create(zone);
uma_zone_set_max(zone2, limit2);

Kegs no longer have uk_maxpages field, but zones have uz_items. When
set, it may be rounded up to minimum possible CPU bucket cache size.
For small limits bucket cache can also be reconfigured to be smaller.
Counter uz_items is updated whenever items transition from keg to a
bucket cache or directly to a consumer. If zone has uz_maxitems set and
it is reached, then we are going to sleep.

o Since new limits don't play well with multi-keg zones, remove them. The
idea of multi-keg zones was introduced exactly 10 years ago, and has never
had a practical usage. In discussion with Jeff we came to a wild
agreement that if we ever want to reintroduce the idea of a smart allocator
that would be able to choose between two (or more) totally different
backing stores, that choice should be made one level higher than UMA,
e.g. in malloc(9) or in mget(), or whatever and choice should be controlled
by the caller.

o Sleeping code is improved to account for the number of sleepers and wake them one
by one, to avoid thundering herd problem.

o Flag UMA_ZONE_NOBUCKETCACHE removed, instead uma_zone_set_maxcache()
KPI added. Having no bucket cache basically means setting maxcache to 0.

o Now with many fields added and many removed (no multi-keg zones!) make
sure that struct uma_zone is perfectly aligned.

Reviewed by: markj, jeff
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D17773


# 9cc36b3d 07-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Fix regression in r331368, that broke dumping of UMA startup pages
when WITNESS is present.

Discussed with: markj


# 7af49852 30-Dec-2018 Konstantin Belousov <kib@FreeBSD.org>

Add 'v' modifier to the ddb 'show pginfo' command to display vm_page
backing the provided kernel virtual address.

Tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# e31fc3ab 29-Nov-2018 Mark Johnston <markj@FreeBSD.org>

Update the free page count when blacklisting pages.

Otherwise the free page count will not accurately reflect the physical
page allocator's state. On 11 this can trigger panics in
vm_page_alloc() since the allocator state and free page count are
updated atomically and we expect them to stay in sync. On 12 the
bug would manifest as threads looping in vm_page_alloc().

PR: 231296
Reported by: mav, wollman, Rainer Duffner, Josh Gitlin
Reviewed by: alc, kib, mav
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18374


# 920239ef 30-Oct-2018 Mark Johnston <markj@FreeBSD.org>

Fix some problems that manifest when NUMA domain 0 is empty.

- In uma_prealloc(), we need to check for an empty domain before the
first allocation attempt, not after. Fix this by switching
uma_prealloc() to use a vm_domainset iterator, which addresses the
secondary issue of using a signed domain identifier in round-robin
iteration.
- Don't automatically create a page daemon for domain 0.
- In domainset_empty_vm(), recompute ds_cnt and ds_order after
excluding empty domains; otherwise we may frequently specify an empty
domain when calling in to the page allocator, wasting CPU time.
Convert DOMAINSET_PREF() policies for empty domains to round-robin.
- When freeing bootstrap pages, don't count them towards the per-domain
total page counts for now: some vm_phys segments are created before
the SRAT is parsed and are thus always identified as being in domain 0
even when they are not. Then, when bootstrap pages are freed, they
are added to a domain that we had previously thought was empty. Until
this is corrected, we simply exclude them from the per-domain page
count.

Reported and tested by: Rajesh Kumar <rajfbsd@gmail.com>
Reviewed by: gallatin
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17704


# 4c29d2de 23-Oct-2018 Mark Johnston <markj@FreeBSD.org>

Refactor domainset iterators for use by malloc(9) and UMA.

Before this change we had two flavours of vm_domainset iterators: "page"
and "malloc". The latter was only used for kmem_*() and hard-coded its
behaviour based on kernel_object's policy. Moreover, its use contained
a race similar to that fixed by r338755 since the kernel_object's
iterator was being run without the object lock.

In some cases it is useful to be able to explicitly specify a policy
(domainset) or policy+iterator (domainset_ref) when performing memory
allocations. To that end, refactor the vm_domainset_* KPI to permit
this, and get rid of the "malloc" domainset_iter KPI in the process.

Reviewed by: jeff (previous version)
Tested by: pho (part of a larger patch)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17417


# 463406ac 24-Sep-2018 Mark Johnston <markj@FreeBSD.org>

Add more NUMA-specific low memory predicates.

Use these predicates instead of inline references to vm_min_domains.
Also add a global all_domains set, akin to all_cpus.

Reviewed by: alc, jeff, kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17278


# 7a364d45 10-Sep-2018 Mark Johnston <markj@FreeBSD.org>

Split some checks in vm_page_activate() to make it easier to read.

No functional change intended.

Reviewed by: alc, kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17028


# 5a7f9937 08-Sep-2018 Mark Johnston <markj@FreeBSD.org>

Relax an assertion in vm_pqbatch_process_page().

While executing vm_pqbatch_process_page(m), m->queue may change to
PQ_NONE if the page daemon is concurrently freeing the page. In this
case m's queue state flags must be clear, so vm_pqbatch_process_page()
will be a no-op, but the race could cause spurious assertion failures.
Correct the assertion which assumed that m->queue's value does not
change while the page queue lock is held.

Reviewed by: alc, kib
Reported and tested by: pho
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17027


# 23984ce5 06-Sep-2018 Mark Johnston <markj@FreeBSD.org>

Avoid resource deadlocks when one domain has exhausted its memory. Attempt
other allowed domains if the requested domain is below the minimum paging
threshold. Block in fork only if all domains available to the forking
thread are below the severe threshold rather than any.

Submitted by: jeff
Reported by: mjg
Reviewed by: alc, kib, markj
Approved by: re (rgrimes)
Differential Revision: https://reviews.freebsd.org/D16191
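
The two policies in the commit above, falling back to another allowed domain
and blocking only when all allowed domains are low, might be sketched as
follows. The function names, the 32-domain bitmask and the threshold
parameters are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Pick a domain for an allocation.  Prefer "preferred"; if it is below
 * the minimum threshold, try the other domains in the "allowed" mask.
 * Returns -1 if every candidate is low, meaning the caller must wait.
 * Assumes ndomains <= 32.
 */
int
pick_domain(const unsigned int *free_count, int ndomains, uint32_t allowed,
    int preferred, unsigned int min_thresh)
{
	int d;

	assert(preferred >= 0 && preferred < ndomains);
	if (free_count[preferred] >= min_thresh)
		return (preferred);
	for (d = 0; d < ndomains; d++) {
		if (d != preferred && (allowed & (UINT32_C(1) << d)) != 0 &&
		    free_count[d] >= min_thresh)
			return (d);
	}
	return (-1);
}

/* True only if all allowed domains are below the severe threshold. */
bool
all_domains_severe(const unsigned int *free_count, int ndomains,
    uint32_t allowed, unsigned int severe_thresh)
{
	int d;

	for (d = 0; d < ndomains; d++) {
		if ((allowed & (UINT32_C(1) << d)) != 0 &&
		    free_count[d] >= severe_thresh)
			return (false);
	}
	return (true);
}
```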


# 21f01f45 06-Sep-2018 Mark Johnston <markj@FreeBSD.org>

Remove vm_page_remque().

Testing m->queue != PQ_NONE is not sufficient; see the commit log
message for r338276. As of r332974 vm_page_dequeue() handles
already-dequeued pages, so just replace vm_page_remque() calls with
vm_page_dequeue() calls.

Reviewed by: kib
Tested by: pho
Approved by: re (marius)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17025


# 899fe184 23-Aug-2018 Mark Johnston <markj@FreeBSD.org>

Add a per-pagequeue pdpages counter.

Expose these counters under the vm.domain sysctl node. The existing
vm.stats.vm.v_pdpages sysctl is preserved.

Reviewed by: alc (previous version)
Differential Revision: https://reviews.freebsd.org/D14666


# 99d92d73 23-Aug-2018 Mark Johnston <markj@FreeBSD.org>

Ensure that queue state is cleared when vm_page_dequeue() returns.

Per-page queue state is updated non-atomically, with either the page
lock or the page queue lock held. When vm_page_dequeue() is called
without the page lock, in rare cases a different thread may be
concurrently dequeuing the page with the pagequeue lock held. Because
of the non-atomic update, vm_page_dequeue() might return before queue
state is completely updated, which can lead to race conditions.

Restrict the vm_page_dequeue() interface so that it must be called
either with the page lock held or on a free page, and busy wait when
a different thread is concurrently updating queue state, which must
happen in a critical section.

While here, do some related cleanup: inline vm_page_dequeue_locked()
into its only caller and delete a prototype for the unimplemented
vm_page_requeue_locked(). Replace the volatile qualifier for "queue"
added in r333703 with explicit uses of atomic_load_8() where required.

Reported and tested by: pho
Reviewed by: alc
Differential Revision: https://reviews.freebsd.org/D15980


# 2bf95012 05-Jul-2018 Andrew Turner <andrew@FreeBSD.org>

Create a new macro for static DPCPU data.

On arm64 (and possibly other architectures) we are unable to use static
DPCPU data in kernel modules. This is because the compiler will generate
PC-relative accesses, however the runtime-linker expects to be able to
relocate these.

In preparation to fix this create two macros depending on if the data is
global or static.

Reviewed by: bz, emaste, markj
Sponsored by: ABT Systems Ltd
Differential Revision: https://reviews.freebsd.org/D16140


# a66d7a8d 05-Jul-2018 Konstantin Belousov <kib@FreeBSD.org>

Copyout(9) on 4/4 i386 needs correct vm_page_array[].

On the 4/4 i386, copyout(9) may need to call pmap_extract_and_hold()
on arbitrary userspace mapping. If the mapping is backed by the
non-managed cdev pager or by the sg pager, on dense configs we might
access arbitrary element of vm_page_array[], in particular, not
corresponding to a page from the memory segment. Initialize such pages
as fictitious with the corresponding physical address.

Reported by: bde
Reviewed by: alc, markj (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D16085


# 89ea39a7 26-Jun-2018 Alan Cox <alc@FreeBSD.org>

Update the physical page selection strategy used by vm_page_import() so
that it does not cause rapid fragmentation of the free physical memory.

Reviewed by: jeff, markj (an earlier version)
Differential Revision: https://reviews.freebsd.org/D15976


# 16e05b32 07-Jun-2018 Jonathan T. Looney <jtl@FreeBSD.org>

Fix a typo in vm_domain_set(). When a domain crosses into the severe range,
we need to set the domain bit from the vm_severe_domains bitset (instead
of clearing it).

Reviewed by: jeff, markj
Sponsored by: Netflix, Inc.
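
A minimal sketch of the set-versus-clear distinction described above, assuming a
64-bit word stands in for the vm_severe_domains bitset. The names
domain_update_severe() and domain_is_severe() and the threshold parameter are
hypothetical and are not the kernel's interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAXDOMAINS 64

/* Bit d set means domain d is in the severe range (sketch only). */
static uint64_t severe_domains;

static void
domain_update_severe(int domain, unsigned free_count, unsigned severe_thresh)
{
	assert(domain >= 0 && domain < MAXDOMAINS);
	if (free_count < severe_thresh)
		severe_domains |= UINT64_C(1) << domain;	/* entering: set */
	else
		severe_domains &= ~(UINT64_C(1) << domain);	/* leaving: clear */
}

static bool
domain_is_severe(int domain)
{
	assert(domain >= 0 && domain < MAXDOMAINS);
	return (((severe_domains >> domain) & 1) != 0);
}
```

Clearing the bit on entry, as the typo did, would leave a depleted domain
looking healthy to anyone consulting the bitset.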


# a99ee60b 22-May-2018 Mark Johnston <markj@FreeBSD.org>

Ensure that "m" is initialized in vm_page_alloc_freelist_domain().

While here, remove a superfluous comment.

Coverity CID: 1383559
MFC after: 3 days


# ba2b3349 16-May-2018 Mark Johnston <markj@FreeBSD.org>

Fix a race in vm_page_pagequeue_lockptr().

The value of m->queue must be cached after comparing it with PQ_NONE,
since it may be concurrently changing.

Reported by: glebius
Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D15462
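
The pattern behind this fix can be sketched as follows, assuming C11 atomics in
place of the kernel's atomic(9) primitives. The types page_sketch and
pagequeue_sketch, the function page_pagequeue(), and the value of PQ_COUNT are
hypothetical; only the idea of loading the field once and reusing the cached
value comes from the commit message.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define PQ_COUNT 4	/* number of page queues in this sketch */
#define PQ_NONE  255	/* "not on any queue" sentinel */

struct pagequeue_sketch {
	int pq_dummy;
};

struct page_sketch {
	_Atomic uint8_t queue;	/* may change concurrently */
};

static struct pagequeue_sketch pagequeues[PQ_COUNT];

/*
 * Return the queue the page was on at the time of the single load,
 * or NULL if it was on none.  Reading m->queue a second time to index
 * the array could observe PQ_NONE after the comparison had passed.
 */
static struct pagequeue_sketch *
page_pagequeue(struct page_sketch *m)
{
	uint8_t queue;

	queue = atomic_load_explicit(&m->queue, memory_order_relaxed);
	if (queue == PQ_NONE)
		return (NULL);
	assert(queue < PQ_COUNT);
	return (&pagequeues[queue]);
}
```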


# 1b5c869d 04-May-2018 Mark Johnston <markj@FreeBSD.org>

Fix some races introduced in r332974.

With r332974, when performing a synchronized access of a page's "queue"
field, one must first check whether the page is logically dequeued. If
so, then the page lock does not prevent the page from being removed
from its page queue. Introduce vm_page_queue(), which returns the page's
logical queue index. In some cases, direct access to the "queue" field
is still required, but such accesses should be confined to sys/vm.

Reported and tested by: pho
Reviewed by: kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D15280


# 5cd29d0f 24-Apr-2018 Mark Johnston <markj@FreeBSD.org>

Improve VM page queue scalability.

Currently both the page lock and a page queue lock must be held in
order to enqueue, dequeue or requeue a page in a given page queue.
The queue locks are a scalability bottleneck in many workloads. This
change reduces page queue lock contention by batching queue operations.
To detangle the page and page queue locks, per-CPU batch queues are
used to reference pages with pending queue operations. The requested
operation is encoded in the page's aflags field with the page lock
held, after which the page is enqueued for a deferred batch operation.
Page queue scans are similarly optimized to minimize the amount of
work performed with a page queue lock held.

Reviewed by: kib, jeff (previous versions)
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14893
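
A rough, single-threaded sketch of the batching idea, assuming a counter stands
in for the page queue lock and small integers stand in for pages. BATCH_SIZE,
batch_enqueue() and batch_flush() are hypothetical names; the real change also
encodes the requested operation in the page's aflags, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_SIZE 8

struct pagequeue_sketch {
	unsigned lock_acquisitions;	/* stands in for a real mutex */
	unsigned enqueued;		/* pages placed on the queue */
};

struct batchqueue_sketch {
	int pages[BATCH_SIZE];		/* pages with pending operations */
	size_t count;
};

/* Apply every pending operation under one lock acquisition. */
static void
batch_flush(struct batchqueue_sketch *bq, struct pagequeue_sketch *pq)
{
	if (bq->count == 0)
		return;
	pq->lock_acquisitions++;	/* lock taken once per batch */
	pq->enqueued += (unsigned)bq->count;
	bq->count = 0;
}

/* Record a pending enqueue; touch the shared queue only when full. */
static void
batch_enqueue(struct batchqueue_sketch *bq, struct pagequeue_sketch *pq,
    int page)
{
	assert(bq->count < BATCH_SIZE);
	bq->pages[bq->count++] = page;
	if (bq->count == BATCH_SIZE)
		batch_flush(bq, pq);
}
```

With a batch size of 8, enqueuing 20 pages takes the shared lock twice, plus
once more for a final flush, rather than 20 times.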


# 64b38930 19-Apr-2018 Mark Johnston <markj@FreeBSD.org>

Initialize marker pages in vm_page_domain_init().

They were previously initialized by the corresponding page daemon
threads, but for vmd_inacthead this may be too late if
vm_page_deactivate_noreuse() is called during boot.

Reported and tested by: cperciva
Reviewed by: alc, kib
MFC after: 1 week


# 9de8fcfd 17-Apr-2018 Mark Johnston <markj@FreeBSD.org>

Ensure that m and skip_m belong to the same object.

Pages allocated from a given reservation may belong to different
objects. It is therefore possible for vm_page_ps_test() to be called
with the base page's object unlocked. Check for this case before
asserting that the object lock is held.

Reported by: jhb
Reviewed by: kib
MFC after: 1 week


# e55d32b7 07-Apr-2018 Konstantin Belousov <kib@FreeBSD.org>

Handle Skylake-X errata SKZ63.

SKZ63 Processor May Hang When Executing Code In an HLE Transaction
Region

Problem: Under certain conditions, if the processor acquires an HLE
(Hardware Lock Elision) lock via the XACQUIRE instruction in the Host
Physical Address range between 40000000H and 403FFFFFH, it may hang
with an internal timeout error (MCACOD 0400H) logged into
IA32_MCi_STATUS.

Move the pages from the range into the blacklist. Add a tunable to
not waste 4M if local DoS is not the issue.

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D15001
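
The range arithmetic can be checked with a short sketch. The address bounds are
taken from the erratum text above; the 4 KB base page size and the helper names
skz63_affected() and skz63_npages() are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE	4096u			/* assumed base page size */
#define SKZ63_START	UINT64_C(0x40000000)
#define SKZ63_END	UINT64_C(0x403FFFFF)	/* inclusive */

/* Is the physical address inside the range named by the erratum? */
static bool
skz63_affected(uint64_t pa)
{
	return (pa >= SKZ63_START && pa <= SKZ63_END);
}

/* Number of base pages that blacklisting the whole range removes. */
static uint64_t
skz63_npages(void)
{
	return ((SKZ63_END - SKZ63_START + 1) / PAGE_SIZE);
}
```

The range covers 0x400000 bytes, which is the 4M mentioned above, or 1024 pages
of 4 KB each.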


# c33e3a64 31-Mar-2018 Jeff Roberson <jeff@FreeBSD.org>

Add a uma cache of free pages in the DEFAULT freepool. This gives us
per-cpu alloc and free of pages. The cache is filled with as few trips
to the phys allocator as possible by the use of a new
vm_phys_alloc_npages() function which allocates as many as N pages.

This code was originally by markj with the import function rewritten by
me.

Reviewed by: markj, kib
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14905
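
A simplified sketch of a cache that refills in bulk, assuming a single cache
rather than one per CPU and integers in place of pages. phys_alloc_npages()
here is a hypothetical stand-in and does not reproduce the signature of
vm_phys_alloc_npages(); CACHE_SIZE and cache_alloc() are likewise assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SIZE 4

struct phys_sketch {
	size_t free_pages;	/* pages left in the physical allocator */
	unsigned trips;		/* number of calls into the allocator */
	int next_page;		/* next page identifier to hand out */
};

struct pgcache_sketch {
	int pages[CACHE_SIZE];
	size_t count;
};

/* Allocate up to npages in one call; return how many were obtained. */
static size_t
phys_alloc_npages(struct phys_sketch *ph, size_t npages, int *out)
{
	size_t i, n;

	ph->trips++;
	n = npages < ph->free_pages ? npages : ph->free_pages;
	for (i = 0; i < n; i++)
		out[i] = ph->next_page++;
	ph->free_pages -= n;
	return (n);
}

/* Allocate one page, refilling the cache in bulk when it is empty. */
static int
cache_alloc(struct pgcache_sketch *c, struct phys_sketch *ph)
{
	if (c->count == 0)
		c->count = phys_alloc_npages(ph, CACHE_SIZE, c->pages);
	if (c->count == 0)
		return (-1);	/* out of memory */
	return (c->pages[--c->count]);
}
```

Ten single-page allocations against a cache of four make three trips to the
allocator in this sketch instead of ten.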


# e5818a53 28-Mar-2018 Jeff Roberson <jeff@FreeBSD.org>

Implement several enhancements to NUMA policies.

Add a new "interleave" allocation policy which stripes pages across
domains with a stride or width keeping contiguity within a multi-page
region.

Move the kernel to the dedicated numbered cpuset #2 making it possible
to assign kernel threads and memory policy separately from user. This
also eliminates the need for the complicated interrupt binding code.

Add a sysctl API for viewing and manipulating domainsets. Refactor some
of the cpuset_t manipulation code using the generic bitset type so that
it can be used for both. This probably belongs in a dedicated subr file.

Attempt to improve the include situation.

Reviewed by: kib
Discussed with: jhb (cpuset parts)
Tested by: pho (before review feedback)
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14839
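
One way to picture the "interleave" policy is the sketch below, which assumes
the stride is given as a page count and that domains are numbered from zero.
interleave_domain() is a hypothetical helper; the kernel's iterator is more
involved and is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pages within one stride-sized run share a domain, preserving
 * contiguity inside the run; successive runs rotate across domains.
 */
static int
interleave_domain(uint64_t pindex, uint64_t stride, int ndomains)
{
	assert(stride > 0 && ndomains > 0);
	return ((int)((pindex / stride) % (uint64_t)ndomains));
}
```

With a stride of 512 pages and two domains, page indices 0 through 511 map to
domain 0, 512 through 1023 to domain 1, and 1024 starts over at domain 0.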


# 2d3f4181 23-Mar-2018 Jeff Roberson <jeff@FreeBSD.org>

Fix two compilation problems on non-amd64 architectures.


# 40468513 23-Mar-2018 Mark Johnston <markj@FreeBSD.org>

Correct a couple of assertion messages in vm_page_reclaim_run().

MFC after: 3 days


# 5c930c89 22-Mar-2018 Jeff Roberson <jeff@FreeBSD.org>

Lock reservations with a dedicated lock in each reservation. Protect the
vmd_free_count with atomics.

This allows us to allocate and free from reservations without the free lock
except where a superpage is allocated from the physical layer, which is
roughly 1/512 of the operations on amd64.

Use the counter api to eliminate cache contention on counters.

Reviewed by: markj
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14707


# 9a4b4cd3 22-Mar-2018 Jeff Roberson <jeff@FreeBSD.org>

Start witness much earlier in boot so that we can shrink the pend list and
make it more immune to further change.

Reviewed by: markj, imp (Part of D14707)
Sponsored by: Netflix, Dell/EMC Isilon


# 0eb50f9c 18-Mar-2018 Mark Johnston <markj@FreeBSD.org>

Have vm_page_{deactivate,launder}() requeue already-queued pages.

In many cases the page is not enqueued so the change will have no
effect. However, the change is needed to support an optimization in
the fault handler and in some cases (sendfile, the buffer cache) it
was being emulated by the caller anyway.

Reviewed by: alc
Tested by: pho
MFC after: 2 weeks
X-Differential Revision: https://reviews.freebsd.org/D14625


# 434862ac 18-Mar-2018 Mark Johnston <markj@FreeBSD.org>

Have vm_page_replace() assert that the new page is not enqueued.

The new page does not belong to a VM object, but the page daemon does
not expect to encounter such pages.

Reviewed by: alc, kib
Tested by: pho
MFC after: 1 week
X-Differential Revision: https://reviews.freebsd.org/D14625


# 30fbfdda 15-Mar-2018 Jeff Roberson <jeff@FreeBSD.org>

Eliminate pageout wakeup races. Take another step towards lockless
vmd_free_count manipulation. Reduce the scope of the free lock by
using a pageout lock to synchronize sleep and wakeup. Only trigger
the pageout daemon on transitions between states. Drive all wakeup
operations directly as side-effects from freeing memory rather than
requiring an additional function call.

Reviewed by: markj, kib
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14612
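
Triggering only on transitions can be sketched as below, assuming a single
threshold and a counter in place of the actual wakeup. domain_sketch,
crossed_below() and domain_alloc() are hypothetical names, and the sketch is
single-threaded, so it shows the edge detection only, not the locking.

```c
#include <assert.h>
#include <stdbool.h>

struct domain_sketch {
	unsigned free_count;
	unsigned wakeup_thresh;	/* wake the daemon below this many pages */
	unsigned wakeups;	/* number of wakeups issued */
};

/* Did the count move from at-or-above the threshold to below it? */
static bool
crossed_below(unsigned old, unsigned now, unsigned thresh)
{
	return (old >= thresh && now < thresh);
}

static void
domain_alloc(struct domain_sketch *d, unsigned npages)
{
	unsigned old;

	assert(npages <= d->free_count);
	old = d->free_count;
	d->free_count -= npages;
	if (crossed_below(old, d->free_count, d->wakeup_thresh))
		d->wakeups++;	/* stands in for waking the page daemon */
}
```

Allocations that keep the count on the same side of the threshold issue no
wakeup, so a burst of allocations below the threshold wakes the daemon once.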


# 741e1c91 13-Mar-2018 Konstantin Belousov <kib@FreeBSD.org>

Revert the chunk from r330410 in vm_page_reclaim_run().

There, the pages freed might be managed but the page's lock is not
owned. For KPI correctness, the page lock is required around the call
to vm_page_free_prep(), which is asserted. Reclaim loop already did
the work which could be done by vm_page_free_prep(), so the lock is
not needed and the only consequence of not owning it is the assert
trigger.

Instead of adding the locking to satisfy the assert, revert to the
code that calls vm_page_free_phys() directly.

Reported by: pho
Discussed with: jeff
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 2a8e8f78 04-Mar-2018 Konstantin Belousov <kib@FreeBSD.org>

Remove redundant test from r330410.

If the input slist is non-empty, counter cannot be zero after freeing.

Noted by: mjg
MFC after: 2 weeks


# 8c8ee2ee 04-Mar-2018 Konstantin Belousov <kib@FreeBSD.org>

Unify bulk free operations in several pmaps.

Submitted by: Yoshihiro Ota
Reviewed by: markj
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D13485


# 9140bff7 23-Feb-2018 Mark Johnston <markj@FreeBSD.org>

Remove a bogus assertion from vm_page_launder().

After r328977, a wired page m may have m->queue != PQ_NONE.

Reviewed by: kib
X-MFC with: r328977
Differential Revision: https://reviews.freebsd.org/D14485


# 5f8cd1c0 23-Feb-2018 Jeff Roberson <jeff@FreeBSD.org>

Add a generic Proportional Integral Derivative (PID) controller algorithm and
use it to regulate page daemon output.

This provides much smoother and more responsive page daemon output, anticipating
demand and avoiding pageout stalls by increasing the number of pages to match
the workload. This is a reimplementation of work done by myself and mlaier at
Isilon.

Reviewed by: bsdimp
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14402
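
For readers unfamiliar with the technique, a textbook PID step looks roughly
like the sketch below. It uses floating point for brevity and is not the
kernel's controller; pid_sketch, pid_step() and the gain names are assumptions
chosen for illustration.

```c
#include <assert.h>

struct pid_sketch {
	double kp, ki, kd;	/* proportional, integral, derivative gains */
	double setpoint;	/* target value of the measured quantity */
	double integral;	/* accumulated error */
	double prev_error;	/* error seen on the previous step */
};

/* Advance the controller by one interval dt; return its output. */
static double
pid_step(struct pid_sketch *p, double input, double dt)
{
	double error, derivative;

	assert(dt > 0.0);
	error = p->setpoint - input;
	p->integral += error * dt;
	derivative = (error - p->prev_error) / dt;
	p->prev_error = error;
	return (p->kp * error + p->ki * p->integral + p->kd * derivative);
}
```

The proportional term reacts to the current shortfall, the integral term to
shortfall that has persisted, and the derivative term to how quickly it is
changing, which is what lets such a controller anticipate demand.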


# 2c0f13aa 20-Feb-2018 Konstantin Belousov <kib@FreeBSD.org>

vm_wait() rework.

Make vm_wait() take the vm_object argument which specifies the domain
set to wait for the min condition pass. If there is no object
associated with the wait, use curthread's policy domainset. The
mechanics of the wait in vm_wait() and vm_wait_domain() is supplied by
the new helper vm_wait_doms(), which directly takes the bitmask of the
domains to wait for passing min condition.

Eliminate pagedaemon_wait(). vm_domain_clear() handles the same
operations.

Eliminate VM_WAIT and VM_WAITPFAULT macros, the direct functions calls
are enough.

Eliminate several control state variables from vm_domain, unneeded
after the vm_wait() conversion.

Sketched and reviewed by: jeff
Tested by: pho
Sponsored by: The FreeBSD Foundation, Mellanox Technologies
Differential revision: https://reviews.freebsd.org/D14384


# e958ad4c 12-Feb-2018 Jeff Roberson <jeff@FreeBSD.org>

Make v_wire_count a per-cpu counter(9) counter. This eliminates a
significant source of cache line contention from vm_page_alloc(). Use
accessors and vm_page_unwire_noq() so that the mechanism can be easily
changed in the future.

Reviewed by: markj
Discussed with: kib, glebius
Tested by: pho (earlier version)
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14273


# f7d35785 08-Feb-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Fix boot_pages exhaustion on machines with many domains and cores, where
size of UMA zone allocation is greater than page size. In this case zone
of zones can not use UMA_MD_SMALL_ALLOC, and we need to postpone switch
off of this zone from startup_alloc() until full launch of VM.

o Always supply number of VM zones to uma_startup_count(). On machines
with UMA_MD_SMALL_ALLOC ignore it completely, unless zsize goes over
a page. In the latter case account VM zones for number of allocations
from the zone of zones.
o Rewrite startup_alloc() so that it will immediately switch off from
itself any zone that is already capable of running real alloc.
In worst case scenario we may leak a single page here. See comment
in uma_startup_count().
o Hardcode call to uma_startup2() into vm_mem_init(). Otherwise some
extra SYSINITs, e.g. vm_page_init() may sneak in before.
o While here, remove uma_boot_pages_mtx. With recent changes to boot
pages calculation, we are guaranteed to use all of the boot_pages
in the early single threaded stage.

Reported & tested by: mav


# 5073a083 07-Feb-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Fix three miscalculations in amount of boot pages:

o Most of startup zones have struct uma_slab embedded into the slab,
so provide macro UMA_SLAB_SPACE and use it instead of UMA_SLAB_SIZE,
when calculating how many pages would certain kind of allocations
require. Some zones are offpage, so we might have a positive inaccuracy.
o The keg for the zone of zones is allocated "dynamically", so we
need +1 when calculating amount of pages for kegs. [1]
o The zones of zones and zones of kegs have arbitrary alignment of 32,
and this also needs to be accounted for. [2]

While here, spread more comments and improve diagnostic messages.

Reported by: pho [1], jtl [2]


# 1d3a1bcf 07-Feb-2018 Mark Johnston <markj@FreeBSD.org>

Dequeue wired pages lazily.

Previously, wiring a page would cause it to be removed from its page
queue. In the common case, unwiring causes it to be enqueued at the tail
of that page queue. This change modifies vm_page_wire() to not dequeue
the page, thus avoiding the highly contended page queue locks. Instead,
vm_page_unwire() takes care of requeuing the page as a single operation,
and the page daemon dequeues wired pages as they are encountered during
a queue scan to avoid needlessly revisiting them later. For pages in
PQ_ACTIVE we do even better, since a requeue is unnecessary.

The change improves scalability for some common workloads. For instance,
threads wiring pages into the buffer cache no longer need to modify
global page queues, and unwiring is usually done by the bufspace thread,
so concurrency is not as much of an issue. As another example, many
sysctl handlers wire the output buffer to avoid faults on copyout, and
since the buffer is likely to be in PQ_ACTIVE, we now entirely avoid
modifying the page queue in this case.

The change also adds a block comment describing some properties of
struct vm_page's reference counters, and the busy lock.

Reviewed by: jeff
Discussed with: alc, kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D11943


# e2068d0b 06-Feb-2018 Jeff Roberson <jeff@FreeBSD.org>

Use per-domain locks for vm page queue free. Move paging control from
global to per-domain state. Protect reservations with the free lock
from the domain that they belong to. Refactor to make vm domains more
of a first class object.

Reviewed by: markj, kib, gallatin
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14000


# ae941b1b 06-Feb-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Fix boot_pages calculation for machines that don't have UMA_MD_SMALL_ALLOC.

o Call uma_startup1() after initializing kmem, vmem and domains.
o Include eight VM startup pages into uma_startup_count() calculation.
o Account for vmem_startup() and vm_map_startup() preallocating pages.
o Account for extra two allocations done by kmem_init() and vmem_create().
o Hardcode the place of execution of vm_radix_reserve_kva(). Using SYSINIT
allowed several other SYSINITs to sneak in before it, thus bumping
requirement for amount of boot pages.


# f4bef67c 05-Feb-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Followup on r302393 by cperciva, improving calculation of boot pages required
for UMA startup.

o Introduce another stage of UMA startup, which is entered after
vm_page_startup() finishes. After this stage we don't yet enable buckets,
but we can ask VM for pages. Rename stages to meaningful names while here.
New list of stages: BOOT_COLD, BOOT_STRAPPED, BOOT_PAGEALLOC, BOOT_BUCKETS,
BOOT_RUNNING.
Enabling page alloc earlier allows us to dramatically reduce number of
boot pages required. More importantly, the number of zones becomes
consistent across different machines, as no MD allocations are done before
the BOOT_PAGEALLOC stage. Now only UMA internal zones actually need to use
startup_alloc(), however that may change, so vm_page_startup() provides
its need for early zones as argument.
o Introduce uma_startup_count() function, to avoid code duplication. The
function calculates sizes of zones zone and kegs zone, and calculates how
many pages UMA will need to bootstrap.
It counts not only zone structures, but also kegs, slabs and hashes.
o Hide uma_startup_foo() declarations from public file.
o Provide several DIAGNOSTIC printfs on boot_pages usage.
o Bugfix: when calculating zone of zones size use (mp_maxid + 1) instead of
mp_ncpus. Use resulting number not only in the size argument to zone_ctor()
but also as args.size.

Reviewed by: imp, gallatin (earlier version)
Differential Revision: https://reviews.freebsd.org/D14054


# 9a8196ce 19-Jan-2018 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

Remove SFBUF_OPTIONAL_DIRECT_MAP and such hacks, replacing them across the
kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv, and
powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can be
evaluated at runtime to determine if the architecture has a direct map;
if it does not (or does) unconditionally and PMAP_HAS_DMAP is either 0 or
1, the compiler can remove the conditional logic.

As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which had
similar things but spelled differently. 32-bit MIPS has a partial direct-map
that maps poorly to this concept and is unchanged.

Reviewed by: kib
Suggestions from: marius, alc, kib
Runtime tested on: amd64, powerpc64, powerpc, mips64


# 3f289c3f 12-Jan-2018 Jeff Roberson <jeff@FreeBSD.org>

Implement 'domainset', a cpuset based NUMA policy mechanism. This allows
userspace to control NUMA policy administratively and programmatically.

Implement domainset based iterators in the page layer.

Remove the now legacy numa_* syscalls.

Clean up some header pollution created by having seq.h in proc.h.

Reviewed by: markj, kib
Discussed with: alc
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13403


# 280d15cd 24-Dec-2017 Mark Johnston <markj@FreeBSD.org>

Fix two problems with the page daemon control loop.

Both issues caused the page daemon to erroneously go to sleep when
applications are consuming free pages at a high rate, leaving the
application threads blocked in VM_WAIT.

1) After completing an inactive queue scan, concurrent allocations may
have prevented the page daemon from meeting the v_free_min threshold.
In this case, the page daemon was going to sleep even when the
inactive queue contained plenty of clean pages.
2) pagedaemon_wakeup() may be called without the free queues lock held.
This can lead to a lost wakeup if a call occurs after the page daemon
clears vm_pageout_wanted but before going to sleep.

Fix 1) by ensuring that we start a new inactive queue scan immediately
if v_free_count < v_free_min after a prior scan.

Fix 2) by adding a new subroutine, pagedaemon_wait(), called from
vm_wait() and vm_waitpfault(). It wakes up the page daemon if either
vm_pages_needed or vm_pageout_wanted is false, and atomically sleeps
on v_free_count.

Reported by: jeff
Reviewed by: alc
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D13424


# 0db2102a 04-Dec-2017 Michael Zhilin <mizhka@FreeBSD.org>

[mips] [vm] restore translation of freelist to flind for page allocation

Commit r326346 moved domain iterators from physical layer to vm_page one,
but it also removed translation of freelist to flind for
vm_page_alloc_freelist() call. Before, it expected a VM_FREELIST_ parameter,
but after, it expects a freelist index.

On small WiFi boxes with few megabytes of RAM, there is only one freelist
VM_FREELIST_LOWMEM (1) and there is no VM_FREELIST_DEFAULT(0) (see file
sys/mips/include/vmparam.h). It results in freelist 1 with flind 0.

At first, this commit renames flind to freelist in vm_page_alloc_freelist
to avoid misunderstanding about input parameters. Then on physical layer it
restores translation for correct handling of freelist parameter.

Reported by: landonf
Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D13351
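
The freelist-to-flind distinction can be illustrated with the sketch below. The
VM_FREELIST_DEFAULT and VM_FREELIST_LOWMEM values are those quoted above; the
table freelist_to_flind, the helpers build_flind_map() and freelist_lookup(),
and the use of -1 for an absent free list are assumptions of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define VM_FREELIST_DEFAULT	0
#define VM_FREELIST_LOWMEM	1
#define VM_NFREELIST		2

/* Maps a VM_FREELIST_* constant to a dense index, or -1 if absent. */
static int freelist_to_flind[VM_NFREELIST];

/* Give present free lists consecutive indices; return how many exist. */
static int
build_flind_map(const bool present[VM_NFREELIST])
{
	int freelist, flind;

	flind = 0;
	for (freelist = 0; freelist < VM_NFREELIST; freelist++)
		freelist_to_flind[freelist] = present[freelist] ? flind++ : -1;
	return (flind);
}

/* Translate the caller's VM_FREELIST_* argument before using it. */
static int
freelist_lookup(int freelist)
{
	assert(freelist >= 0 && freelist < VM_NFREELIST);
	return (freelist_to_flind[freelist]);
}
```

On a machine with only the low-memory free list, VM_FREELIST_LOWMEM (1)
translates to index 0, so passing the constant through untranslated would index
a free list that does not exist.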


# 796df753 30-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

SPDX: Consider code from Carnegie-Mellon University.

Interesting cases, most likely from CMU Mach sources.


# 1084894f 29-Nov-2017 Mark Johnston <markj@FreeBSD.org>

Remove some comments that became incorrect with r325530.


# ef435ae7 28-Nov-2017 Jeff Roberson <jeff@FreeBSD.org>

Move domain iterators into the page layer where domain selection should take
place. This makes the majority of the phys layer explicitly domain specific.

Reviewed by: markj, kib (some objections)
Discussed with: alc
Tested by: pho
Sponsored by: Netflix & Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13014


# d2b677ce 27-Nov-2017 Mark Johnston <markj@FreeBSD.org>

Avoid unnecessary lookups when initializing the vm_page array.

This gives a marginal improvement in the vm_page_array initialization
time. Also garbage-collect the now-unused vm_phys_paddr_to_segind().

Reviewed by: alc, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D13270


# b20bf182 26-Nov-2017 Mark Johnston <markj@FreeBSD.org>

Move vm_phys_init_page() to vm_page.c.

Suggested by: kib
Reviewed by: alc, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D13250


# 5070d56d 21-Nov-2017 Mark Johnston <markj@FreeBSD.org>

Allow for fictitious physical pages in vm_page_scan_contig().

Some drm2 drivers will set PG_FICTITIOUS in physical pages in order to
satisfy the OBJT_MGTDEVICE object interface, so a scan may encounter
fictitous pages. For now, allow for this possibility; such pages will be
skipped later in the scan since they are wired.

Reported by: avg
Reviewed by: kib
MFC after: 1 week


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# 8d6fbbb8 07-Nov-2017 Jeff Roberson <jeff@FreeBSD.org>

Replace many instances of VM_WAIT with blocking page allocation flags
similar to the kernel memory allocator.

This simplifies NUMA allocation because the domain will be known at wait
time and races between failure and sleeping are eliminated. This also
reduces boilerplate code and simplifies callers.

A wait primitive is supplied for uma zones for similar reasons. This
eliminates some non-specific VM_WAIT calls in favor of more explicit
sleeps that may be satisfied without new pages.

Reviewed by: alc, kib, markj
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon


# 3a757e54 24-Oct-2017 Alan Cox <alc@FreeBSD.org>

Micro-optimize the handling of fictitious pages in vm_page_free_prep().
A fictitious page is always wired, so there is no point in trying to
remove one from the page queues.

Completely remove one inaccurate comment from vm_page_free_prep() and
correct another.

Reviewed by: kib, markj
MFC after: 1 week


# 422fe502 21-Oct-2017 Konstantin Belousov <kib@FreeBSD.org>

Check that the page which is freed as zeroed, indeed has all-zero content.

This catches some rare mysterious failures at the source. The check
is only performed on architectures which implement direct map, and
only enabled with option DIAGNOSTIC, similar to other costly
consistency checks.

Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 43139893 20-Oct-2017 Konstantin Belousov <kib@FreeBSD.org>

In vm_page_free_phys_pglist(), do not take vm_page_queue_free_mtx if
there is nothing to do.

Suggested by: mjg
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 1dbf52e7 13-Oct-2017 Mateusz Guzik <mjg@FreeBSD.org>

Reduce traffic on vm_cnt.v_free_count

The variable is modified with the highly contended page free queue lock.
It unnecessarily shares a cacheline with purely read-only fields and is
re-read after the lock is dropped in the page allocation code making the
hold time longer.

Pad the variable just like the others and store the value as found with
the lock held instead of re-reading.

Provides a modest 1%-ish speed up in concurrent page faults.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D12665
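
Both halves of this change can be sketched in a few lines, assuming a 64-byte
cache line and C11 alignas in place of the kernel's padding macros. The layout
of vmmeter_sketch and the helper page_alloc_one() are hypothetical; the lock is
indicated only by comments.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64	/* assumed; varies by architecture */

struct vmmeter_sketch {
	unsigned v_page_size;	/* read-mostly */
	unsigned v_page_count;	/* read-mostly */
	/* Frequently written field gets a cache line of its own. */
	alignas(CACHE_LINE_SIZE) unsigned v_free_count;
	/* Later read-mostly fields start on the following line. */
	alignas(CACHE_LINE_SIZE) unsigned v_free_min;
};

/*
 * Take one page and return the free count as observed while the
 * lock was held, so the caller need not re-read the shared field.
 */
static unsigned
page_alloc_one(struct vmmeter_sketch *cnt)
{
	unsigned free_count;

	/* lock would be acquired here */
	assert(cnt->v_free_count > 0);
	free_count = --cnt->v_free_count;
	/* lock would be released here */
	return (free_count);
}
```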


# 43cc906f 24-Sep-2017 Alan Cox <alc@FreeBSD.org>

Change vm_page_try_to_free() to require a managed page. Essentially,
vm_page_try_to_free() is testing conditions, like clean versus dirty,
that only vary in managed pages.

Suggested by: kib
Reviewed by: markj
X-MFC after: never


# 494c6e43 24-Sep-2017 Alan Cox <alc@FreeBSD.org>

Optimize vm_page_try_to_free(). Specifically, the call to pmap_remove_all()
can be avoided when the page's containing object has a reference count of
zero. (If the object has a reference count of zero, then none of its pages
can possibly be mapped.)

Address nearby style issues in vm_page_try_to_free(), and change its
return type to "bool".

Reviewed by: kib, markj
MFC after: 1 week


# e82e50e6 13-Sep-2017 Konstantin Belousov <kib@FreeBSD.org>

Remove inline specifier from vm_page_free_wakeup(), do not
micro-manage compiler.

Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 540ac3b3 13-Sep-2017 Konstantin Belousov <kib@FreeBSD.org>

Split vm_page_free_toq() into two parts, preparation vm_page_free_prep()
and insertion into the phys allocator free queues vm_page_free_phys().
Also provide a wrapper vm_page_free_phys_pglist() for batched free.

Reviewed by: alc, markj
Tested by: mjg (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 2934eb8a 13-Sep-2017 Mark Johnston <markj@FreeBSD.org>

Fix a logic error in the item size calculation for internal UMA zones.

Kegs for internal zones always keep the slab header in the slab itself.
Therefore, when determining the allocation size, we need to take the
slab header size into account.

Reported and tested by: ae, rakuco
Reviewed by: avg
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D12342


# 93c5d3a4 09-Sep-2017 Konstantin Belousov <kib@FreeBSD.org>

Add a vm_page_change_lock() helper, the common code to not relock page
lock if both old and new pages use the same underlying lock. Convert
existing places to use the helper instead of inlining it. Use the
optimization in vm_object_page_remove().

Suggested and reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
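
The helper's logic can be sketched as follows, assuming a toy lock that records
acquisitions. mtx_sketch and page_change_lock() are hypothetical names; the
sketch takes the new lock pointer directly, whereas a real helper would derive
it from the page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mtx_sketch {
	bool held;
	unsigned acquisitions;
};

static void
mtx_lock_sketch(struct mtx_sketch *m)
{
	assert(!m->held);
	m->held = true;
	m->acquisitions++;
}

static void
mtx_unlock_sketch(struct mtx_sketch *m)
{
	assert(m->held);
	m->held = false;
}

/*
 * Switch from the currently held lock (*held, possibly NULL) to
 * newlock, doing nothing when both are the same underlying lock.
 */
static void
page_change_lock(struct mtx_sketch *newlock, struct mtx_sketch **held)
{
	if (*held == newlock)
		return;
	if (*held != NULL)
		mtx_unlock_sketch(*held);
	*held = newlock;
	mtx_lock_sketch(newlock);
}
```

When consecutive pages hash to the same lock, a loop that calls the helper for
each page performs no unlock and relock pairs at all.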


# f93f7cf1 07-Sep-2017 Mark Johnston <markj@FreeBSD.org>

Speed up vm_page_array initialization.

We currently initialize the vm_page array in three passes: one to zero
the array, one to initialize the "order" field of each page (necessary
when inserting them into the vm_phys buddy allocator one-by-one), and
one to initialize the remaining non-zero fields and individually insert
each page into the allocator.

Merge the three passes into one following a suggestion from alc:
initialize vm_page fields in a single pass, and use vm_phys_free_contig()
to efficiently insert physical memory segments into the buddy allocator.
This reduces the initialization time to a third or a quarter of what it
was before on most systems that I tested.

Reviewed by: alc, kib
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D12248


# fe933c1d 06-Sep-2017 Mateusz Guzik <mjg@FreeBSD.org>

Start annotating global _padalign locks with __exclusive_cache_line

While these locks are guaranteed to not share their respective cache lines,
their current placement leaves unnecessary holes in lines which preceded them.

For instance the annotation of vm_page_queue_free_mtx allows 2 neighbour
cachelines (previously separated by the lock) to be collapsed into 1.

The annotation is only effective on architectures which have it implemented in
their linker script (currently only amd64). Thus locks are not converted to
their not-padaligned variants as to not affect the rest.

MFC after: 1 week


# 33fff5d5 15-Aug-2017 Mark Johnston <markj@FreeBSD.org>

Add vm_page_alloc_after().

This is a variant of vm_page_alloc() which accepts an additional parameter:
the page in the object with largest index that is smaller than the requested
index. vm_page_alloc() finds this page using a lookup in the object's radix
tree, but in some cases its identity is already known, allowing the lookup
to be elided.

Modify kmem_back() and vm_page_grab_pages() to use vm_page_alloc_after().
vm_page_alloc() is converted into a trivial wrapper of
vm_page_alloc_after().

Suggested by: alc
Reviewed by: alc, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D11984
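
The lookup elision can be pictured with a sorted linked list standing in for
the object's radix tree. Everything named below (object_sketch, page_sketch,
object_lookup_pred(), page_insert_after(), page_insert()) is hypothetical; only
the idea of passing a known predecessor comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>

struct page_sketch {
	unsigned long pindex;
	struct page_sketch *next;
};

struct object_sketch {
	struct page_sketch *head;	/* sorted by ascending pindex */
	unsigned lookups;		/* predecessor searches performed */
};

/* Find the page with the largest index smaller than pindex, or NULL. */
static struct page_sketch *
object_lookup_pred(struct object_sketch *obj, unsigned long pindex)
{
	struct page_sketch *m, *pred;

	obj->lookups++;
	pred = NULL;
	for (m = obj->head; m != NULL && m->pindex < pindex; m = m->next)
		pred = m;
	return (pred);
}

/* Insert m at pindex given its known predecessor (NULL means front). */
static void
page_insert_after(struct object_sketch *obj, struct page_sketch *m,
    unsigned long pindex, struct page_sketch *mpred)
{
	assert(mpred == NULL || mpred->pindex < pindex);
	m->pindex = pindex;
	if (mpred == NULL) {
		m->next = obj->head;
		obj->head = m;
	} else {
		m->next = mpred->next;
		mpred->next = m;
	}
}

/* Trivial wrapper: find the predecessor, then insert after it. */
static void
page_insert(struct object_sketch *obj, struct page_sketch *m,
    unsigned long pindex)
{
	page_insert_after(obj, m, pindex, object_lookup_pred(obj, pindex));
}
```

A caller filling consecutive indices can pass the page it just inserted as the
predecessor of the next one, so only the first insertion pays for a search.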


# 9df950b3 11-Aug-2017 Mark Johnston <markj@FreeBSD.org>

Modify vm_page_grab_pages() to handle VM_ALLOC_NOWAIT.

This will allow its use in sendfile_swapin().

Reviewed by: alc, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D11942


# 2c642ec1 10-Aug-2017 Mark Johnston <markj@FreeBSD.org>

Make vm_page_sunbusy() assert that the page is unlocked.

Reviewed by: kib
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D11946


# 5471caf6 08-Aug-2017 Alan Cox <alc@FreeBSD.org>

Introduce vm_page_grab_pages(), which is intended to replace loops calling
vm_page_grab() on consecutive page indices. Besides simplifying the code
in the caller, vm_page_grab_pages() allows for batching optimizations.
For example, the current implementation replaces calls to vm_page_lookup()
on consecutive page indices by cheaper calls to vm_page_next().

Reviewed by: kib, markj
Tested by: pho (an earlier version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D11926


# 1d3b9818 22-Jul-2017 Alan Cox <alc@FreeBSD.org>

In vm_page_ps_test(), always check that the base pages within the specified
superpage all belong to the same object. To date, that check has not been
needed, but upcoming changes require it. (See the Differential Revision.)

Reviewed by: kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11556


# 88302601 13-Jul-2017 Alan Cox <alc@FreeBSD.org>

Generalize vm_page_ps_is_valid() to support testing other predicates on
the (super)page, renaming the function to vm_page_ps_test().

Reviewed by: kib, markj
MFC after: 1 week


# 4bd7e351 08-Jun-2017 John Baldwin <jhb@FreeBSD.org>

Fix an off-by-one error in the VM page array on some systems.

r31386 changed how the size of the VM page array was calculated to be
less wasteful. For most systems, the amount of memory is divided by
the overhead required by each page (a page of data plus a struct vm_page)
to determine the maximum number of available pages. However, if the
remainder for the first non-available page was at least a page of data
(so that the only memory missing was a struct vm_page), this last page
was left in phys_avail[] but was not allocated an entry in the VM page
array. Handle this case by explicitly excluding the page from
phys_avail[].
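
The arithmetic behind this off-by-one can be illustrated with a small
sketch. The sizes and the toy_* helper names below are assumptions chosen
for illustration (the real sizeof(struct vm_page) varies by platform), and
this is not the code in vm_page_startup():

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative sizes only; not taken from any particular platform. */
#define TOY_PAGE_SIZE		4096u
#define TOY_VM_PAGE_SIZE	104u

/* Number of pages that get both a page of data and a struct vm_page. */
static size_t
toy_page_count(size_t bytes)
{
	return (bytes / (TOY_PAGE_SIZE + TOY_VM_PAGE_SIZE));
}

/*
 * True when the leftover memory can hold a full page of data but not the
 * struct vm_page that should describe it: the case the commit excludes
 * from phys_avail[].
 */
static bool
toy_has_orphan_page(size_t bytes)
{
	return (bytes % (TOY_PAGE_SIZE + TOY_VM_PAGE_SIZE) >= TOY_PAGE_SIZE);
}
```

With these assumed sizes, 46096 bytes yields 10 described pages plus a
remainder of exactly one page of data, so the orphan check fires; one byte
less and it does not.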

Reviewed by: alc
Sponsored by: DARPA / AFRL
Differential Revision: https://reviews.freebsd.org/D11000


# 83c9dea1 17-Apr-2017 Gleb Smirnoff <glebius@FreeBSD.org>

- Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter
in place. To do per-cpu stats, convert all fields that previously were
maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
before we have set up UMA and we can do counter_u64_alloc(), provide an
early counter mechanism:
o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
so that at early stages of boot, before counters are allocated we already
point to a counter that can be safely written to.
o For sparc64 that required a whole dummy pcpu[MAXCPU] array.

Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
to match kernel representation.
- struct vmmeter is hidden under _KERNEL; vmstat(1) is the only exception.

This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html

Reviewed by: kib, gallatin, marius, lidl
Differential Revision: https://reviews.freebsd.org/D10156


# fbbd9655 28-Feb-2017 Warner Losh <imp@FreeBSD.org>

Renumber copyright clause 4

Renumber clause 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistence on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96


# 8a99f1cc 03-Feb-2017 Alan Cox <alc@FreeBSD.org>

Over the years, the code and comments in vm_page_startup() have diverged in
one respect. When determining how many page structures to allocate,
contrary to what the comments say, the code does not account for the
overhead of a page structure per page of physical memory. This revision
changes the code to match the comments.

Reviewed by: kib, markj
MFC after: 6 weeks
Differential Revision: https://reviews.freebsd.org/D9081


# c2655a40 14-Jan-2017 Mark Johnston <markj@FreeBSD.org>

Avoid unnecessary page lookups in vm_object_madvise().

vm_object_madvise() is frequently used to apply advice to a contiguous
set of pages in an object with no backing object. Optimize this case by
skipping non-resident subranges in constant time, and by iterating over
resident pages using the object memq, thus avoiding radix tree lookups on
each page index in the specified range.

While here, move MADV_WILLNEED handling to vm_page_advise(), and rename the
"advise" parameter to vm_object_madvise() to "advice."

Reviewed by: alc, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D9098


# bfc8c24c 04-Jan-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Move bogus_page declaration to vm_page.h and initialization to vm_page.c.

Reviewed by: kib


# b1fd102e 02-Jan-2017 Mark Johnston <markj@FreeBSD.org>

Add a page queue for holding dirty anonymous unswappable pages.

On systems without a configured swap device, an attempt to launder pages
from a swap object will always fail and result in the page being
reactivated. This means that the page daemon will continuously scan pages
that can never be evicted. With this change, anonymous pages are instead
moved to PQ_UNSWAPPABLE after a failed laundering attempt when no swap
devices are configured. PQ_UNSWAPPABLE is not scanned unless a swap device
is configured, so unreferenced unswappable pages are excluded from the page
daemon's workload.

Reviewed by: alc


# 0c8bd6a7 30-Dec-2016 Konstantin Belousov <kib@FreeBSD.org>

Assert that the pages found on the object queue by vm_page_next() and
vm_page_prev() have correct ownership.

In collaboration with: alc
Sponsored by: The FreeBSD Foundation (kib)
MFC after: 1 week


# 920da7e4 28-Dec-2016 Alan Cox <alc@FreeBSD.org>

Relax the object type restrictions on vm_page_alloc_contig(). Specifically,
add support for object types that were previously prohibited because they
could contain PG_CACHED pages.

Roughly halve the number of radix trie operations performed by
vm_page_alloc_contig() using the same approach that is employed by
vm_page_alloc(). Also, eliminate the radix trie lookup performed with the
free page queues lock held.

Tidy up the handling of radix trie insert failures in vm_page_alloc() and
vm_page_alloc_contig().

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8878


# 3453bca8 12-Dec-2016 Alan Cox <alc@FreeBSD.org>

Eliminate every mention of PG_CACHED pages from the comments in the machine-
independent layer of the virtual memory system. Update some of the nearby
comments to eliminate redundancy and improve clarity.

In vm/vm_reserv.c, do not use hyphens after adverbs ending in -ly per
The Chicago Manual of Style.

Update the comment in vm/vm_page.h defining the four types of page queues to
reflect the elimination of PG_CACHED pages and the introduction of the
laundry queue.

Reviewed by: kib, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8752


# e94965d8 07-Dec-2016 Alan Cox <alc@FreeBSD.org>

Previously, vm_radix_remove() would panic if the radix trie didn't
contain a vm_page_t at the specified index. However, with this
change, vm_radix_remove() no longer panics. Instead, it returns NULL
if there is no vm_page_t at the specified index. Otherwise, it
returns the vm_page_t. The motivation for this change is that it
simplifies the use of radix tries in the amd64, arm64, and i386 pmap
implementations. Instead of performing a lookup before every remove,
the pmap can simply perform the remove.
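
The caller-side simplification can be sketched with a toy container. The
struct toy_trie type and toy_remove() below are hypothetical stand-ins, not
the vm_radix API; they only model the "return NULL instead of panicking"
contract described above:

```c
#include <stddef.h>

#define TOY_SLOTS	8

/* Hypothetical stand-in for a radix trie keyed by page index. */
struct toy_trie {
	void *slot[TOY_SLOTS];
};

/*
 * New-style remove: return the stored entry, or NULL if nothing is
 * stored at the index.  Callers no longer need a lookup beforehand.
 */
static void *
toy_remove(struct toy_trie *t, size_t index)
{
	void *old;

	if (index >= TOY_SLOTS)
		return (NULL);
	old = t->slot[index];
	t->slot[index] = NULL;
	return (old);
}
```

A caller can then branch on the return value rather than pairing every
remove with a preceding lookup.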

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D8708


# ba673696 26-Nov-2016 Alan Cox <alc@FreeBSD.org>

Recursion on the free page queue mutex occurred when UMA needed to allocate
a new page of radix trie nodes to complete a vm_radix_insert() operation
that was requested by vm_page_cache(). Specifically, vm_page_cache()
already held the free page queue lock when UMA tried to acquire it through
a call to vm_page_alloc(). This code path no longer exists, so there is no
longer any reason to allow recursion on the free page queue mutex.

Improve nearby comments.

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8628


# bba39b9a 22-Nov-2016 Alan Cox <alc@FreeBSD.org>

Remove PG_CACHED-related fields from struct vmmeter, because they are no
longer used. More precisely, they are always zero because the code that
decremented and incremented them no longer exists.

Bump __FreeBSD_version to mark this change.

Reviewed by: kib, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8583


# 7667839a 15-Nov-2016 Alan Cox <alc@FreeBSD.org>

Remove most of the code for implementing PG_CACHED pages. (This change does
not remove user-space visible fields from vm_cnt or all of the references to
cached pages from comments. Those changes will come later.)

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8497


# ebcddc72 09-Nov-2016 Alan Cox <alc@FreeBSD.org>

Introduce a new page queue, PQ_LAUNDRY, for storing unreferenced, dirty
pages, specifically, dirty pages that have passed once through the inactive
queue. A new, dedicated thread is responsible for both deciding when to
launder pages and actually laundering them. The new policy uses the
relative sizes of the inactive and laundry queues to determine whether to
launder pages at a given point in time. In general, this leads to more
intelligent swapping behavior, since the laundry thread will avoid pageouts
when the marginal benefit of doing so is low. Previously, without a
dedicated queue for dirty pages, the page daemon didn't have the information
to determine whether pageout provides any benefit to the system. Thus, the
previous policy often resulted in small but steadily increasing amounts of
swap usage when the system is under memory pressure, even when the inactive
queue consisted mostly of clean pages. This change addresses that issue,
and also paves the way for some future virtual memory system improvements by
removing the last source of object-cached clean pages, i.e., PG_CACHED pages.

The new laundry thread sleeps while waiting for a request from the page
daemon thread(s). A request is raised by setting the variable
vm_laundry_request and waking the laundry thread. We request launderings
for two reasons: to try and balance the inactive and laundry queue sizes
("background laundering"), and to quickly make up for a shortage of free
pages and clean inactive pages ("shortfall laundering"). When background
laundering is requested, the laundry thread computes the number of page
daemon wakeups that have taken place since the last laundering. If this
number is large enough relative to the ratio of the laundry and (global)
inactive queue sizes, we will launder vm_background_launder_target pages at
vm_background_launder_rate KB/s. Otherwise, the laundry thread goes back
to sleep without doing any work. When scanning the laundry queue during
background laundering, reactivated pages are counted towards the laundry
thread's target.

In contrast, shortfall laundering is requested when an inactive queue scan
fails to meet its target. In this case, the laundry thread attempts to
launder enough pages to meet v_free_target within 0.5s, which is the
inactive queue scan period.

A laundry request can be latched while another is currently being
serviced. In particular, a shortfall request will immediately preempt a
background laundering.

This change also redefines the meaning of vm_cnt.v_reactivated and removes
the functions vm_page_cache() and vm_page_try_to_cache(). The new meaning
of vm_cnt.v_reactivated now better reflects its name. It represents the
number of inactive or laundry pages that are returned to the active queue
on account of a reference.

In collaboration with: markj
Reviewed by: kib
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8302


# 1771e987 03-Nov-2016 Konstantin Belousov <kib@FreeBSD.org>

Do not sleep in vm_wait() if the pagedaemon has not yet started. Panic instead.

Requests which cannot be satisfied by allocators at boot time often
have unrealizable parameters. Waiting for the pagedaemon to start would
hang the boot if done in the thread0 context and would just never succeed
if executed from another thread. In fact, for very early stages, a sleep
attempt panics with an obscure diagnostic about the scheduler state, and
an explicit panic in vm_wait() makes the investigation much shorter by
cutting off the examination of the thread and scheduler.

Theoretically, some subsystem might grab a resource to exhaustion, and
free it later in the boot process. If this unlikely scenario does
appear for real, the way to diagnose the trouble can be revisited.

Reported by: emaste
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D8421


# bd9546a2 17-Oct-2016 Konstantin Belousov <kib@FreeBSD.org>

Export vm_page_xunbusy_maybelocked().

Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
X-Differential revision: https://reviews.freebsd.org/D8197


# 5975e53d 13-Oct-2016 Konstantin Belousov <kib@FreeBSD.org>

Fix a race in vm_page_busy_sleep(9).

Suppose that we have an exclusively busy page, and a thread which can
accept a shared-busy page. In this case, typical code waiting for the
page xbusy state to pass is
again:
        VM_OBJECT_WLOCK(object);
        ...
        if (vm_page_xbusied(m)) {
                vm_page_lock(m);
                VM_OBJECT_WUNLOCK(object);      <---1
                vm_page_busy_sleep(m, "vmopax");
                goto again;
        }

Suppose that the xbusy state owner locked the object, unbusied the
page and unlocked the object after we are at the line [1], but before we
executed the load of the busy_lock word in vm_page_busy_sleep(). If it
happens that there are still no waiters recorded for the busy state,
the xbusy owner did not acquire the page lock, so it proceeded.

Further, suppose that some other thread happens to share-busy the page
after the xbusy state was relinquished but before m->busy_lock is read
in vm_page_busy_sleep(). Again, that thread only needs the vm_object lock
to proceed. Then, vm_page_busy_sleep() reads a busy_lock value equal to
VPB_SHARERS_WORD(1).

In this case, all tests in vm_page_busy_sleep(9) pass and we are going
to sleep, despite the page being share-busied.

Update check for m->busy_lock == VPB_UNBUSIED in vm_page_busy_sleep(9)
to also accept shared-busy state if we only wait for the xbusy state to
pass.

Merge sequential if()s with the same 'then' clause in
vm_page_busy_sleep().

Note that the current code does not share-busy pages from parallel
threads; the only way to have more than one sbusy owner right now
is to recurse.

Reported and tested by: pho (previous version)
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D8196


# 267ed8e2 11-Oct-2016 Konstantin Belousov <kib@FreeBSD.org>

When downgrading an exclusively busied page to shared-busy state, wake up
waiters. Otherwise, owners of the shared-busy state are left blocked
and might get into a deadlock.

Note that the vm_page_busy_downgrade() function is not used in the
tree right now.

Reported and tested by: pho (previous version)
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D8195


# 70cf3ced 05-Oct-2016 Alan Cox <alc@FreeBSD.org>

Make the page daemon's notion of what kind of pass is being performed
by vm_pageout_scan() local to vm_pageout_worker(). There is no reason
to store the pass in the NUMA domain structure.

Reviewed by: kib
MFC after: 3 weeks


# dbbaf04f 03-Sep-2016 Mark Johnston <markj@FreeBSD.org>

Remove support for idle page zeroing.

Idle page zeroing has been disabled by default on all architectures since
r170816 and has some bugs that make it seemingly unusable. Specifically,
the idle-priority pagezero thread exacerbates contention for the free page
lock, and yields the CPU without releasing it in non-preemptive kernels. The
pagezero thread also does not behave correctly when superpage reservations
are enabled: its target is a function of v_free_count, which includes
reserved-but-free pages, but it is only able to zero pages belonging to the
physical memory allocator.

Reviewed by: alc, imp, kib
Differential Revision: https://reviews.freebsd.org/D7714


# 915d1b71 29-Aug-2016 Mark Johnston <markj@FreeBSD.org>

Restore swap pager readahead after r292373.

The removal of vm_fault_additional_pages() meant that a hard fault on
a swap-backed page would result in only that page being read in. This
change implements readahead and readbehind for the swap pager in
swap_pager_getpages(). swap_pager_haspage() is modified to return the
largest contiguous non-resident range of pages containing the requested
range.

Reviewed by: alc, kib
Tested by: pho
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D7677


# 842ee21e 13-Aug-2016 Mark Johnston <markj@FreeBSD.org>

Strengthen assertions about the busy state of newly-allocated pages.

Reviewed by: alc
MFC after: 1 week


# 897d0c66 29-Jul-2016 Mark Johnston <markj@FreeBSD.org>

Use vm_page_undirty() instead of manually setting a page field.

Reviewed by: alc
MFC after: 3 days


# efe1ff4c 23-Jul-2016 Mark Johnston <markj@FreeBSD.org>

Update a comment in vm_page_advise() to match behaviour after r290529.

Reviewed by: alc
MFC after: 3 days


# 34caa842 07-Jul-2016 Colin Percival <cperciva@FreeBSD.org>

Autotune the number of pages set aside for UMA startup based on the number
of CPUs present. On amd64 this unbreaks the boot for systems with 92 or
more CPUs; the limit will vary on other systems depending on the size of
their uma_zone and uma_cache structures.

The major consumer of pages during UMA startup is the 19 zone structures
which are set up before UMA has bootstrapped itself sufficiently to use
the rest of the available memory: UMA Slabs, UMA Hash, 4 / 6 / 8 / 12 /
16 / 32 / 64 / 128 / 256 Bucket, vmem btag, VM OBJECT, RADIX NODE, MAP,
KMAP ENTRY, MAP ENTRY, VMSPACE, and fakepg. If the zone structures occupy
more than one page, they will not share pages and the number of pages
currently needed for startup is 19 * pages_per_zone + N, where N is the
number of pages used for allocating other structures; on amd64 N = 3 at
present (2 pages are allocated for UMA Kegs, and one page for UMA Hash).

This patch adds a new definition UMA_BOOT_PAGES_ZONES, currently set to 32,
and if a zone structure does not fit into a single page sets boot_pages to
UMA_BOOT_PAGES_ZONES * pages_per_zone instead of UMA_BOOT_PAGES (which
remains at 64). Consequently this patch has no effect on systems where the
zone structure fits into 2 or fewer pages (on amd64, 59 or fewer CPUs), but
increases boot_pages sufficiently on systems where the large number of CPUs
makes this structure larger. It seems safe to assume that systems with 60+
CPUs can afford to set aside an additional 128kB of memory per 32 CPUs.
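
The sizing rule described above can be written out as a short sketch. The
two constants come from this commit message; the toy_boot_pages() helper
and its parameters are hypothetical, and the kernel's actual computation
may be structured differently:

```c
#include <stddef.h>

#define UMA_BOOT_PAGES		64	/* default, per the commit message */
#define UMA_BOOT_PAGES_ZONES	32	/* new definition, per the commit message */

/*
 * Hypothetical helper: pick the number of startup pages from the size of
 * one zone structure.  zone_size and page_size are in bytes.
 */
static size_t
toy_boot_pages(size_t zone_size, size_t page_size)
{
	/* Round up to whole pages. */
	size_t pages_per_zone = (zone_size + page_size - 1) / page_size;

	if (pages_per_zone > 1)
		return (UMA_BOOT_PAGES_ZONES * pages_per_zone);
	return (UMA_BOOT_PAGES);
}
```

With a 4096-byte page, a zone structure of up to two pages still yields 64
startup pages, and each further page per zone adds 32 pages (128kB), which
matches the figures quoted in the paragraph above.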

The vm.boot_pages tunable continues to override this computation, but is
unlikely to be necessary in the future.

Tested on: EC2 x1.32xlarge
Relnotes: FreeBSD can now boot on 92+ CPU systems without requiring
vm.boot_pages to be manually adjusted.
Reviewed by: jeff, alc, adrian
Approved by: re (kib)


# 35e8002c 23-Jun-2016 Konstantin Belousov <kib@FreeBSD.org>

In vm_page_xunbusy_maybelocked(), add fast path for unbusy when no
waiters exist, same as for vm_page_xunbusy(). If previous value of
busy_lock was VPB_SINGLE_EXCLUSIVER, no waiters existed and wakeup is
not needed.

Move common code from vm_page_xunbusy_maybelocked() and
vm_page_xunbusy_hard() to vm_page_xunbusy_locked().

Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (gjb)


# 0a1dc6e2 02-Jun-2016 Mark Johnston <markj@FreeBSD.org>

Reset the page busy lock state after failing to insert into the object.

Freeing a shared-busy page is not permitted.

Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D6670


# e7052969 02-Jun-2016 Mark Johnston <markj@FreeBSD.org>

Don't preserve the page's object linkage in vm_page_insert_after().

Per the KASSERT at the beginning of the function, we expect that the page
does not belong to any object, so its object and pindex fields are
meaningless. Reset them in the rare case that vm_radix_insert() fails.

Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D6669


# e5f0191f 01-Jun-2016 Konstantin Belousov <kib@FreeBSD.org>

If the fast path unbusy in vm_page_replace() fails, the slow path needs to
acquire the page lock, which recurses. Avoid the recursion by reusing
the code from vm_page_remove() in a new helper
vm_page_xunbusy_maybelocked().

Reviewed by: alc
Sponsored by: The FreeBSD Foundation


# 56ce0690 27-May-2016 Alan Cox <alc@FreeBSD.org>

The flag "vm_pages_needed" has long served two distinct purposes: (1) to
indicate that threads are waiting for free pages to become available and
(2) to indicate whether a wakeup call has been sent to the page daemon.
The trouble is that a single flag cannot really serve both purposes, because
we have two distinct targets for when to wake up threads waiting for free
pages versus when the page daemon has completed its work. In particular,
the flag will be cleared by vm_page_free() before the page daemon has met
its target, and this can lead to the OOM killer being invoked prematurely.
To address this problem, a new flag "vm_pageout_wanted" is introduced.

Discussed with: jeff
Reviewed by: kib, markj
Tested by: markj
Sponsored by: EMC / Isilon Storage Division


# 0e384220 24-May-2016 Konstantin Belousov <kib@FreeBSD.org>

In vm_page_cache(), only drop the vnode after radix insert failure
for empty page cache when the object type is OBJT_VNODE.

Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 30a8a5f7 24-May-2016 Konstantin Belousov <kib@FreeBSD.org>

In vm_page_alloc_contig(), on vm_page_insert() failure, mark each
freed page as VPO_UNMANAGED. Otherwise vm_page_free_toq() insists on
owning the page lock.

Previously, VPO_UNMANAGED was only set up to the last processed page.

Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 5a2e650a 19-May-2016 Conrad Meyer <cem@FreeBSD.org>

vm/vm_page.h: Fix trivial '-Wpointer-sign' warning

pq_vcnt, as a count of real things, has no business being negative. It is only
ever initialized by a u_int counter.

The warning came from the atomic_add_int() in vm_pagequeue_cnt_add().

Rectify the warning by changing the variable to u_int. No functional change.

Suggested by: Clang 3.3
Sponsored by: EMC / Isilon Storage Division


# 0ef14902 29-Apr-2016 John Baldwin <jhb@FreeBSD.org>

Don't require write locks on the VM object for vm_page_prev/next.

Reviewed by: kib
Sponsored by: Chelsio Communications


# d9c9c81c 21-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: use our roundup2/rounddown2() macros when param.h is available.

rounddown2 tends to produce longer lines than the original code
and when the code has a high indentation level it was not really
advantageous to do the replacement.

This tries to strike a balance between readability using the macros
and flexibility of having the expressions, so not everything is
converted.


# b28cc462 09-Feb-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Include sys/_task.h into uma_int.h, so that taskqueue.h isn't a
requirement for uma_int.h.

Suggested by: jhb


# e60b2fcb 03-Feb-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Redo r292484. Embed task(9) into zone, so that uz_maxaction is called
in a context that can sleep, allowing consumers of the KPI to run their
drain routines without any extra measures.

Discussed with: jtl


# c869e672 19-Dec-2015 Alan Cox <alc@FreeBSD.org>

Introduce a new mechanism for relocating virtual pages to a new physical
address and use this mechanism when:

1. kmem_alloc_{attr,contig}() can't find suitable free pages in the physical
memory allocator's free page lists. This replaces the long-standing
approach of scanning the active and inactive queues, converting clean
pages into PG_CACHED pages and laundering dirty pages. In contrast, the
new mechanism does not use PG_CACHED pages nor does it trigger a large
number of I/O operations.

2. on 32-bit MIPS processors, uma_small_alloc() and the pmap can't find
free pages in the physical memory allocator's free page lists that are
covered by the direct map. Tested by: adrian

3. ttm_bo_global_init() and ttm_vm_page_alloc_dma32() can't find suitable
free pages in the physical memory allocator's free page lists.

In the coming months, I expect that this new mechanism will be applied in
other places. For example, balloon drivers should use relocation to
minimize fragmentation of the guest physical address space.

Make vm_phys_alloc_contig() a little smarter (and more efficient in some
cases). Specifically, use vm_phys_segs[] earlier to avoid scanning free
page lists that can't possibly contain suitable pages.

Reviewed by: kib, markj
Glanced at: jhb
Discussed with: jeff
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D4444


# b0cd2017 16-Dec-2015 Gleb Smirnoff <glebius@FreeBSD.org>

A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().

o With the new KPI, consumers can request contiguous ranges of pages, and
unlike before, all pages will be kept busied on return, as was
done before with the 'reqpage' only. Now the reqpage goes away. With
the new interface it is easier to implement code protected from race
conditions.

Such arrayed requests for now should be preceded by a call to
vm_pager_haspage() to make sure that the request is possible. This
could be improved later, making vm_pager_haspage() obsolete.

Strengthening the promises on the busy state of the array of pages
allows us to remove such hacks as swp_pager_free_nrpage() and
vm_pager_free_nonreq().

o New KPI accepts two integer pointers that may optionally point at
values for read ahead and read behind, that a pager may do, if it
can. These pages are completely owned by pager, and not controlled
by the caller.

This shifts the UFS-specific readahead logic from vm_fault.c, which
should be file system agnostic, into vnode_pager.c. It also removes
one VOP_BMAP() request per hard fault.

Discussed with: kib, alc, jeff, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix


# 5e09bdc8 10-Dec-2015 Conrad Meyer <cem@FreeBSD.org>

vm_page_replace: remove redundant radix lookup

Remove redundant lookup of the old page from vm_page_replace. Verification
that the old page exists is already done by vm_radix_replace.

Submitted by: Ryan Libby <rlibby@gmail.com>
Reviewed by: alc, kib
Sponsored by: EMC / Isilon Storage Division
Follow-up to: https://reviews.freebsd.org/D4326
Differential Revision: https://reviews.freebsd.org/D4471


# 7e78597f 07-Nov-2015 Mark Johnston <markj@FreeBSD.org>

Ensure that deactivated pages that are not expected to be reused are
reclaimed in FIFO order by the pagedaemon. Previously we would enqueue
such pages at the head of the inactive queue, yielding a LIFO reclaim order.

Reviewed by: alc
MFC after: 2 weeks
Sponsored by: EMC / Isilon Storage Division


# bc727596 03-Oct-2015 Alan Cox <alc@FreeBSD.org>

Reduce the scope of a variable to the only file where it is used.


# 3138cd36 30-Sep-2015 Mark Johnston <markj@FreeBSD.org>

As a step towards the elimination of PG_CACHED pages, rework the handling
of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to
the head of the inactive queue instead of being cached.

This affects the implementation of POSIX_FADV_NOREUSE as well, since it
works by applying POSIX_FADV_DONTNEED to file ranges after they have been
read or written. At that point the corresponding buffers may still be
dirty, so the previous implementation would coalesce successive ranges and
apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the
dirty buffers would eventually be cached. To preserve this behaviour in an
efficient manner, this change adds a new buf flag, B_NOREUSE, which causes
the pages backing a VMIO buf to be placed at the head of the inactive queue
when the buf is released. POSIX_FADV_NOREUSE then works by setting this
flag in bufs that underlie the specified range.

Reviewed by: alc, kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3726


# 15aaea78 22-Sep-2015 Alan Cox <alc@FreeBSD.org>

Change vm_page_unwire() such that it (1) accepts PQ_NONE as the specified
queue and (2) returns a Boolean indicating whether the page's wire count
transitioned to zero.

Exploit this change in vfs_vmio_release() to avoid pointlessly enqueueing
a page that is about to be freed.

(An earlier version of this change was developed by attilio@ and kmacy@.
Any errors in this version are my own.)

Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division


# c9af644e 17-Sep-2015 Alan Cox <alc@FreeBSD.org>

Eliminate (many) unnecessary calls to pmap_remove_all(). Pages from objects
with a reference count of zero can't possibly be mapped, so there is never a
need for vm_page_set_invalid() to call pmap_remove_all() on them.

Reviewed by: kib
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division


# d73ce4c6 10-Sep-2015 Mark Johnston <markj@FreeBSD.org>

Remove the v_cache_min and v_cache_max sysctls. They are unused and have
no effect.

Reviewed by: alc
Sponsored by: EMC / Isilon Storage Division


# c25fabea 27-Aug-2015 Mark Johnston <markj@FreeBSD.org>

Remove weighted page handling from vm_page_advise().

This was added in r51337 as part of the implementation of
madvise(MADV_DONTNEED). Its objective was to ensure that the page daemon
would eventually reclaim other unreferenced pages (i.e., unreferenced pages
not touched by madvise()) from the active queue.

Now that the pagedaemon performs steady scanning of the active page queue,
this weighted handling is unnecessary. Instead, always "cache" clean pages
by moving them to the head of the inactive page queue. This simplifies the
implementation of vm_page_advise() and eliminates the fragmentation that
resulted from the distribution of pages among multiple queues.

Suggested by: alc
Reviewed by: alc
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3401


# 52afd687 19-Aug-2015 Andrew Turner <andrew@FreeBSD.org>

Add the kernel support for minidumps on arm64.

Obtained from: ABT Systems Ltd
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D3318


# 966272ca3 07-Jun-2015 Alan Cox <alc@FreeBSD.org>

Retire VM_FREEPOOL_CACHE as the next step in eliminating PG_CACHED pages.

Differential Revision: https://reviews.freebsd.org/D2712
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division


# c4e49ba4 30-May-2015 Alan Cox <alc@FreeBSD.org>

Document vm_page_alloc_contig()'s support for the VM_ALLOC_NODUMP option.

MFC after: 3 days


# 2c20bd8b 20-May-2015 Konstantin Belousov <kib@FreeBSD.org>

Do grammar fix in the comment to record the right commit message for
r283162.

Fix a cosmetic issue with vm_page_alloc() calling vm_page_free_toq()
with the page not completely satisfying vm_page_free() assertions.
The page is not owned by the object, since insertion failed. But
besides m->object reset to NULL, we should also set VPO_UNMANAGED flag
for consistency.

Reported by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# da474990 20-May-2015 Konstantin Belousov <kib@FreeBSD.org>

Remove the write-only variable phent. We currently do not check the
size of the program header's entries.

Reported by: adrian (by using gcc 4.9)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# ed95805e 30-Apr-2015 John Baldwin <jhb@FreeBSD.org>

Remove support for Xen PV domU kernels. Support for HVM domU kernels
remains. Xen is planning to phase out support for PV upstream since it
is harder to maintain and has more overhead. Modern x86 CPUs include
virtualization extensions that support HVM guests instead of PV guests.
In addition, the PV code was i386 only and not as well maintained recently
as the HVM code.
- Remove the i386-only NATIVE option that was used to disable certain
components for PV kernels. These components are now standard as they
are on amd64.
- Remove !XENHVM bits from PV drivers.
- Remove various shims required for XEN (e.g. PT_UPDATES_FLUSH, LOAD_CR3,
etc.)
- Remove duplicate copy of <xen/features.h>.
- Remove unused, i386-only xenstored.h.

Differential Revision: https://reviews.freebsd.org/D2362
Reviewed by: royger
Tested by: royger (i386/amd64 HVM domU and amd64 PVH dom0)
Relnotes: yes


# affc4a4b 29-Apr-2015 Scott Long <scottl@FreeBSD.org>

Improve support for blacklisting bad memory locations. The user can supply
a text file with a list of physical memory addresses to exclude, and have it
loaded at boot time via the provided example in loader.conf. The tunable
'vm.blacklist' remains, but using an external file means that there's no
practical limit to the size of the list. This change also improves the
scanning algorithm for processing the list, scanning the list only once
instead of scanning it for every page in the system. Both the sysctl and
the file can be unsorted and contain duplicates so long as each entry is
numeric (decimal or hex) and is separated by a space, comma, or newline
character. The sysctl 'vm.page_blacklist' is now provided to report what
memory locations were successfully excluded.

Reviewed by: imp, emax
Obtained from: Netflix, Inc.
MFC after: 3 days
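
The list format described above (numeric entries in decimal or hex, separated by spaces, commas, or newlines, possibly unsorted and with duplicates) can be illustrated with a small userland sketch. This is not the kernel's parser: parse_blacklist() and cmp_addr() are hypothetical names, and sorting followed by de-duplication is only one assumed way to prepare the list so that it can be walked once.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Comparison callback for qsort(): ascending order of addresses. */
static int
cmp_addr(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a;
	uint64_t y = *(const uint64_t *)b;

	return ((x > y) - (x < y));
}

/*
 * Hypothetical helper, not the kernel's code: parse up to "max"
 * addresses from "list", then sort them and drop duplicates.
 * Returns the number of unique entries stored in "out".
 */
static size_t
parse_blacklist(const char *list, uint64_t *out, size_t max)
{
	size_t n = 0, i, uniq;
	char *end;

	while (n < max) {
		/* Skip the separators the commit message allows. */
		list += strspn(list, " ,\n");
		if (*list == '\0')
			break;
		/* Base 0 accepts decimal and 0x-prefixed hex (and octal). */
		out[n] = strtoull(list, &end, 0);
		if (end == list)
			break;		/* Not numeric: stop parsing. */
		n++;
		list = end;
	}
	if (n == 0)
		return (0);
	qsort(out, n, sizeof(out[0]), cmp_addr);
	/* Compact the sorted array in place, keeping one copy of each. */
	for (uniq = 1, i = 1; i < n; i++)
		if (out[i] != out[uniq - 1])
			out[uniq++] = out[i];
	return (uniq);
}
```

With the entries sorted and unique, a single pass over the list is enough to match them against ascending physical addresses.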


# b575067a 25-Mar-2015 Rui Paulo <rpaulo@FreeBSD.org>

Add comments about CTLFLAG_RDTUN vs. TUNABLE_INT_FETCH.

Requested by: julian


# 57e5a8b1 24-Mar-2015 Rui Paulo <rpaulo@FreeBSD.org>

Use TUNABLE_INT_FETCH for boot_pages.

vm.boot_pages is marked as a CTLFLAG_RDTUN, but it's used by the VM
before the sysctl subsystem is initialised. We manually fetch the
variable from the environment to work around this problem.

Tested by: Keith White kwhite at uottawa.ca
MFC after: 1 week


# b0bce0ae 24-Mar-2015 Rui Paulo <rpaulo@FreeBSD.org>

Remove whitespace.


# e3ed82bc 22-Dec-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Add flag VM_ALLOC_NOWAIT for vm_page_grab() that prevents sleeping and
allows the function to fail.

Reviewed by: kib, alc
Sponsored by: Nginx, Inc.


# 6ee80f25 22-Dec-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Do not clear flag that vm_page_alloc() doesn't support.

Submitted by: kib


# 271f0f12 15-Nov-2014 Alan Cox <alc@FreeBSD.org>

Enable the use of VM_PHYSSEG_SPARSE on amd64 and i386, making it the default
on i386 PAE. Previously, VM_PHYSSEG_SPARSE could not be used on amd64 and
i386 because vm_page_startup() would not create vm_page structures for the
kernel page table pages allocated during pmap_bootstrap() but those vm_page
structures are needed when the kernel attempts to promote the corresponding
kernel virtual addresses to superpage mappings. To address this problem, a
new public function, vm_phys_add_seg(), is introduced and vm_phys_init() is
updated to reflect the creation of vm_phys_seg structures by calls to
vm_phys_add_seg().

Discussed with: Svatopluk Kraus
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division


# 5e929009 04-Nov-2014 Alan Cox <alc@FreeBSD.org>

Eliminate a stale, i386-specific comment.


# 2be111bf 16-Oct-2014 Davide Italiano <davide@FreeBSD.org>

Follow up to r225617. In order to maximize the re-usability of kernel code
in userland rename in-kernel getenv()/setenv() to kern_setenv()/kern_getenv().
This fixes a namespace collision with libc symbols.

Submitted by: kmacy
Tested by: make universe


# 1a83a822 29-Aug-2014 John Baldwin <jhb@FreeBSD.org>

Fix a typo.


# afb69e6b 08-Aug-2014 Konstantin Belousov <kib@FreeBSD.org>

Adapt vm_page_aflag_set(PGA_WRITEABLE) to the locking of
pmap_enter(PMAP_ENTER_NOSLEEP). The PGA_WRITEABLE flag can be set
when either the page is busied, or the owner object is locked.

Update comments, move all assertions about page state when
PGA_WRITEABLE flag is set, into new helper
vm_page_assert_pga_writeable().

Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# af3b2549 27-Jun-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Pull in r267961 and r267973 again. Fix for issues reported will follow.


# 37a107a4 27-Jun-2014 Glen Barber <gjb@FreeBSD.org>

Revert r267961, r267973:

These changes prevent sysctl(8) from returning proper output,
such as:

1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory


# 3da1cf1e 27-Jun-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating children of a static/extern SYSCTL
node. This operation should probably be factored out into a
common macro, since some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading kludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, since malloc()/free() are
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after: 2 weeks
Sponsored by: Mellanox Technologies


# 3ae10f74 16-Jun-2014 Attilio Rao <attilio@FreeBSD.org>

- Modify vm_page_unwire() and vm_page_enqueue() to directly accept
the queue where to enqueue pages that are going to be unwired.
- Add stronger checks to the enqueue/dequeue for the pagequeues when
adding and removing pages to them.

Of course, for unmanaged pages the queue parameter of vm_page_unwire() will
be ignored, just as the active parameter today.
This makes adding new pagequeues quicker.

This change effectively modifies the KPI. __FreeBSD_version will be,
however, bumped just when the full cache of free pages will be
evicted.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho


# dd05fa19 07-Jun-2014 Alan Cox <alc@FreeBSD.org>

Add a page size field to struct vm_page. Increase the page size field when
a partially populated reservation becomes fully populated, and decrease this
field when a fully populated reservation becomes partially populated.

Use this field to simplify the implementation of pmap_enter_object() on
amd64, arm, and i386.

On all architectures where we support superpages, the cost of creating a
superpage mapping is roughly the same as creating a base page mapping. For
example, both kinds of mappings entail the creation of a single PTE and PV
entry. With this in mind, use the page size field to make the
implementation of vm_map_pmap_enter(..., MAP_PREFAULT_PARTIAL) a little
smarter. Previously, if MAP_PREFAULT_PARTIAL was specified to
vm_map_pmap_enter(), that function would only map base pages. Now, it will
create up to 96 base page or superpage mappings.

Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division


# 44f1c916 22-Mar-2014 Bryan Drewery <bdrewery@FreeBSD.org>

Rename global cnt to vm_cnt to avoid shadowing.

To reduce the diff, the struct pcpu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.

Bump __FreeBSD_version in case some out-of-tree module/code relies on the
global cnt variable.

Exp-run revealed no ports using it directly.

No objection from: arch@
Sponsored by: EMC / Isilon Storage Division


# 793d1407 24-Jan-2014 Alan Cox <alc@FreeBSD.org>

In an effort to diagnose possible corruption of struct vm_page on some
sparc64 machines make the page queue assert in vm_page_dequeue() more
precise. While I'm here switch the page lock assert to the newer style.


# 000fb817 31-Dec-2013 Alan Cox <alc@FreeBSD.org>

Since the introduction of the popmap to reservations in r259999, there is
no longer any need for the page's PG_CACHED and PG_FREE flags to be set and
cleared while the free page queues lock is held. Thus, vm_page_alloc(),
vm_page_alloc_contig(), and vm_page_alloc_freelist() can wait until after
the free page queues lock is released to clear the page's flags. Moreover,
the PG_FREE flag can be retired. Now that the reservation system no longer
uses it, its only uses are in a few assertions. Eliminating these
assertions is no real loss. Other assertions catch the same types of
misbehavior, like doubly freeing a page (see r260032) or dirtying a free
page (free pages are invalid and only valid pages can be dirtied).

Eliminate an unneeded variable from vm_page_alloc_contig().

Sponsored by: EMC / Isilon Storage Division


# 703b304f 08-Dec-2013 Alan Cox <alc@FreeBSD.org>

Eliminate a redundant parameter to vm_radix_replace().

Improve the wording of the comment describing vm_radix_replace().

Reviewed by: attilio
MFC after: 6 weeks
Sponsored by: EMC / Isilon Storage Division


# 9eab5484 17-Sep-2013 Konstantin Belousov <kib@FreeBSD.org>

PG_SLAB no longer serves a useful purpose, since m->object is no
longer abused to store pointer to slab. Remove it.

Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (hrs)


# 3846a822 16-Sep-2013 Konstantin Belousov <kib@FreeBSD.org>

Remove zero-copy sockets code. It only worked for anonymous memory,
and the equivalent functionality is now provided by sendfile(2) over
posix shared memory filedescriptor.

Remove the cow member of struct vm_page, and rearrange the remaining
members. While there, make hold_count unsigned.

Requested and reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
Approved by: re (delphij)


# 196beb53 14-Sep-2013 Konstantin Belousov <kib@FreeBSD.org>

If the last page of the file is partially full and the whole valid
portion is invalidated, invalidate the whole page. Otherwise, a
partially valid page appears on a page queue, which is wrong. This
could only happen for the last page, because only then could the buffer
which triggered invalidation not cover the whole page.

Reported and tested by: pho (previous version)
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (delphij)
MFC after: 2 weeks


# 7a4b2bc5 04-Sep-2013 Konstantin Belousov <kib@FreeBSD.org>

The vm_page_trysbusy() should not fail when shared busy counter or
VPB_BIT_WAITERS flag were changed between reading of busy_lock and the
cas. The vm_page_sbusy(), which is the only user of
vm_page_trysbusy() in the tree, panics on the failure, which in these
cases is transient and does not mean that the current page state
prevents sbusying.

Retry the operation inside vm_page_trysbusy() if cas failed, only
return a failure when VPB_BIT_SHARED is cleared.

Reported and tested by: pho
Reviewed by: attilio
Sponsored by: The FreeBSD Foundation
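
The retry rule above can be sketched with C11 atomics. This is a userland illustration only: the bit layout (BIT_SHARED, BIT_WAITERS, ONE_SHARER) and the name try_sbusy() are assumptions made for the example and do not reproduce the kernel's VPB_* encoding.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical encoding, loosely modeled on the commit's description. */
#define	BIT_SHARED	0x1u	/* Lock is free or held shared. */
#define	BIT_WAITERS	0x2u	/* Some thread sleeps on the lock. */
#define	ONE_SHARER	0x4u	/* Increment for one shared holder. */

static bool
try_sbusy(_Atomic unsigned int *busy_lock)
{
	unsigned int x = atomic_load(busy_lock);

	for (;;) {
		/* Exclusively busied: a genuine, non-transient failure. */
		if ((x & BIT_SHARED) == 0)
			return (false);
		/*
		 * On failure the CAS reloads "x" with the current value,
		 * so a changed sharer count or waiters bit only causes
		 * another iteration, not an error return.
		 */
		if (atomic_compare_exchange_weak(busy_lock, &x,
		    x + ONE_SHARER))
			return (true);
	}
}
```

The only false return is the one taken when the shared bit is clear, so a caller that panics on failure no longer trips over a lost race.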


# 51321f7c 29-Aug-2013 Alan Cox <alc@FreeBSD.org>

Significantly reduce the cost, i.e., run time, of calls to madvise(...,
MADV_DONTNEED) and madvise(..., MADV_FREE). Specifically, introduce a new
pmap function, pmap_advise(), that operates on a range of virtual addresses
within the specified pmap, allowing for a more efficient implementation of
MADV_DONTNEED and MADV_FREE. Previously, the implementation of
MADV_DONTNEED and MADV_FREE relied on per-page pmap operations, such as
pmap_clear_reference(). Intuitively, the problem with this implementation
is that the pmap-level locks are acquired and released and the page table
traversed repeatedly, once for each resident page in the range
that was specified to madvise(2). A more subtle flaw with the previous
implementation is that pmap_clear_reference() would clear the reference bit
on all mappings to the specified page, not just the mapping in the range
specified to madvise(2).

Since our malloc(3) makes heavy use of madvise(2), this change can have a
measurable impact. For example, the system time for completing a parallel
"buildworld" on a 6-core amd64 machine was reduced by about 1.5% to 2.0%.

Note: This change only contains pmap_advise() implementations for a subset
of our supported architectures. I will commit implementations for the
remaining architectures after further testing. For now, a stub function is
sufficient because of the advisory nature of pmap_advise().

Discussed with: jeff, jhb, kib
Tested by: pho (i386), marcel (ia64)
Sponsored by: EMC / Isilon Storage Division


# 133dae88 26-Aug-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Remove comment that is no longer relevant since r254182.


# 776cad90 23-Aug-2013 Alan Cox <alc@FreeBSD.org>

Addendum to r254141: The call to vm_radix_insert() in vm_page_cache() can
reclaim the last preexisting cached page in the object, resulting in a call
to vdrop(). Detect this scenario so that the vnode's hold count is
correctly maintained. Otherwise, we panic.

Reported by: scottl
Tested by: pho
Discussed with: attilio, jeff, kib


# 5944de8e 22-Aug-2013 Konstantin Belousov <kib@FreeBSD.org>

Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9).
The flag was mandatory since r209792, where vm_page_grab(9) was
changed to only support the alloc retry semantic.

Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation


# 28a288cb 21-Aug-2013 Alan Cox <alc@FreeBSD.org>

Addendum to r254141: Allow recursion on the free pages queues lock in
vm_page_alloc_freelist().

Reported and tested by: sbruno
Sponsored by: EMC / Isilon Storage Division


# a834cbae 15-Aug-2013 Attilio Rao <attilio@FreeBSD.org>

On the recovery path for vm_page_alloc(), if a page had been requested
wired, unwind back the wiring bits otherwise we can end up freeing a
page that is considered wired.

Sponsored by: EMC / Isilon storage division
Reported by: alc


# d9e23210 13-Aug-2013 Jeff Roberson <jeff@FreeBSD.org>

Improve pageout flow control to wakeup more frequently and do less work while
maintaining better LRU of active pages.

- Change v_free_target to include the quantity previously represented by
v_cache_min so we don't need to add them together everywhere we use them.
- Add a pageout_wakeup_thresh that sets the free page count trigger for
waking the page daemon. Set this 10% above v_free_min so we wakeup before
any phase transitions in vm users.
- Adjust down v_free_target now that we're willing to accept more pagedaemon
wakeups. This means we process fewer pages in one iteration as well,
leading to shorter lock hold times and less overall disruption.
- Eliminate vm_pageout_page_stats(). This was a minor variation on the
PQ_ACTIVE segment of the normal pageout daemon. Instead we now process
1 / vm_pageout_update_period pages every second. This causes us to visit
the whole active list every 60 seconds. Previously we would only maintain
the active LRU when we were short on pages which would mean it could be
woefully out of date.

Reviewed by: alc (slight variant of this)
Discussed with: alc, kib, jhb
Sponsored by: EMC / Isilon Storage Division
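
The two quantities introduced above are simple arithmetic and can be sketched as follows. The function names are hypothetical, the integer rounding is an assumption, and the 60-second period is inferred from the message's statement that the whole active list is visited every 60 seconds.

```c
/*
 * Free-page count at which the page daemon is woken: 10% above
 * v_free_min.  Integer division truncates, which is an assumption
 * about how the rounding is done.
 */
static unsigned int
wakeup_thresh(unsigned int v_free_min)
{
	return (v_free_min / 10 * 11);
}

/*
 * Active-queue pages to scan each second so that one full pass over
 * "active_count" pages takes "period" seconds.
 */
static unsigned int
active_scan_per_sec(unsigned int active_count, unsigned int period)
{
	return (period == 0 ? 0 : active_count / period);
}
```

For example, with v_free_min at 1000 pages the wakeup threshold is 1100, and an active queue of 120000 pages with a 60-second period is scanned at 2000 pages per second.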


# 60068841 11-Aug-2013 Attilio Rao <attilio@FreeBSD.org>

Correct the recovery logic in vm_page_alloc_contig:
what is really needed in this code snippet is that all the pages that
are already fully inserted get fully freed, while for the others the
object removal itself might be skipped, hence the object might be set to
NULL.

Sponsored by: EMC / Isilon storage division
Reported by: alc, kib
Reviewed by: alc


# c325e866 10-Aug-2013 Konstantin Belousov <kib@FreeBSD.org>

Different consumers of the struct vm_page abuse pageq member to keep
additional information, when the page is guaranteed to not belong to a
paging queue. Usually, this results in a lot of type casts which make
reasoning about the code correctness harder.

Sometimes m->object is used instead of pageq, which could cause real
and confusing bugs if non-NULL m->object is leaked. See r141955 and
r253140 for examples.

Change the pageq member into a union containing explicitly-typed
members. Use them instead of type-punning or abusing m->object in x86
pmaps, uma and vm_page_alloc_contig().

Requested and reviewed by: alc
Sponsored by: The FreeBSD Foundation
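
The idea of replacing an overloaded linkage field with a union of explicitly typed members can be sketched as below. The layout is a simplified illustration: struct page, the member names, and page_stash() are hypothetical and are not the actual struct vm_page definition.

```c
#include <stddef.h>

/* Simplified stand-in for a page structure; not the real struct vm_page. */
struct page {
	union {
		struct {			/* While on a paging queue. */
			struct page *next;
			struct page **prev;
		} q;
		struct {			/* While on a private list. */
			struct page *next;
			void *pv;		/* Consumer-private pointer. */
		} s;
	} plinks;
	void *object;	/* Owning object; never borrowed for private data. */
};

/*
 * Push a page that is not on any paging queue onto a consumer's
 * private singly-linked list, recording a private pointer with it.
 * No casts are needed and "object" is left untouched.
 */
static void
page_stash(struct page **head, struct page *m, void *cookie)
{
	m->plinks.s.pv = cookie;
	m->plinks.s.next = *head;
	*head = m;
}
```

Because each use has its own named member, the compiler checks the types, and a stale non-NULL object pointer can no longer leak out of a private use.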


# cdc00bf7 09-Aug-2013 John Baldwin <jhb@FreeBSD.org>

Revert the addition of VPO_BUSY and instead update vm_page_replace() to
properly unbusy the page.

Submitted by: alc


# e946b949 09-Aug-2013 Attilio Rao <attilio@FreeBSD.org>

On all architectures, avoid preallocating the physical memory
for nodes used in vm_radix.
On architectures supporting direct mapping, also avoid preallocating
the KVA for such nodes.

In order to do so, allow the operations derived from vm_radix_insert()
to fail, and handle all the resulting failures.

vm_radix-wise introduce a new function called vm_radix_replace(),
which can replace a leaf node, already present, with a new one,
and take into account the possibility, during vm_radix_insert()
allocation, that the operations on the radix trie can recurse.
This means that if operations in vm_radix_insert() recursed
vm_radix_insert() will start from scratch again.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc (older version)
Reviewed by: jeff
Tested by: pho, scottl


# c7aebda8 09-Aug-2013 Attilio Rao <attilio@FreeBSD.org>

The soft and hard busy mechanisms rely on the vm object lock to work.
Unify the 2 concepts into a real, minimal, sxlock where the shared
acquisition represents the soft busy and the exclusive acquisition
represents the hard busy.
The old VPO_WANTED mechanism becomes the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becomes an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.

Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
and vm_page_grab are being executed. This will be very helpful
once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag

The KPI is heavily changed, which is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.

Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff, kib
Tested by: gavin, bapt (older version)
Tested by: pho, scottl


# 449c2e92 07-Aug-2013 Konstantin Belousov <kib@FreeBSD.org>

Split the pagequeues per NUMA domain, and split the pagedaemon process
into threads each processing queue in a single domain. The structure
of the pagedaemons and queues is kept intact, most of the changes come
from the need for code to find an owning page queue for given page,
calculated from the segment containing the page.

The tie between NUMA domain and pagedaemon thread/pagequeue split is
rather arbitrary, the multithreaded daemon could be allowed for the
single-domain machines, or one domain might be split into several page
domains, to further increase concurrency.

Right now, each pagedaemon thread tries to reach the global target,
precalculated at the start of the pass. This is not optimal, since it
could cause excessive page deactivation and freeing. The code should
be changed to re-check the global page deficit state in the loop after
some number of iterations.

The pagedaemons reach the quorum before starting the OOM, since one
thread's inability to meet the target is normal for split queues. Only
when all pagedaemons fail to produce enough reusable pages is OOM
started by a single selected thread.

Launder is modified to take into account the segments layout with
regard to the region for which cleaning is performed.

Based on the preliminary patch by jeff, sponsored by EMC / Isilon
Storage Division.

Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation


# 3abeb811 10-Jul-2013 Konstantin Belousov <kib@FreeBSD.org>

In the vm_page_set_invalid() function, do not assert that the page is
not busy, since its only caller brelse() can legitimately call it on
a busy page. This happens for VOP_PUTPAGES() on filesystems that use
buffers and whose VOP_WRITE() method marked the buffer containing the page
as non-cacheable.

Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 6b5fbc12 03-Jul-2013 Neel Natu <neel@FreeBSD.org>

vm_phys_fictitious_reg_range() was losing the 'memattr' because it would be
reset by pmap_page_init() right after being initialized in vm_page_initfake().

The statement above is with reference to the amd64 implementation of
pmap_page_init().

Fix this by calling 'pmap_page_init()' in 'vm_page_initfake()' before changing
the 'memattr'.

Reviewed by: kib
MFC after: 2 weeks


# 4aa4cd8e 24-Jun-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Typo in comment.


# 2051980f 09-Jun-2013 Alan Cox <alc@FreeBSD.org>

Revise the interface between vm_object_madvise() and vm_page_dontneed() so
that pointless calls to pmap_is_modified() can be easily avoided when
performing madvise(..., MADV_FREE).

Sponsored by: EMC / Isilon Storage Division


# be6ec553 03-Jun-2013 Konstantin Belousov <kib@FreeBSD.org>

Remove irrelevant comments.

Discussed with: alc
MFC after: 3 days


# b4171812 02-Jun-2013 Alan Cox <alc@FreeBSD.org>

Require that the page lock is held, instead of the object lock, when
clearing the page's PGA_REFERENCED flag. Since we are typically
manipulating the page's act_count field when we are clearing its
PGA_REFERENCED flag, the page lock is already held everywhere that we clear
the PGA_REFERENCED flag. So, in fact, this revision only changes some
comments and an assertion. Nonetheless, it will enable later changes to
object locking in the pageout code.

Introduce vm_page_assert_locked(), which completely hides the implementation
details of the page lock from the caller, and use it in
vm_page_aflag_clear(). (The existing vm_page_lock_assert() could not be
used in vm_page_aflag_clear().) Over the coming weeks, I expect that we'll
either eliminate or replace the various uses of vm_page_lock_assert() with
vm_page_assert_locked().

Reviewed by: attilio
Sponsored by: EMC / Isilon Storage Division


# b4e49807 01-Jun-2013 Alan Cox <alc@FreeBSD.org>

Now that access to the page's "act_count" field is synchronized by the page
lock instead of the object lock, there is no reason for vm_page_activate()
to assert that the object is locked for either read or write access.
(The "VPO_UNMANAGED" flag never changes after page allocation.)

Sponsored by: EMC / Isilon Storage Division


# 9af6d512 21-May-2013 Attilio Rao <attilio@FreeBSD.org>

o Relax locking assertions for vm_page_find_least()
o Relax locking assertions for pmap_enter_object() and add them also
to architectures that currently don't have any
o Introduce VM_OBJECT_LOCK_DOWNGRADE() which is basically a downgrade
operation on the per-object rwlock
o Use all the mechanisms above to make vm_map_pmap_enter() work
most of the time with only read locks.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc


# 4fab678b 21-May-2013 Konstantin Belousov <kib@FreeBSD.org>

Add ddb command 'show pginfo' which provides useful information about
a vm page, denoted either by an address of the struct vm_page, or, if
the '/p' modifier is specified, by a physical address of the
corresponding frame.

Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 767a6420 17-May-2013 Alan Cox <alc@FreeBSD.org>

Relax the object locking assertion in vm_page_lookup(). Now that a radix
tree is used to maintain the object's collection of resident pages,
vm_page_lookup() no longer needs an exclusive lock.

Reviewed by: attilio
Sponsored by: EMC / Isilon Storage Division


# df839389 13-May-2013 Peter Wemm <peter@FreeBSD.org>

Bandaid for compiling with gcc, which happens to be the default compiler
for a number of platforms still.


# 404eb1b3 12-May-2013 Alan Cox <alc@FreeBSD.org>

Refactor vm_page_alloc()'s interactions with vm_reserv_alloc_page() and
vm_page_insert() so that (1) vm_radix_lookup_le() is never called while the
free page queues lock is held and (2) vm_radix_lookup_le() is called at most
once. This change reduces the average time that the free page queues lock
is held by vm_page_alloc() as well as vm_page_alloc()'s average overall
running time.

Sponsored by: EMC / Isilon Storage Division


# 774d251d 17-Mar-2013 Attilio Rao <attilio@FreeBSD.org>

Sync back vmcontention branch into HEAD:
Replace the per-object resident and cached pages splay tree with a
path-compressed multi-digit radix trie.
Along with this, switch also the x86-specific handling of idle page
tables to using the radix trie.

This change is supposed to do the following:
- Allowing the acquisition of read locking for lookup operations of the
resident/cached pages collections as the per-vm_page_t splay iterators
are now removed.
- Increase the scalability of the operations on the page collections.

The radix trie does rely on the consumers locking to ensure atomicity of
its operations. In order to avoid deadlocks the bisection nodes are
pre-allocated in the UMA zone. This can be done safely because the
algorithm needs at maximum one new node per insert which means the
maximum number of the desired nodes is the number of available physical
frames themselves. However, a new bisection node is not always
needed.

The radix trie implements path-compression because UFS indirect blocks
can lead to several objects with a very sparse trie, increasing the number
of levels to usually scan. It also helps in the nodes pre-fetching by
introducing the single node per-insert property.

This code is not generalized (yet) because of the possible loss of
performance by having much of the sizes in play configurable.
However, efforts to make this code more general and then reusable in
further different consumers might be really done.

The only KPI change is the removal of the function vm_page_splay() which
is now reaped.
The only KBI change, instead, is the removal of the left/right iterators
from struct vm_page, which are now reaped.

Further technical notes, broken into pieces, can be retrieved from the
svn branch:
http://svn.freebsd.org/base/user/attilio/vmcontention/

Sponsored by: EMC / Isilon storage division
In collaboration with: alc, jeff
Tested by: flo, pho, jhb, davide
Tested by: ian (arm)
Tested by: andreast (powerpc)


# 4bc80a34 11-Mar-2013 Attilio Rao <attilio@FreeBSD.org>

Simplify vm_page_is_valid().

Sponsored by: EMC / Isilon storage division
Reviewed by: alc


# 34496b53 09-Mar-2013 Alan Cox <alc@FreeBSD.org>

Update a comment: The object lock is no longer a mutex.


# 89f6b863 08-Mar-2013 Attilio Rao <attilio@FreeBSD.org>

Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical, but a few notes are in order:
* The KPI changes as follows:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
For this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI is heavily broken by this commit. Third-party ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho


# c9341161 08-Mar-2013 Attilio Rao <attilio@FreeBSD.org>

Merge from vmc-playground:
Introduce a new KPI that verifies if the page cache is empty for a
specified vm_object. This KPI does not make assumptions about the
locking in order to be used also for building assertions at init and
destroy time.
It is mostly used to hide implementation details of the page cache.

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: alc (vm_radix based version)
Tested by: flo, pho, jhb, davide


# 0dde287b 26-Feb-2013 Attilio Rao <attilio@FreeBSD.org>

Wrap the sleeps synchronized by the vm_object lock into the specific
macro VM_OBJECT_SLEEP().
This hides some implementation details like the usage of the msleep()
primitive and the necessity to access the lock address directly.
For this reason VM_OBJECT_MTX() macro is now retired.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho


# 28634820 08-Dec-2012 Alan Cox <alc@FreeBSD.org>

In the past four years, we've added two new vm object types. Each time,
similar changes had to be made in various places throughout the machine-
independent virtual memory layer to support the new vm object type.
However, in most of these places, it's actually not the type of the vm
object that matters to us but instead certain attributes of its pages.
For example, OBJT_DEVICE, OBJT_MGTDEVICE, and OBJT_SG objects contain
fictitious pages. In other words, in most of these places, we were
testing the vm object's type to determine if it contained fictitious (or
unmanaged) pages.

To both simplify the code in these places and make the addition of future
vm object types easier, this change introduces two new vm object flags
that describe attributes of the vm object's pages, specifically, whether
they are fictitious or unmanaged.

Reviewed and tested by: kib
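
The difference between testing an object's type and testing a flag that describes its pages can be sketched as follows. The structure is a minimal stand-in, and the flag names OBJ_FICTITIOUS and OBJ_UNMANAGED are used here for illustration since the message does not name the flags. Only the three types that the message says contain fictitious pages are marked.

```c
#include <stdbool.h>

enum obj_type {
	OBJT_DEFAULT, OBJT_VNODE, OBJT_DEVICE, OBJT_MGTDEVICE, OBJT_SG
};

#define	OBJ_FICTITIOUS	0x1u	/* Object's pages are fictitious. */
#define	OBJ_UNMANAGED	0x2u	/* Object's pages are unmanaged. */

/* Minimal stand-in for a vm object; not the real struct vm_object. */
struct object {
	enum obj_type	type;
	unsigned int	flags;
};

/*
 * Set the page-attribute flags once, at creation time.  OBJ_UNMANAGED
 * is left clear because the message does not say which types it covers.
 */
static void
object_init(struct object *obj, enum obj_type type)
{
	obj->type = type;
	obj->flags = 0;
	switch (type) {
	case OBJT_DEVICE:
	case OBJT_MGTDEVICE:
	case OBJT_SG:
		obj->flags |= OBJ_FICTITIOUS;
		break;
	default:
		break;
	}
}

/* Callers test the attribute and never need to enumerate object types. */
static bool
object_has_fictitious_pages(const struct object *obj)
{
	return ((obj->flags & OBJ_FICTITIOUS) != 0);
}
```

Adding a new object type then means touching only the place where the flags are assigned, not every site that cares about fictitious pages.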


# 0d69690e 20-Nov-2012 Alan Cox <alc@FreeBSD.org>

Correct an error in r230623. When both VM_ALLOC_NODUMP and VM_ALLOC_ZERO
were specified to vm_page_alloc(), PG_NODUMP wasn't being set on the
allocated page when it happened to be pre-zeroed.

MFC after: 5 days


# 8d220203 12-Nov-2012 Alan Cox <alc@FreeBSD.org>

Replace the single, global page queues lock with per-queue locks on the
active and inactive paging queues.

Reviewed by: kib


# 9fc4739d 01-Nov-2012 Alan Cox <alc@FreeBSD.org>

In general, we call pmap_remove_all() before calling vm_page_cache(). So,
the call to pmap_remove_all() within vm_page_cache() is usually redundant.
This change eliminates that call to pmap_remove_all() and introduces a
call to pmap_remove_all() before vm_page_cache() in the one place where
it didn't already exist.

When iterating over a paging queue, if the object containing the current
page has a zero reference count, then the page can't have any managed
mappings. So, a call to pmap_remove_all() is pointless.

Change a panic() call in vm_page_cache() to a KASSERT().

MFC after: 6 weeks


# 4ceaf45d 31-Oct-2012 Attilio Rao <attilio@FreeBSD.org>

Rework the known mutexes that benefit from staying on their own
cache line to use struct mtx_padalign instead of manual
frobbing.

The sole exceptions are the nvme and sfxge drivers, where the author
redefined CACHE_LINE_SIZE manually, so they need to be analyzed and
dealt with separately.

Reviewed by: jimharris, alc


# 081a4881 29-Oct-2012 Alan Cox <alc@FreeBSD.org>

Replace the page hold queue, PQ_HOLD, by a new page flag, PG_UNHOLDFREE,
because the queue itself serves no purpose. When a held page is freed,
inserting the page into the hold queue has the side effect of setting the
page's "queue" field to PQ_HOLD. Later, when the page is unheld, it will
be freed because the "queue" field is PQ_HOLD. In other words, PQ_HOLD is
used as a flag, not a queue. So, this change replaces it with a flag.

To accommodate the new page flag, make the page's "flags" field wider and
"oflags" field narrower.

Reviewed by: kib


# 7ecfabc7 13-Oct-2012 Alan Cox <alc@FreeBSD.org>

Move vm_page_requeue() to the only file that uses it.

MFC after: 3 weeks


# 9af47af6 13-Oct-2012 Alan Cox <alc@FreeBSD.org>

Eliminate the conditional for releasing the page queues lock in
vm_page_sleep(). vm_page_sleep() is no longer called with this lock
held.

Eliminate assertions that the page queues lock is NOT held. These
assertions won't translate well to having distinct locks on the active
and inactive page queues, and they really aren't that useful.

MFC after: 3 weeks


# 4db2c4b8 02-Oct-2012 Alan Cox <alc@FreeBSD.org>

Tidy up a bit:

Update some of the comments. In particular, use "sleep" in preference to
"block" where appropriate.

Eliminate some unnecessary casts.

Make a few whitespace changes for consistency.

Reviewed by: kib
MFC after: 3 days


# b6c00483 14-Aug-2012 Konstantin Belousov <kib@FreeBSD.org>

Do not leave invalid pages in the object after a short read for
network file systems (not only NFS proper). Short reads cause pages
other than the requested one, which were not filled by the read response,
to stay invalid.

Change the vm_page_readahead_finish() interface to not take the error
code, but instead to make a decision to free or to (de)activate the
page only by its validity. As result, not requested invalid pages are
freed even if the read RPC indicated success.

Noted and reviewed by: alc
MFC after: 1 week


# 0055cbd3 04-Aug-2012 Konstantin Belousov <kib@FreeBSD.org>

Reduce code duplication and exposure of direct access to struct
vm_page oflags by providing helper function
vm_page_readahead_finish(), which handles completed reads for pages
with indexes other than the requested one, for VOP_GETPAGES().

Reviewed by: alc
MFC after: 1 week


# 369763e3 02-Aug-2012 Alan Cox <alc@FreeBSD.org>

Inline vm_page_aflags_clear() and vm_page_aflags_set().

Add comments stating that neither these functions nor the flags that they
are used to manipulate are part of the KBI.


# da1ab8a4 16-Jul-2012 Alan Cox <alc@FreeBSD.org>

Correct vm_page_alloc_contig()'s implementation of VM_ALLOC_NODUMP.


# e30df26e 26-Jun-2012 Alan Cox <alc@FreeBSD.org>

Add new pmap layer locks to the predefined lock order. Change the names
of a few existing VM locks to follow a consistent naming scheme.


# eddc9291 20-Jun-2012 Alan Cox <alc@FreeBSD.org>

Selectively inline vm_page_dirty().


# 6031c68d 16-Jun-2012 Alan Cox <alc@FreeBSD.org>

The page flag PGA_WRITEABLE is set and cleared exclusively by the pmap
layer, but it is read directly by the MI VM layer. This change introduces
pmap_page_is_write_mapped() in order to completely encapsulate all direct
access to PGA_WRITEABLE in the pmap layer.

Aesthetics aside, I am making this change because amd64 will likely begin
using an alternative method to track write mappings, and having
pmap_page_is_write_mapped() in place allows me to make such a change
without further modification to the MI VM layer.

As an added bonus, tidy up some nearby comments concerning page flags.

Reviewed by: kib
MFC after: 6 weeks


# c415e172 22-May-2012 Andrew Turner <andrew@FreeBSD.org>

Fix booting on ARM.

In PHYS_TO_VM_PAGE() when VM_PHYSSEG_DENSE is set the check if we are past
the end of vm_page_array was incorrect causing it to return NULL. This
value is then used in vm_phys_add_page causing a data abort.

Reviewed by: alc, kib, imp
Tested by: stas


# ccc4a5c7 20-May-2012 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

Replace the list of PVOs owned by each PMAP with an RB tree. This simplifies
range operations like pmap_remove() and pmap_protect() as well as allowing
simple operations like pmap_extract() not to involve any global state.
This substantially reduces lock coverage for the global table lock and
improves concurrency.


# b6de32bd 12-May-2012 Konstantin Belousov <kib@FreeBSD.org>

Add a facility to register a range of physical addresses to be used
for allocation of fictitious pages, for which PHYS_TO_VM_PAGE()
returns a proper fictitious vm_page_t. The range should be de-registered
after the consumer stops using it.

De-inline the PHYS_TO_VM_PAGE() since it now carries code to iterate
over registered ranges.
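
A range registry of this kind might look roughly like the following userland sketch. All names here (`sk_fict_register`, `sk_phys_to_page`, the fixed-size table, and so on) are hypothetical and chosen for illustration; the real kernel interface and data structures may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_PAGE_SIZE	4096u
#define SK_MAX_RANGES	4

/* Hypothetical stand-in for a fictitious page structure. */
struct sk_page {
	uint64_t phys_addr;
};

struct sk_fict_range {
	uint64_t	 start, end;	/* physical range [start, end) */
	struct sk_page	*pages;		/* caller-provided, one per page */
	int		 in_use;
};

static struct sk_fict_range sk_ranges[SK_MAX_RANGES];

/* Register a page-aligned range; returns 0 on success, -1 if the table is full. */
static int
sk_fict_register(uint64_t start, uint64_t end, struct sk_page *pages)
{
	assert(start < end);
	assert(start % SK_PAGE_SIZE == 0 && end % SK_PAGE_SIZE == 0);
	for (int i = 0; i < SK_MAX_RANGES; i++) {
		struct sk_fict_range *r = &sk_ranges[i];

		if (r->in_use)
			continue;
		r->start = start;
		r->end = end;
		r->pages = pages;
		r->in_use = 1;
		for (uint64_t pa = start; pa < end; pa += SK_PAGE_SIZE)
			pages[(pa - start) / SK_PAGE_SIZE].phys_addr = pa;
		return (0);
	}
	return (-1);
}

/* De-register a previously registered range. */
static void
sk_fict_unregister(uint64_t start, uint64_t end)
{
	for (int i = 0; i < SK_MAX_RANGES; i++)
		if (sk_ranges[i].in_use && sk_ranges[i].start == start &&
		    sk_ranges[i].end == end)
			sk_ranges[i].in_use = 0;
}

/* Look up the page for a physical address by iterating over the ranges. */
static struct sk_page *
sk_phys_to_page(uint64_t pa)
{
	for (int i = 0; i < SK_MAX_RANGES; i++) {
		const struct sk_fict_range *r = &sk_ranges[i];

		if (r->in_use && pa >= r->start && pa < r->end)
			return (&r->pages[(pa - r->start) / SK_PAGE_SIZE]);
	}
	return (NULL);
}
```

The linear scan in the lookup is what makes the function too heavy to keep inline, which is consistent with the de-inlining noted above.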

A hash container might be developed instead of the range registration
interface, and fake pages could be put automatically into the hash,
where PHYS_TO_VM_PAGE() could look them up later. This should be
considered before the MFC of the commit is done.

Sponsored by: The FreeBSD Foundation
Reviewed by: alc
MFC after: 1 month


# e461aae7 12-May-2012 Konstantin Belousov <kib@FreeBSD.org>

Split the code from vm_page_getfake() to initialize the fake page struct
vm_page into new interface vm_page_initfake(). Handle the case of fake
page re-initialization with changed memattr.

Sponsored by: The FreeBSD Foundation
Reviewed by: alc
MFC after: 1 month


# 116c2135 12-May-2012 Konstantin Belousov <kib@FreeBSD.org>

Assert that the page passed to vm_page_putfake() is unmanaged.

Sponsored by: The FreeBSD Foundation
Reviewed by: alc
MFC after: 1 month


# 0c26bb71 12-May-2012 Konstantin Belousov <kib@FreeBSD.org>

Make the vm_page_array_size long. Remove redundant zero initialization
for vm_page_array_size and nearby variables.

Sponsored by: The FreeBSD Foundation
Reviewed by: alc
MFC after: 1 month


# 0b852c03 22-Apr-2012 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

Avoid a lock order reversal in pmap_extract_and_hold() from relocking
the page. This PMAP requires an additional lock besides the PMAP lock
in pmap_extract_and_hold(), which vm_page_pa_tryrelock() did not release.

Suggested by: kib
MFC after: 4 days


# 2aa163dc 21-Apr-2012 Alan Cox <alc@FreeBSD.org>

As documented in vm_page.h, updates to the vm_page's flags no longer
require the page queues lock.

MFC after: 1 week


# a0f2c37b 09-Apr-2012 Attilio Rao <attilio@FreeBSD.org>

- Introduce a cache-miss optimization for consistency with other
accesses of the cache member of vm_object objects.
- Use novel vm_page_is_cached() for checks outside of the vm subsystem.

Reviewed by: alc
MFC after: 2 weeks
X-MFC: r234039


# 1c8279e4 08-Apr-2012 Alan Cox <alc@FreeBSD.org>

Fix mincore(2) so that it reports PG_CACHED pages as resident.

MFC after: 2 weeks


# d1aa86e1 06-Apr-2012 Attilio Rao <attilio@FreeBSD.org>

Staticize vm_page_cache_remove().

Reviewed by: alc


# 263811f7 27-Jan-2012 Kip Macy <kmacy@FreeBSD.org>

Exclude kmem_alloc'ed ARC data buffers from kernel minidumps on amd64.
Excluding other allocations, including UMA, now entails the addition of
a single flag to kmem_alloc or uma zone create.

Reviewed by: alc, avg
MFC after: 2 weeks


# c68c3537 05-Dec-2011 Alan Cox <alc@FreeBSD.org>

Introduce vm_reserv_alloc_contig() and teach vm_page_alloc_contig() how to
use superpage reservations. So, for the first time, kernel virtual memory
that is allocated by contigmalloc(), kmem_alloc_attr(), and
kmem_alloc_contig() can be promoted to superpages. In fact, even a series
of small contigmalloc() allocations may collectively result in a promoted
superpage.

Eliminate some duplication of code in vm_reserv_alloc_page().

Change the type of vm_reserv_reclaim_contig()'s first parameter in order
that it be consistent with other vm_*_contig() functions.

Tested by: marius (sparc64)


# dc874f98 30-Nov-2011 Konstantin Belousov <kib@FreeBSD.org>

Rename vm_page_set_valid() to vm_page_set_valid_range().
The vm_page_set_valid() is the most reasonable name for the m->valid
accessor.

Reviewed by: attilio, alc


# cf1911a9 29-Nov-2011 Konstantin Belousov <kib@FreeBSD.org>

Hide the internals of vm_page_lock(9) from the loadable modules.
Since the address of the vm_page lock mutex depends on the kernel options,
it is easy for a module to get out of sync with the kernel.

No vm_page_lockptr() accessor is provided for modules. It can be added
later if needed, unless a proper KPI is developed to serve the needs.

Reviewed by: attilio, alc
MFC after: 3 weeks


# 5ff276b7 16-Nov-2011 Alan Cox <alc@FreeBSD.org>

Eliminate end-of-line white space.


# fbd80bd0 16-Nov-2011 Alan Cox <alc@FreeBSD.org>

Refactor the code that performs physically contiguous memory allocation,
yielding a new public interface, vm_page_alloc_contig(). This new function
addresses some of the limitations of the current interfaces, contigmalloc()
and kmem_alloc_contig(). For example, the physically contiguous memory that
is allocated with those interfaces can only be allocated to the kernel vm
object and must be mapped into the kernel virtual address space. It also
provides functionality that vm_phys_alloc_contig() doesn't, such as wiring
the returned pages. Moreover, unlike that function, it respects the low
water marks on the paging queues and wakes up the page daemon when
necessary. That said, at present, this new function can't be applied to all
types of vm objects. However, that restriction will be eliminated in the
coming weeks.

From a design standpoint, this change also addresses an inconsistency
between vm_phys_alloc_contig() and the other vm_phys_alloc*() functions.
Specifically, vm_phys_alloc_contig() manipulated vm_page fields that other
functions in vm/vm_phys.c didn't. Moreover, vm_phys_alloc_contig() knew
about vnodes and reservations. Now, vm_page_alloc_contig() is responsible
for these things.

Reviewed by: kib
Discussed with: jhb


# c835bd16 05-Nov-2011 Alan Cox <alc@FreeBSD.org>

Wake up the page daemon in vm_page_alloc_freelist() if it couldn't
allocate the requested page because too few pages are cached or free.

Document the VM_ALLOC_COUNT() option to vm_page_alloc() and
vm_page_alloc_freelist().

Make style changes to vm_page_alloc() and vm_page_alloc_freelist(),
such as using a variable name that more closely corresponds to the
comments.


# 561cc9fc 05-Nov-2011 Konstantin Belousov <kib@FreeBSD.org>

Provide typedefs for the type of bit mask for the page bits.
Use the defined types instead of int when manipulating masks.
Supposedly, it could fix support for 32KB page size in the
machine-independent VM layer.

Reviewed by: alc
MFC after: 2 weeks


# 83937680 01-Nov-2011 Alan Cox <alc@FreeBSD.org>

Add support for VM_ALLOC_WIRED and VM_ALLOC_ZERO to vm_page_alloc_freelist()
and use these new options in the mips pmap.

Wake up the page daemon in vm_page_alloc_freelist() if the number of free
and cached pages becomes too low.

Tidy up vm_page_alloc_init(). In particular, add a comment about an
important restriction on its use.

Tested by: jchandra@


# 125b695b 27-Oct-2011 Alan Cox <alc@FreeBSD.org>

Tidy up the comment at the head of vm_page_alloc, and mention that the
returned page has the flag VPO_BUSY set.


# 9c60ca32 25-Oct-2011 Alan Cox <alc@FreeBSD.org>

Speed up vm_page_cache() and vm_page_remove() by checking for a few
common cases that can be handled in constant time. The insight being
that a page's parent in the vm object's tree is very often its
predecessor or successor in the vm object's ordered memq.

Tested by: jhb
MFC after: 10 days


# 17514c1b 28-Sep-2011 Konstantin Belousov <kib@FreeBSD.org>

Style nit.

Submitted by: jhb
MFC after: 2 weeks


# 2042bb37 28-Sep-2011 Konstantin Belousov <kib@FreeBSD.org>

Fix grammar.

Submitted by: bf
MFC after: 2 weeks


# abb9b935 28-Sep-2011 Konstantin Belousov <kib@FreeBSD.org>

Use the trick of performing the atomic operation on the contained aligned
word to handle the dirty mask updates in vm_page_clear_dirty_mask().
Remove the vm page queue lock around the vm_page_dirty() call in
vm_fault_hold(), the sole purpose of which was to protect the dirty field
on architectures which do not provide short or byte-wide atomics.
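
The containing-word trick can be illustrated with C11 atomics. This is a minimal sketch, assuming a hypothetical layout in which four byte-wide fields are packed into one aligned 32-bit word and are addressed by bit position; the type and function names are invented for the example. An implementation that starts from the byte's address would also have to account for byte order when computing the shift.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical layout: four byte-wide fields packed into one aligned
 * 32-bit word.  Field idx occupies bits [8 * idx, 8 * idx + 7].
 */
typedef _Atomic(uint32_t) packed_word_t;

/* Atomically clear the bits in "mask" within byte field "idx". */
static void
packed_clear_bits(packed_word_t *w, unsigned idx, uint8_t mask)
{
	assert(idx < 4);
	/* Every other field sees all-ones in the AND operand and is untouched. */
	atomic_fetch_and(w, ~((uint32_t)mask << (idx * 8)));
}

/* Read byte field "idx". */
static uint8_t
packed_read(packed_word_t *w, unsigned idx)
{
	assert(idx < 4);
	return ((uint8_t)(atomic_load(w) >> (idx * 8)));
}
```

Because the update is a single word-wide atomic AND, no lock is needed even where the hardware lacks byte-wide or short-wide atomic instructions.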

Reviewed by: alc, attilio
Tested by: flo (sparc64)
MFC after: 2 weeks


# 3407fefe 06-Sep-2011 Konstantin Belousov <kib@FreeBSD.org>

Split the vm_page flags PG_WRITEABLE and PG_REFERENCED into an atomic
flags field. Updates to the atomic flags are performed using the atomic
ops on the containing word, do not require any vm lock to be held, and
are non-blocking. The vm_page_aflag_set(9) and vm_page_aflag_clear(9)
functions are provided to modify aflags.

Document the changes to flags field to only require the page lock.

Introduce vm_page_reference(9) function to provide a stable KPI and
KBI for filesystems like tmpfs and zfs which need to mark a page as
referenced.

Reviewed by: alc, attilio
Tested by: marius, flo (sparc64); andreast (powerpc, powerpc64)
Approved by: re (bz)


# d98d0ce2 09-Aug-2011 Konstantin Belousov <kib@FreeBSD.org>

- Move the PG_UNMANAGED flag from m->flags to m->oflags, renaming the flag
to VPO_UNMANAGED (and also making the flag protected by the vm object
lock, instead of vm page queue lock).
- Mark the fake pages with both PG_FICTITIOUS (as it is now) and
VPO_UNMANAGED. As a consequence, pmap code now can use just
VPO_UNMANAGED to decide whether the page is unmanaged.

Reviewed by: alc
Tested by: pho (x86, previous version), marius (sparc64),
marcel (arm, ia64, powerpc), ray (mips)
Sponsored by: The FreeBSD Foundation
Approved by: re (bz)


# 1bfec3df 22-Jun-2011 Alan Cox <alc@FreeBSD.org>

Revert to using the page queues lock in vm_page_clear_dirty_mask() on
MIPS. (At present, although atomic_clear_char() is defined by atomic.h
on MIPS, it is not actually implemented by support.S.)


# 3c76db4c 19-Jun-2011 Alan Cox <alc@FreeBSD.org>

Precisely document the synchronization rules for the page's dirty field.
(Saying that the lock on the object that the page belongs to must be held
only represents one aspect of the rules.)

Eliminate the use of the page queues lock for atomically performing read-
modify-write operations on the dirty field when the underlying architecture
supports atomic operations on char and short types.

Document the fact that 32KB pages aren't really supported.

Reviewed by: attilio, kib


# 3b1025d2 11-Jun-2011 Konstantin Belousov <kib@FreeBSD.org>

Assert that the page is VPO_BUSY or the page owner object is locked in
vm_page_undirty(). The assert is not precise because the VPO_BUSY owner
is not tracked, so the assertion does not catch the case when VPO_BUSY is
owned by another thread.

Reviewed by: alc


# 10cf2560 11-Mar-2011 Alan Cox <alc@FreeBSD.org>

Eliminate duplication of the fake page code and zone by the device and sg
pagers.

Reviewed by: jhb


# e6ffa214 17-Feb-2011 Alan Cox <alc@FreeBSD.org>

Remove pmap fields that are either unused or not fully implemented.

Discussed with: kib


# d7b20e4b 11-Feb-2011 Alan Cox <alc@FreeBSD.org>

Retire VFS_BIO_DEBUG. Convert those checks that were still valid into
KASSERT()s and eliminate the rest.

Replace excessive printf()s and a panic() in bufdone_finish() with a
KASSERT() in vm_page_io_finish().

Reviewed by: kib


# 3d05198e 30-Jan-2011 Alan Cox <alc@FreeBSD.org>

Release the free page queues lock earlier in vm_page_alloc().

Discussed with: kib@


# 4053b05b 21-Jan-2011 Sergey Kandaurov <pluknet@FreeBSD.org>

Make MSGBUF_SIZE kernel option a loader tunable kern.msgbufsize.

Submitted by: perryh pluto.rain.com (previous version)
Reviewed by: jhb
Approved by: kib (mentor)
Tested by: universe


# 4c6a2e7a 16-Jan-2011 Alan Cox <alc@FreeBSD.org>

Shift responsibility for synchronizing access to the page's act_count
field to the object's lock.

Reviewed by: kib@


# 9648f344 16-Jan-2011 Alan Cox <alc@FreeBSD.org>

Clean up the start of vm_page_alloc(). In particular, eliminate an
assertion that is no longer required. Long ago, calls to vm_page_alloc()
from an interrupt handler had to specify VM_ALLOC_INTERRUPT so that
vm_page_alloc() would not attempt to reclaim a PQ_CACHE page from another vm
object. Today, with the synchronization on a vm object's collection of
PQ_CACHE pages, this is no longer an issue. In fact, VM_ALLOC_INTERRUPT now
reclaims PQ_CACHE pages just like VM_ALLOC_{NORMAL,SYSTEM}.

MFC after: 3 weeks


# 27772ddf 08-Jan-2011 Alan Cox <alc@FreeBSD.org>

Eliminate a redundant alignment directive on the page locks array.


# ce8a13bd 08-Jan-2011 Alan Cox <alc@FreeBSD.org>

Eliminate the counting of vm_page_pa_tryrelock calls. We really don't
need it anymore. Moreover, its implementation had a type mismatch, a
long is not necessarily an uint64_t. (This mismatch was hidden by
casting.) Move the remaining two counters up a level in the sysctl
hierarchy. There is no reason for them to be under the vm.pmap node.

Reviewed by: kib


# 17f6a17b 02-Jan-2011 Alan Cox <alc@FreeBSD.org>

Release the page lock early in vm_pageout_clean(). There is no reason to
hold this lock until the end of the function.

With the aforementioned change to vm_pageout_clean(), page locks don't need
to support recursive (MTX_RECURSE) or duplicate (MTX_DUPOK) acquisitions.

Reviewed by: kib


# 3280870d 28-Dec-2010 Konstantin Belousov <kib@FreeBSD.org>

Move the increment of vm object generation count into
vm_object_set_writeable_dirty().

Fix an issue where a restart of the scan in vm_object_page_clean() did
not remove write permissions for newly added pages or for an already
scanned page whose mapping changed to writeable due to a fault.
Merge the two loops in vm_object_page_clean(), doing the removal of
write permission and the cleaning in the same loop. The restart of the
loop then correctly downgrades writeable mappings.

Fix an issue where a second caller to msync() might actually return
before the first caller had actually completed flushing the
pages. Clear the OBJ_MIGHTBEDIRTY flag after the cleaning loop, not
before.

Calls to pmap_is_modified() are not needed after pmap_remove_write()
there.

Proposed, reviewed and tested by: alc
MFC after: 1 week


# 8c22654d 17-Dec-2010 Alan Cox <alc@FreeBSD.org>

Implement and use a single optimized function for unholding a set of pages.

Reviewed by: kib@


# 48772ca4 09-Dec-2010 Jayachandran C. <jchandra@FreeBSD.org>

Revert the vm/vm_page.c change in r216317.

This adds back changes in r216141, which was reverted by the above
check in.


# aa93efed 08-Dec-2010 Jayachandran C. <jchandra@FreeBSD.org>

swi_vm() for mips.


# 6f1a8765 02-Dec-2010 Warner Losh <imp@FreeBSD.org>

To make minidumps work properly on mips for memory that's direct
mapped and entered via vm_page_setup, keep track of it like we do
for amd64.

# A separate commit will be made to move this to a capability-based ifdef
# rather than arch-based ifdef.

Submitted by: alc@
MFC after: 1 week


# 05cb58f6 30-Nov-2010 Alan Cox <alc@FreeBSD.org>

Correct an error in the allocation of the vm_page_dump array in
vm_page_startup(). Specifically, the dump_avail array should be used
instead of the phys_avail array to calculate the size of vm_page_dump. For
example, the pages for the message buffer are allocated prior to
vm_page_startup() by subtracting them from the last entry in the phys_avail
array, but the first thing that vm_page_startup() does after creating the
vm_page_dump array is to set the bits corresponding to the message buffer
pages in that array. However, these bits might not actually exist in the
array, because the size of the array is determined by the current value in
the last entry of the phys_avail array. In general, the only reason why
this doesn't always result in an out-of-bounds array access is that the size
of the vm_page_dump array is rounded up to the next page boundary. This
change eliminates that dependence on rounding (and luck).
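
The sizing problem can be made concrete with a small userland sketch. It assumes, purely for illustration, that each array holds (start, end) physical address pairs terminated by a zero end address and that the bitmap has one bit per page frame below the highest end address; the `sk_` names and the 4096-byte page size are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_PAGE_SIZE	4096u

/* Number of page frames below the highest end address in "avail". */
static uint64_t
sk_page_range(const uint64_t *avail)
{
	uint64_t last_pa = 0;

	for (size_t i = 0; avail[i + 1] != 0; i += 2)
		if (avail[i + 1] > last_pa)
			last_pa = avail[i + 1];
	return (last_pa / SK_PAGE_SIZE);
}

/* Bytes needed for a bitmap with one bit per page frame (no page rounding). */
static size_t
sk_dump_bitmap_bytes(const uint64_t *avail)
{
	return ((size_t)((sk_page_range(avail) + 7) / 8));
}

/* Nonzero if the bit for page-aligned address "pa" exists in that bitmap. */
static int
sk_bit_exists(const uint64_t *avail, uint64_t pa)
{
	assert(pa % SK_PAGE_SIZE == 0);
	return (pa / SK_PAGE_SIZE < sk_page_range(avail));
}
```

If pages are carved off the end of one array before the bitmap is sized from it, the bits for those pages fall outside the unrounded bitmap, while an array that still covers them yields a bitmap that contains them.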

MFC after: 6 weeks


# aa546366 27-Nov-2010 Jayachandran C. <jchandra@FreeBSD.org>

Fix issue noted by alc while reviewing r215938:
The current implementation of vm_page_alloc_freelist() does not handle
order > 0 correctly. Remove order parameter to the function and use it
only for order 0 pages.

Submitted by: alc


# 00f8bffc 19-Nov-2010 Alan Cox <alc@FreeBSD.org>

Reduce the amount of detail printed by vm_page_free_toq() when it panics.

Reviewed by: kib


# 4166faae 18-Nov-2010 Konstantin Belousov <kib@FreeBSD.org>

Only increment object generation count when inserting the page into
object page list. The only use of object generation count now is a
restart of the scan in vm_object_page_clean(), which makes sense to do
on the page addition. Neither page removals nor manipulations of the
shadow chain affect the dirtiness of the object.

Suggested and reviewed by: alc
MFC after: 1 week


# 903ba3da 06-Nov-2010 Oleksandr Tymoshenko <gonzo@FreeBSD.org>

- Add minidump support for FreeBSD/mips


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# a9b89cf1 03-Sep-2010 Andriy Gapon <avg@FreeBSD.org>

vm_page.c: include opt_msgbuf.h for MSGBUF_SIZE use in vm_page_startup

vm_page_startup uses MSGBUF_SIZE value for adding msgbuf pages to minidump.
If opt_msgbuf.h is not included and MSGBUF_SIZE is overridden in kernel
config, then not all msgbuf pages will be dumped. And most importantly,
struct msgbuf itself will not be included. Thus the dump would look
corrupted/incomplete to tools like kgdb, dmesg, etc that try to access
struct msgbuf as one of the first things they do when working on a crash
dump.

MFC after: 5 days


# 49ca10d4 21-Jul-2010 Jayachandran C. <jchandra@FreeBSD.org>

Redo the page table page allocation on MIPS, as suggested by
alc@.

The UMA zone based allocation is replaced by a scheme that creates
a new free page list for the KSEG0 region, and a new function
in sys/vm that allocates pages from a specific free page list.

This also fixes a race condition introduced by the UMA based page table
page allocation code. Dropping the page queue and pmap locks before
the call to uma_zfree, and re-acquiring them afterwards will introduce
a race condition (noted by alc@).

The changes are :
- Revert the earlier changes in MIPS pmap.c that added UMA zone for
page table pages.
- Add a new freelist VM_FREELIST_HIGHMEM to MIPS vmparam.h for memory that
is not directly mapped (in 32bit kernel). Normal page allocations will first
try the HIGHMEM freelist and then the default (direct mapped) freelist.
- Add a new function 'vm_page_t vm_page_alloc_freelist(int flind, int
order, int req)' to vm/vm_page.c to allocate a page from a specified
freelist. The MIPS page table pages will be allocated using this function
from the freelist containing direct mapped pages.
- Move the page initialization code from vm_phys_alloc_contig() to a
new function vm_page_alloc_init(), and use this function to initialize
pages in vm_page_alloc_freelist() too.
- Split the function vm_phys_alloc_pages(int pool, int order) to create
vm_phys_alloc_freelist_pages(int flind, int pool, int order), and use
this function from both vm_page_alloc_freelist() and vm_phys_alloc_pages().

Reviewed by: alc


# b99348e5 09-Jul-2010 Alan Cox <alc@FreeBSD.org>

Add support for the VM_ALLOC_COUNT() hint to vm_page_alloc(). Consequently,
the maintenance of vm_pageout_deficit can be localized to just two places:
vm_page_alloc() and vm_pageout_scan().

This change also corrects an off-by-one error in the maintenance of
vm_pageout_deficit. Historically, the buffer cache functions, allocbuf()
and vm_hold_load_pages(), have not taken into account that vm_page_alloc()
already increments vm_pageout_deficit by one.
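
One way to picture the hint is as a count packed into the upper bits of the request word, with the allocator alone updating the deficit. The sketch below is an assumption-laden model: the `SK_` macros, the 16-bit shift, and `sk_note_shortage()` are hypothetical names for illustration and are not taken from the kernel sources.

```c
#include <assert.h>

/* Hypothetical encoding: the hint lives in the bits above the request class. */
#define SK_ALLOC_NORMAL		0x0000u
#define SK_ALLOC_COUNT_SHIFT	16
#define SK_ALLOC_COUNT(n)	((unsigned)(n) << SK_ALLOC_COUNT_SHIFT)

static unsigned sk_pageout_deficit;

/*
 * Called by the allocator when it cannot satisfy a request.  The deficit
 * grows by the caller's hint, or by one when no hint was supplied, so the
 * callers never touch the counter themselves.
 */
static void
sk_note_shortage(unsigned req)
{
	unsigned hint = req >> SK_ALLOC_COUNT_SHIFT;

	assert(hint <= 0xffffu);
	sk_pageout_deficit += (hint > 1) ? hint : 1;
}
```

Keeping the increment in one place is what removes the double counting: a caller that wants eight pages passes a hint of eight instead of adding eight on top of the allocator's own increment of one.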

Reviewed by: kib


# 1d9e77f6 08-Jul-2010 Konstantin Belousov <kib@FreeBSD.org>

Make VM_ALLOC_RETRY flag mandatory for vm_page_grab(). Assert that the
flag is always provided, and unconditionally retry after sleep for the
busy page or failed allocation.

The intent is to remove VM_ALLOC_RETRY eventually.

Proposed and reviewed by: alc


# 5f195aa3 05-Jul-2010 Konstantin Belousov <kib@FreeBSD.org>

Add the ability for the allocflag argument of the vm_page_grab() to
specify the increment of vm_pageout_deficit when sleeping due to page
shortage. Then, in allocbuf(), the code to allocate pages when extending
vmio buffer can be replaced by a call to vm_page_grab().

Suggested and reviewed by: alc
MFC after: 2 weeks


# b382c10a 04-Jul-2010 Konstantin Belousov <kib@FreeBSD.org>

Introduce a helper function vm_page_find_least(). Use it in several places
that previously open-coded the same logic.

Reviewed by: alc
Tested by: pho
MFC after: 1 week


# b64400a0 03-Jul-2010 Alan Cox <alc@FreeBSD.org>

Improve the comment and man page for vm_page_alloc(). Specifically,
document one of the optional flags; clarify which of the flags are
optional (and which are not), and remove mention of a restriction on
the reclamation of cached pages that no longer holds since version 7.

MFC after: 1 week


# 9cf51988 02-Jul-2010 Alan Cox <alc@FreeBSD.org>

With the demise of page coloring, the page queue macros no longer serve any
useful purpose. Eliminate them.

Reviewed by: kib


# 91b4f427 21-Jun-2010 Alan Cox <alc@FreeBSD.org>

Introduce vm_page_next() and vm_page_prev(), and use them in
vm_pageout_clean(). When iterating over a range of pages, these functions
can be cheaper than vm_page_lookup() because their implementation takes
advantage of the vm_object's memq being ordered.
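
The idea can be sketched with a plain doubly linked list that is kept sorted by page index. The struct and function names below are hypothetical simplifications, not the kernel's definitions; the point is only that a neighbour check replaces a full lookup.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page in an object's list, which is sorted by pindex. */
struct page {
	unsigned long	 pindex;
	struct page	*next, *prev;
};

/* Return the page at pindex + 1 if it is resident, else NULL. */
static struct page *
page_next(struct page *m)
{
	struct page *n;

	assert(m != NULL);
	n = m->next;
	if (n != NULL && n->pindex != m->pindex + 1)
		n = NULL;
	return (n);
}

/* Return the page at pindex - 1 if it is resident, else NULL. */
static struct page *
page_prev(struct page *m)
{
	struct page *p;

	assert(m != NULL);
	p = m->prev;
	if (p != NULL && p->pindex != m->pindex - 1)
		p = NULL;
	return (p);
}
```

Because the list is ordered, the page with the adjacent index can only be the immediate list neighbour, so each call costs one pointer dereference and one comparison.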

Reviewed by: kib@
MFC after: 3 weeks


# 9ee2165f 14-Jun-2010 Alan Cox <alc@FreeBSD.org>

Eliminate checks for a page having a NULL object in vm_pageout_scan()
and vm_pageout_page_stats(). These checks were recently introduced by
the first page locking commit, r207410, but they are not needed. At
the same time, eliminate some redundant accesses to the page's object
field. (These accesses should have been eliminated by r207410.)

Make the assertion in vm_page_flag_set() stricter. Specifically, only
managed pages should have PG_WRITEABLE set.

Add a comment documenting an assertion to vm_page_flag_clear().

It has long been the case that fictitious pages have their wire count
permanently set to one. Add comments to vm_page_wire() and
vm_page_unwire() documenting this. Add assertions to these functions
as well.

Update the comment describing vm_page_unwire(). Much of the old
comment had little to do with vm_page_unwire(), but a lot to do with
_vm_page_deactivate(). Move relevant parts of the old comment to
_vm_page_deactivate().

Only pages that belong to an object can be paged out. Therefore, it
is pointless for vm_page_unwire() to acquire the page queues lock and
enqueue such pages in one of the paging queues. Generally speaking,
such pages are immediately freed after the call to vm_page_unwire().
Previously, it was the call to vm_page_free() that reacquired the page
queues lock and removed these pages from the paging queues. Now, we
will never acquire the page queues lock for this case. (It is also
worth noting that since both vm_page_unwire() and vm_page_free()
occurred with the page locked, the page daemon never saw the page with
its object field set to NULL.)

Change the panic in vm_page_unwire() to provide a more precise message.

Reviewed by: kib@


# ce186587 10-Jun-2010 Alan Cox <alc@FreeBSD.org>

Reduce the scope of the page queues lock and the number of
PG_REFERENCED changes in vm_pageout_object_deactivate_pages().
Simplify this function's inner loop using TAILQ_FOREACH(), and shorten
some of its overly long lines. Update a stale comment.

Assert that PG_REFERENCED may be cleared only if the object containing
the page is locked. Add a comment documenting this.

Assert that a caller to vm_page_requeue() holds the page queues lock,
and assert that the page is on a page queue.

Push down the page queues lock into pmap_ts_referenced() and
pmap_page_exists_quick(). (As of now, there are no longer any pmap
functions that expect to be called with the page queues lock held.)

Neither pmap_ts_referenced() nor pmap_page_exists_quick() should ever
be passed an unmanaged page. Assert this rather than returning "0"
and "FALSE" respectively.

ARM:

Simplify pmap_page_exists_quick() by switching to TAILQ_FOREACH().

Push down the page queues lock inside of pmap_clearbit(), simplifying
pmap_clear_modify(), pmap_clear_reference(), and pmap_remove_write().
Additionally, this allows for avoiding the acquisition of the page
queues lock in some cases.

PowerPC/AIM:

moea*_page_exists_quick() and moea*_page_wired_mappings() will never be
called before pmap initialization is complete. Therefore, the check
for moea_initialized can be eliminated.

Push down the page queues lock inside of moea*_clear_bit(),
simplifying moea*_clear_modify() and moea*_clear_reference().

The last parameter to moea*_clear_bit() is never used. Eliminate it.

PowerPC/BookE:

Simplify mmu_booke_page_exists_quick()'s control flow.

Reviewed by: kib@


# 2bbfbc3f 03-Jun-2010 Konstantin Belousov <kib@FreeBSD.org>

Add assertion and comment in vm_page_flag_set() describing the expectations
when the PG_WRITEABLE flag is set.

Reviewed by: alc


# f4e10cda 02-Jun-2010 Alan Cox <alc@FreeBSD.org>

Maintain the pretense that we support 32KB pages for the sake of the ia64
LINT build.


# c8fa8709 02-Jun-2010 Alan Cox <alc@FreeBSD.org>

Minimize the use of the page queues lock for synchronizing access to the
page's dirty field. With the exception of one case, access to this field
is now synchronized by the object lock.


# c46b90e9 26-May-2010 Alan Cox <alc@FreeBSD.org>

Push down page queues lock acquisition in pmap_enter_object() and
pmap_is_referenced(). Eliminate the corresponding page queues lock
acquisitions from vm_map_pmap_enter() and mincore(), respectively. In
mincore(), this allows some additional cases to complete without ever
acquiring the page queues lock.

Assert that the page is managed in pmap_is_referenced().

On powerpc/aim, push down the page queues lock acquisition from
moea*_is_modified() and moea*_is_referenced() into moea*_query_bit().
Again, this will allow some additional cases to complete without ever
acquiring the page queues lock.

Reorder a few statements in vm_page_dontneed() so that a race can't lead
to an old reference persisting. This scenario is described in detail by a
comment.

Correct a spelling error in vm_page_dontneed().

Assert that the object is locked in vm_page_clear_dirty(), and restrict the
page queues lock assertion to just those cases in which the page is
currently writeable.

Add object locking to vnode_pager_generic_putpages(). This was the one
and only place where vm_page_clear_dirty() was being called without the
object being locked.

Eliminate an unnecessary vm_page_lock() around vnode_pager_setsize()'s call
to vm_page_clear_dirty().

Change vnode_pager_generic_putpages() to the modern-style of function
definition. Also, change the name of one of the parameters to follow
virtual memory system naming conventions.

Reviewed by: kib


# e98d019d 24-May-2010 Alan Cox <alc@FreeBSD.org>

Eliminate the acquisition and release of the page queues lock from
vfs_busy_pages(). It is no longer needed.

Submitted by: kib


# 567e51e1 24-May-2010 Alan Cox <alc@FreeBSD.org>

Roughly half of a typical pmap_mincore() implementation is machine-
independent code. Move this code into mincore(), and eliminate the
page queues lock from pmap_mincore().

Push down the page queues lock into pmap_clear_modify(),
pmap_clear_reference(), and pmap_is_modified(). Assert that these
functions are never passed an unmanaged page.

Eliminate an inaccurate comment from powerpc/powerpc/mmu_if.m:
Contrary to what the comment says, pmap_mincore() is not simply an
optimization. Without a complete pmap_mincore() implementation,
mincore() cannot return either MINCORE_MODIFIED or MINCORE_REFERENCED
because only the pmap can provide this information.

Eliminate the page queues lock from vfs_setdirty_locked_object(),
vm_pageout_clean(), vm_object_page_collect_flush(), and
vm_object_page_clean(). Generally speaking, these are all accesses
to the page's dirty field, which are synchronized by the containing
vm object's lock.

Reduce the scope of the page queues lock in vm_object_madvise() and
vm_page_dontneed().

Reviewed by: kib (an earlier version)


# aa12e8b7 18-May-2010 Alan Cox <alc@FreeBSD.org>

The page queues lock is no longer required by vm_page_set_invalid(), so
eliminate it.

Assert that the object containing the page is locked in
vm_page_test_dirty(). Perform some style clean up while I'm here.

Reviewed by: kib


# 9ab6032f 16-May-2010 Alan Cox <alc@FreeBSD.org>

On entry to pmap_enter(), assert that the page is busy. While I'm
here, make the style of assertion used by pmap_enter() consistent
across all architectures.

On entry to pmap_remove_write(), assert that the page is neither
unmanaged nor fictitious, since we cannot remove write access to
either kind of page.

With the push down of the page queues lock, pmap_remove_write() cannot
condition its behavior on the state of the PG_WRITEABLE flag if the
page is busy. Assert that the object containing the page is locked.
This allows us to know that the page will neither become busy nor will
PG_WRITEABLE be set on it while pmap_remove_write() is running.

Correct a long-standing bug in vm_page_cowsetup(). We cannot possibly
do copy-on-write-based zero-copy transmit on unmanaged or fictitious
pages, so don't even try. Previously, the call to pmap_remove_write()
would have failed silently.


# a4bc2c89 16-May-2010 Alan Cox <alc@FreeBSD.org>

Correct an error of omission in r202897: Now that amd64 uses the direct map
to access the message buffer, we must explicitly request that the underlying
physical pages are included in a crash dump.

Reported by: Benjamin Kaduk


# eee9d992 09-May-2010 Alan Cox <alc@FreeBSD.org>

Push down the acquisition of the page queues lock into vm_pageq_remove().
(This eliminates a surprising number of page queues lock acquisitions by
vm_fault() because the page's queue is PQ_NONE and thus the page queues
lock is not needed to remove the page from a queue.)


# 34e7251f 08-May-2010 Alan Cox <alc@FreeBSD.org>

Minimize the scope of the page queues lock in vm_fault().


# 3c4a2440 08-May-2010 Alan Cox <alc@FreeBSD.org>

Push down the page queues lock into vm_page_cache(), vm_page_try_to_cache(), and
vm_page_try_to_free(). Consequently, push down the page queues lock into
pmap_enter_quick(), pmap_page_wired_mapped(), pmap_remove_all(), and
pmap_remove_write().

Push down the page queues lock into Xen's pmap_page_is_mapped(). (I
overlooked the Xen pmap in r207702.)

Switch to a per-processor counter for the total number of pages cached.


# 03679e23 07-May-2010 Alan Cox <alc@FreeBSD.org>

Push down the page queues lock into vm_page_activate().


# 9402dff3 06-May-2010 Alan Cox <alc@FreeBSD.org>

Push down the page queues lock into vm_page_deactivate(). Eliminate an
incorrect comment.


# 7024db1d 06-May-2010 Alan Cox <alc@FreeBSD.org>

Push down the page queues lock inside of vm_page_free_toq() and
pmap_page_is_mapped() in preparation for removing page queues locking
around calls to vm_page_free(). Setting aside the assertion that calls
pmap_page_is_mapped(), vm_page_free_toq() now acquires and holds the page
queues lock just long enough to actually add or remove the page from the
paging queues.

Update vm_page_unhold() to reflect the above change.


# 5ac59343 05-May-2010 Alan Cox <alc@FreeBSD.org>

Acquire the page lock around all remaining calls to vm_page_free() on
managed pages that didn't already have that lock held. (Freeing an
unmanaged page, such as the various pmaps use, doesn't require the page
lock.)

This allows a change in vm_page_remove()'s locking requirements. It now
expects the page lock to be held instead of the page queues lock.
Consequently, the page queues lock is no longer required at all by callers
to vm_page_rename().

Discussed with: kib


# e3ef0d2f 04-May-2010 Alan Cox <alc@FreeBSD.org>

Push down the acquisition of the page queues lock into vm_page_unwire().

Update the comment describing which lock should be held on entry to
vm_page_wire().

Reviewed by: kib


# a7283d32 04-May-2010 Alan Cox <alc@FreeBSD.org>

Add page locking to the vm_page_cow* functions.

Push down the acquisition and release of the page queues lock into
vm_page_wire().

Reviewed by: kib


# 0c41a69e 03-May-2010 Alan Cox <alc@FreeBSD.org>

Add lock assertions.


# 2d5d7f7f 03-May-2010 Alan Cox <alc@FreeBSD.org>

Acquire the page lock around vm_page_wire() in vm_page_grab().

Assert that the page lock is held in vm_page_wire().


# 9f2512ba 03-May-2010 Alan Cox <alc@FreeBSD.org>

Assert that the page queues lock is held in vm_page_remove() and
vm_page_unwire() only if the page is managed, i.e., pageable.


# b8d36afc 02-May-2010 Alan Cox <alc@FreeBSD.org>

Add page lock assertions where we access the page's hold_count.


# b88b6c9d 02-May-2010 Alan Cox <alc@FreeBSD.org>

It makes no sense for vm_page_sleep_if_busy()'s helper, vm_page_sleep(),
to unconditionally set PG_REFERENCED on a page before sleeping. In many
cases, it's perfectly ok for the page to disappear, i.e., be reclaimed by
the page daemon, before the caller to vm_page_sleep() is reawakened.
Instead, we now explicitly set PG_REFERENCED in those cases where having
the page persist until the caller is awakened is clearly desirable. Note,
however, that setting PG_REFERENCED on the page is still only a hint,
and not a guarantee that the page should persist.


# 6d74d042 29-Apr-2010 Kip Macy <kmacy@FreeBSD.org>

don't allow unsynchronized free in vm_page_unhold


# 2965a453 29-Apr-2010 Kip Macy <kmacy@FreeBSD.org>

On Alan's advice, rather than do a wholesale conversion on a single
architecture from page queue lock to a hashed array of page locks
(based on a patch by Jeff Roberson), I've implemented page lock
support in the MI code and have only moved vm_page's hold_count
out from under page queue mutex to page lock. This changes
pmap_extract_and_hold on all pmaps.

Supported by: Bitgravity Inc.

Discussed with: alc, jeffr, and kib


# 55146b83 09-Apr-2010 Alan Cox <alc@FreeBSD.org>

MFC r206174
vm_reserv_alloc_page() should never be called on an OBJT_SG object, just
as it is never called on an OBJT_DEVICE object. (This change should have
been included in r195840.)


# f6d00b38 05-Apr-2010 Alan Cox <alc@FreeBSD.org>

vm_reserv_alloc_page() should never be called on an OBJT_SG object, just as
it is never called on an OBJT_DEVICE object. (This change should have been
included in r195840.)

Reported by: dougb@, avg@
MFC after: 3 days


# 95e5a090 02-Mar-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r204415:
Update comment for vm_page_alloc(9), listing all acceptable flags [1].
Note that although the function does not sleep, it can block.


# ddb16cfc 27-Feb-2010 Konstantin Belousov <kib@FreeBSD.org>

Update comment for vm_page_alloc(9), listing all acceptable flags [1].
Note that although the function does not sleep, it can block.

Submitted by: Giovanni Trematerra <giovanni.trematerra gmail com> [1]
MFC after: 3 days


# e67e0775 04-Oct-2009 Alan Cox <alc@FreeBSD.org>

Align and pad the page queue and free page queue locks so that the linker
can't possibly place them together within the same cache line.

MFC after: 3 weeks
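
The padding idea described above can be sketched in portable C11. This is only
an illustration, not the code from the commit: the 64-byte line size, the
stand-in lock type, and every sketch_* name are assumptions made for the
example.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size; a kernel would take this from a machine header. */
#define	SKETCH_CACHE_LINE	64

/* Hypothetical stand-in for a mutex; the real lock type is not modeled. */
struct sketch_lock {
	volatile int locked;
};

/*
 * Aligning the only member to the cache line size aligns the whole
 * structure, and sizeof is rounded up to a multiple of that alignment,
 * so two distinct instances can never share a cache line.
 */
struct sketch_padded_lock {
	alignas(SKETCH_CACHE_LINE) struct sketch_lock lock;
};

static_assert(sizeof(struct sketch_padded_lock) % SKETCH_CACHE_LINE == 0,
    "padded lock must occupy whole cache lines");

static struct sketch_padded_lock sketch_queue_lock;
static struct sketch_padded_lock sketch_free_queue_lock;

/* Return the index of the cache line that contains the given address. */
static uintptr_t
sketch_line_of(const void *p)
{

	return ((uintptr_t)p / SKETCH_CACHE_LINE);
}

/* Return nonzero if the two locks live in different cache lines. */
static int
sketch_locks_are_separated(void)
{

	return (sketch_line_of(&sketch_queue_lock) !=
	    sketch_line_of(&sketch_free_queue_lock));
}
```

Placing the alignment on the type rather than on each variable means the
linker's placement decisions cannot undo the separation.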


# 01381811 24-Jul-2009 John Baldwin <jhb@FreeBSD.org>

Add a new type of VM object: OBJT_SG. An OBJT_SG object is very similar to
a device pager (OBJT_DEVICE) object in that it uses fictitious pages to
provide aliases to other memory addresses. The primary difference is that
it uses an sglist(9) to determine the physical addresses for a given offset
into the object instead of invoking the d_mmap() method in a device driver.

Reviewed by: alc
Approved by: re (kensmith)
MFC after: 2 weeks


# 13de7221 17-Jul-2009 Alan Cox <alc@FreeBSD.org>

An addendum to r195649, "Add support to the virtual memory system for
configuring machine-dependent memory attributes...":

Don't set the memory attribute for a "real" page that is allocated to
a device object in vm_page_alloc(). It is a pointless act, because
the device pager replaces this "real" page with a "fake" page and sets
the memory attribute on that "fake" page.

Eliminate pointless code from pmap_cache_bits() on amd64.

Employ the "Self Snoop" feature supported by some x86 processors to
avoid cache flushes in the pmap.

Approved by: re (kib)


# 3153e878 12-Jul-2009 Alan Cox <alc@FreeBSD.org>

Add support to the virtual memory system for configuring machine-
dependent memory attributes:

Rename vm_cache_mode_t to vm_memattr_t. The new name reflects the
fact that there are machine-dependent memory attributes that have
nothing to do with controlling the cache's behavior.

Introduce vm_object_set_memattr() for setting the default memory
attributes that will be given to an object's pages.

Introduce and use pmap_page_{get,set}_memattr() for getting and
setting a page's machine-dependent memory attributes. Add full
support for these functions on amd64 and i386 and stubs for them on
the other architectures. The function pmap_page_set_memattr() is also
responsible for any other machine-dependent aspects of changing a
page's memory attributes, such as flushing the cache or updating the
direct map. The uses include kmem_alloc_contig(), vm_page_alloc(),
and the device pager:

kmem_alloc_contig() can now be used to allocate kernel memory with
non-default memory attributes on amd64 and i386.

vm_page_alloc() and the device pager will set the memory attributes
for the real or fictitious page according to the object's default
memory attributes.

Update the various pmap functions on amd64 and i386 that map pages to
incorporate each page's memory attributes in the mapping.

Notes: (1) Inherent to this design are safety features that prevent
the specification of inconsistent memory attributes by different
mappings on amd64 and i386. In addition, the device pager provides a
warning when a device driver creates a fictitious page with memory
attributes that are inconsistent with the real page that the
fictitious page is an alias for. (2) Storing the machine-dependent
memory attributes for amd64 and i386 as a dedicated "int" in "struct
md_page" represents a compromise between space efficiency and the ease
of MFCing these changes to RELENG_7.

In collaboration with: jhb

Approved by: re (kib)


# 6f0489c6 20-Jun-2009 Alan Cox <alc@FreeBSD.org>

Strive for greater consistency among the places that implement real,
fictitious, and contiguous page allocation. Eliminate unnecessary
reinitialization of a page's fields.


# edd16ab1 30-May-2009 Alan Cox <alc@FreeBSD.org>

Add assertions in two places where a page's valid or dirty bits are changed.


# 1c1b26f2 12-May-2009 Alan Cox <alc@FreeBSD.org>

Eliminate page queues locking from bufdone_finish() through the
following changes:

Rename vfs_page_set_valid() to vfs_page_set_validclean() to reflect
what this function actually does. Suggested by: tegge

Introduce a new version of vfs_page_set_valid() that does no more than
what the function's name implies. Specifically, it does not update
the page's dirty mask, and thus it does not require the page queues
lock to be held.

Update two of the three callers to the old vfs_page_set_valid() to
call vfs_page_set_validclean() instead because they actually require
the page's dirty mask to be cleared.

Introduce vm_page_set_valid().

Reviewed by: tegge


# 641e2829 03-Jan-2009 Konstantin Belousov <kib@FreeBSD.org>

Extend the struct vm_page wire_count to u_int to avoid the overflow
of the counter, that may happen when too many sendfile(2) calls are
being executed with this vnode [1].

To keep the size of the struct vm_page and offsets of the fields
accessed by out-of-tree modules, swap the types and locations
of the wire_count and cow fields. Add safety checks to detect cow
overflow and force fallback to the normal copy code for zero-copy
sockets. [2]

Reported by: Anton Yuzhaninov <citrin citrin ru> [1]
Suggested by: alc [2]
Reviewed by: alc
MFC after: 2 weeks


# 8e321b79 06-Nov-2008 Rafal Jaworowski <raj@FreeBSD.org>

Support kernel crash mini dumps on ARM architecture.

Obtained from: Juniper Networks, Semihalf


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# a8a478fc 26-Sep-2008 Ed Maste <emaste@FreeBSD.org>

Move CTASSERT from header file to source file, per implementation note now
in the CTASSERT man page.


# 4b34502e 17-Aug-2008 Kip Macy <kmacy@FreeBSD.org>

Work around differences in page allocation for initial page tables on xen

MFC after: 1 month


# 8bcd3b19 06-Jun-2008 Alan Cox <alc@FreeBSD.org>

Essentially, neither madvise(..., MADV_DONTNEED) nor madvise(..., MADV_FREE)
works. (Moreover, I don't believe that they have ever worked as intended.)
The explanation is fairly simple. Both MADV_DONTNEED and MADV_FREE perform
vm_page_dontneed() on each page within the range given to madvise(). This
function moves the page to the inactive queue. Specifically, if the page is
clean, it is moved to the head of the inactive queue where it is first in
line for processing by the page daemon. On the other hand, if it is dirty,
it is placed at the tail. Let's further examine the case in which the page
is clean. Recall that the page is at the head of the line for processing by
the page daemon. The expectation of vm_page_dontneed()'s author was that
the page would be transferred from the inactive queue to the cache queue by
the page daemon. (Once the page is in the cache queue, it is, in effect,
free, that is, it can be reallocated to a new vm object by vm_page_alloc()
if it isn't reactivated quickly enough by a user of the old vm object.) The
trouble is that nowhere in the execution of either MADV_DONTNEED or
MADV_FREE is either the machine-independent reference flag (PG_REFERENCED)
or the reference bit in any page table entry (PTE) mapping the page cleared.
Consequently, the immediate reaction of the page daemon is to reactivate the
page because it is referenced. In effect, the madvise() was for naught.
The case in which the page was dirty is not too different. Instead of being
laundered, the page is reactivated.

Note: The essential difference between MADV_DONTNEED and MADV_FREE is
that MADV_FREE clears a page's dirty field. So, MADV_FREE is always
executing the clean case above.

This revision changes vm_page_dontneed() to clear both the machine-
independent reference flag (PG_REFERENCED) and the reference bit in all PTEs
mapping the page.

MFC after: 6 weeks


# f5788387 15-May-2008 Alan Cox <alc@FreeBSD.org>

Don't call vm_reserv_alloc_page() on device-backed objects. Otherwise, the
system may panic because there is no reservation structure corresponding to
the physical address of the device memory.

Reported by: Giorgos Keramidas


# 44aab2c3 06-Apr-2008 Alan Cox <alc@FreeBSD.org>

Introduce vm_reserv_reclaim_contig(). This function is used by
contigmalloc(9) as a last resort to steal pages from an inactive,
partially-used superpage reservation.

Rename vm_reserv_reclaim() to vm_reserv_reclaim_inactive() and
refactor it so that a separate subroutine is responsible for breaking
the selected reservation. This subroutine is also used by
vm_reserv_reclaim_contig().


# e5b006ff 19-Mar-2008 Alan Cox <alc@FreeBSD.org>

Rename vm_pageq_requeue() to vm_page_requeue() on account of its recent
migration to vm/vm_page.c.


# 1fa94a36 18-Mar-2008 Alan Cox <alc@FreeBSD.org>

Almost seven years ago, vm/vm_page.c was split into three parts:
vm/vm_contig.c, vm/vm_page.c, and vm/vm_pageq.c. Today, vm/vm_pageq.c
has withered to the point that it contains only four short functions,
two of which are only used by vm/vm_page.c. Since I can't foresee any
reason for vm/vm_pageq.c to grow, it is time to fold the remaining
contents of vm/vm_pageq.c back into vm/vm_page.c.

Add some comments. Rename one of the functions, vm_pageq_enqueue(),
that is now static within vm/vm_page.c to vm_page_enqueue().
Eliminate PQ_MAXCOUNT as it no longer serves any purpose.


# 273bf93c 01-Jan-2008 Alan Cox <alc@FreeBSD.org>

Defer setting either PG_CACHED or PG_FREE until after the free page
queues lock is acquired. Otherwise, the state of a reservation's
pages' flags and its population count can be inconsistent. That could
result in a page being freed twice.

Reported by: kris


# f8a47341 29-Dec-2007 Alan Cox <alc@FreeBSD.org>

Add the superpage reservation system. This is "part 2 of 2" of the
machine-independent support for superpages. (The earlier part was
the rewrite of the physical memory allocator.) The remainder of the
code required for superpages support is machine-dependent and will
be added to the various pmap implementations at a later date.

Initially, I am only supporting one large page size per architecture.
Moreover, I am only enabling the reservation system on amd64. (In
an emergency, it can be disabled by setting VM_NRESERVLEVELS to 0
in amd64/include/vmparam.h or your kernel configuration file.)


# e35395ce 20-Dec-2007 Alan Cox <alc@FreeBSD.org>

Modify vm_phys_unfree_page() so that it no longer requires the given
page to be in the free lists. Instead, it now returns TRUE if it
removed the page from the free lists and FALSE if the page was not
in the free lists.

This change is required to support superpage reservations. Specifically,
once reservations are introduced, a cached page can either be in the
free lists or a reservation.


# 03497757 18-Dec-2007 Alan Cox <alc@FreeBSD.org>

Eliminate redundant code from vm_page_startup().


# 21e10ad4 11-Dec-2007 Alan Cox <alc@FreeBSD.org>

Simplify vm_page_free_toq().


# b6408256 02-Dec-2007 Alan Cox <alc@FreeBSD.org>

Correct a comment.


# ddd6e7d2 21-Nov-2007 Alan Cox <alc@FreeBSD.org>

When reactivating a cached page, reset the page's pool to the default
pool. (Not doing this before was a performance pessimization but not
a cause for panic.)


# aefac177 05-Nov-2007 Konstantin Belousov <kib@FreeBSD.org>

The intent of freeing the (zeroed) page in vm_page_cache() for the
default object rather than caching it was to have
vm_pager_has_page(object, pindex, ...) == FALSE imply that there is
no cached page in object at pindex. This avoids the need for explicit
checks for cached pages in vm_object_backing_scan().

For now, we need the same bandaid for the swap object; otherwise both
vm_page_lookup() and the pager can report that there is no page at the
offset while the page is stored in the cache. Also, this fixes another
instance of the KASSERT("object type is incompatible") failure in
vm_page_cache_transfer().

Reported and tested by: Peter Holm
Reviewed by: alc
MFC after: 3 days


# 21f79586 26-Oct-2007 Alan Cox <alc@FreeBSD.org>

Change vm_page_cache_transfer() such that it does not transfer pages
that would have an offset beyond the end of the target object. Such
pages should remain in the source object.

MFC after: 3 days
Diagnosed and reviewed by: Kostik Belousov
Reported and tested by: Peter Holm


# b8c50480 08-Oct-2007 Alan Cox <alc@FreeBSD.org>

In the rare case that vm_page_cache() actually frees the given page,
it must first ensure that the page is no longer mapped. This is
trivially accomplished by calling pmap_remove_all() a little earlier
in vm_page_cache(). While I'm in the neighborhood, make a related
panic message a little more useful.

Approved by: re (kensmith)
Reported by: Peter Holm and Konstantin Belousov
Reviewed by: Konstantin Belousov


# dc9250f5 07-Oct-2007 Alan Cox <alc@FreeBSD.org>

Correct a lock assertion failure in sparc64's pmap_page_is_mapped() that is
a consequence of sparc64/sparc64/vm_machdep.c revision 1.76. It occurs
when uma_small_free() frees a page. The solution has two parts: (1) Mark
pages allocated with VM_ALLOC_NOOBJ as PG_UNMANAGED. (2) Defer the lock
assertion in pmap_page_is_mapped() until after PG_UNMANAGED is tested.
This is safe because both PG_UNMANAGED and PG_FICTITIOUS are immutable
flags, i.e., they do not change state between the time that a page is
allocated and freed.

Approved by: re (kensmith)
PR: 116794


# c9444914 26-Sep-2007 Alan Cox <alc@FreeBSD.org>

Correct an error of omission in the reimplementation of the page
cache: vm_object_page_remove() should convert any cached pages that
fall within the specified range to free pages. Otherwise, there could
be a problem if a file is first truncated and then regrown.
Specifically, some old data from prior to the truncation might reappear.

Generalize vm_page_cache_free() to support the conversion of either a
subset or the entirety of an object's cached pages.

Reported by: tegge
Reviewed by: tegge
Approved by: re (kensmith)


# 7bfda801 25-Sep-2007 Alan Cox <alc@FreeBSD.org>

Change the management of cached pages (PQ_CACHE) in two fundamental
ways:

(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.

This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.

Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.

(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.

Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.

Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)


# eaa29f1c 27-Jul-2007 Alan Cox <alc@FreeBSD.org>

Add a counter for the total number of pages cached and support for
reporting the value of this counter in the program "vmstat".

Approved by: re (rwatson)


# 8941dc44 14-Jul-2007 Alan Cox <alc@FreeBSD.org>

Eliminate two unused functions: vm_phys_alloc_pages() and
vm_phys_free_pages(). Rename vm_phys_alloc_pages_locked() to
vm_phys_alloc_pages() and vm_phys_free_pages_locked() to
vm_phys_free_pages(). Add comments regarding the need for the free page
queues lock to be held by callers to these functions. No functional
changes.

Approved by: re (hrs)


# 20dd22a2 10-Jul-2007 Alan Cox <alc@FreeBSD.org>

Correct a problem in the ZERO_COPY_SOCKETS option, specifically, in
vm_page_cowfault(). Initially, if vm_page_cowfault() sleeps, the given
page is wired, preventing it from being recycled. However, when
transmission of the page completes, the page is unwired and returned to
the page queues. At that point, the page is not in any special state
that prevents it from being recycled. Consequently, vm_page_cowfault()
should verify that the page is still held by the same vm object before
retrying the replacement of the page. Note: The containing object is,
however, safe from being recycled by virtue of having a non-zero
paging-in-progress count.

While I'm here, add some assertions and comments.

Approved by: re (rwatson)
MFC After: 3 weeks


# 0a49733c 16-Jun-2007 Matt Jacob <mjacob@FreeBSD.org>

Don't declare inline a function which isn't.


# bcc231ec 16-Jun-2007 Alan Cox <alc@FreeBSD.org>

If attempting to cache a "busy" page, panic instead of printing a diagnostic
message and returning.


# 2446e4f0 15-Jun-2007 Alan Cox <alc@FreeBSD.org>

Enable the new physical memory allocator.

This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).

The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.

This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.

Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.

Approved by: re
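
For readers unfamiliar with binary buddy allocation, the address arithmetic at
its core can be sketched as follows. This is a generic textbook sketch, not the
allocator from the commit, and it leaves out the "twist" described above. The
4 KB page size and all sketch_* names are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed base page size of 4 KB. */
#define	SKETCH_PAGE_SIZE	((uint64_t)4096)

/*
 * A free block of order n covers 2^n contiguous pages and is aligned to
 * its own size.  Its "buddy" is the neighboring block of the same size
 * that it would merge with; the two addresses differ in exactly one bit.
 */
static uint64_t
sketch_buddy_of(uint64_t pa, unsigned int order)
{
	uint64_t size;

	size = SKETCH_PAGE_SIZE << order;
	assert((pa & (size - 1)) == 0);	/* block must be size-aligned */
	return (pa ^ size);
}

/*
 * When a block and its buddy are both free, they coalesce into a single
 * block of order n + 1 that starts at the lower of the two addresses.
 */
static uint64_t
sketch_merged_base(uint64_t pa, unsigned int order)
{
	uint64_t size;

	size = SKETCH_PAGE_SIZE << order;
	assert((pa & (size - 1)) == 0);
	return (pa & ~size);
}
```

Repeated coalescing is what lets a buddy allocator rebuild the large, aligned,
physically contiguous runs that superpages and contigmalloc(9) need.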


# 393a081d 10-Jun-2007 Attilio Rao <attilio@FreeBSD.org>

Optimize vmmeter locking.
In particular:
- Add an explanatory table for locking of struct vmmeter members
- Apply new rules for some of those members
- Remove some useless comments

Heavily reviewed by: alc, bde, jeff
Approved by: jeff (mentor)


# b4b70819 04-Jun-2007 Attilio Rao <attilio@FreeBSD.org>

Do proper "locking" for the missing vmmeter parts.
Now, we no longer assume sched_lock protection for some of them and use the
distributed loads method for vmmeter (distributed across CPUs).

Reviewed by: alc, bde
Approved by: jeff (mentor)


# 2feb50bf 31-May-2007 Attilio Rao <attilio@FreeBSD.org>

Revert the introduction of the VMCNT_* operations.
Probably, a general approach is not the best solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)


# 80b200da 20-May-2007 Jeff Roberson <jeff@FreeBSD.org>

- rename VMCNT_DEC to VMCNT_SUB to reflect the count argument.

Suggested by: julian@
Contributed by: attilio@


# 222d0195 18-May-2007 Jeff Roberson <jeff@FreeBSD.org>

- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts. This can be used to abstract away pcpu details but also changes
to use atomics for all counters now. This means sched lock is no longer
responsible for protecting counts in the switch routines.

Contributed by: Attilio Rao <attilio@FreeBSD.org>


# 04a18977 05-May-2007 Alan Cox <alc@FreeBSD.org>

Define every architecture as either VM_PHYSSEG_DENSE or
VM_PHYSSEG_SPARSE depending on whether the physical address space is
densely or sparsely populated with memory. The effect of this
definition is to determine which of two implementations of
vm_page_array and PHYS_TO_VM_PAGE() is used. The legacy
implementation is obtained by defining VM_PHYSSEG_DENSE, and a new
implementation that trades off time for space is obtained by defining
VM_PHYSSEG_SPARSE. For now, all architectures except for ia64 and
sparc64 define VM_PHYSSEG_DENSE. Defining VM_PHYSSEG_SPARSE on ia64
allows the entirety of my Itanium 2's memory to be used. Previously,
only the first 1 GB could be used. Defining VM_PHYSSEG_SPARSE on
sparc64 allows USIIIi-based systems to boot without crashing.

This change is a combination of Nathan Whitehorn's patch and my own
work in perforce.

Discussed with: kmacy, marius, Nathan Whitehorn
PR: 112194
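
The time-for-space trade-off between the two lookup strategies can be
illustrated with a small user-space model. This is a sketch under stated
assumptions, not the kernel's PHYS_TO_VM_PAGE(): the structures, the 4 KB page
size, and all sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	SKETCH_PAGE_SHIFT	12	/* assumed 4 KB pages */

/* Hypothetical stand-in for struct vm_page. */
struct sketch_page {
	uint64_t phys_addr;
};

/* One populated range of physical memory and its page structures. */
struct sketch_seg {
	uint64_t start;			/* inclusive */
	uint64_t end;			/* exclusive */
	struct sketch_page *first_page;
};

/*
 * Dense model: a single array spans everything from base upward,
 * including any holes.  Lookup is one subtraction and one shift, but
 * the array also needs entries for addresses that have no memory.
 */
static struct sketch_page *
sketch_lookup_dense(struct sketch_page *array, uint64_t base,
    uint64_t npages, uint64_t pa)
{
	uint64_t idx;

	if (pa < base)
		return (NULL);
	idx = (pa - base) >> SKETCH_PAGE_SHIFT;
	return (idx < npages ? &array[idx] : NULL);
}

/*
 * Sparse model: each populated range has its own array.  Lookup must
 * first find the range, which costs time, but holes cost no space.
 */
static struct sketch_page *
sketch_lookup_sparse(struct sketch_seg *segs, int nsegs, uint64_t pa)
{
	int i;

	for (i = 0; i < nsegs; i++) {
		if (pa >= segs[i].start && pa < segs[i].end)
			return (&segs[i].first_page[(pa - segs[i].start) >>
			    SKETCH_PAGE_SHIFT]);
	}
	return (NULL);
}
```

With two ranges separated by a multi-gigabyte hole, the dense model would need
page structures for the entire hole, while the sparse model needs only a second
segment entry.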


# 9f5c801b 24-Feb-2007 Alan Cox <alc@FreeBSD.org>

Change the way that unmanaged pages are created. Specifically,
immediately flag any page that is allocated to an OBJT_PHYS object as
unmanaged in vm_page_alloc() rather than waiting for a later call to
vm_page_unmanage(). This allows for the elimination of some uses of
the page queues lock.

Change the type of the kernel and kmem objects from OBJT_DEFAULT to
OBJT_PHYS. This allows us to take advantage of the above change to
simplify the allocation of unmanaged pages in kmem_alloc() and
kmem_malloc().

Remove vm_page_unmanage(). It is no longer used.


# 711585d0 17-Feb-2007 Alan Cox <alc@FreeBSD.org>

Enable vm_page_free() and vm_page_free_zero() to be called on some pages
without the page queues lock being held, specifically, pages that are not
contained in a vm object and not a member of a page queue.


# ba000fb2 17-Feb-2007 Alan Cox <alc@FreeBSD.org>

Remove a stale comment. Add punctuation to a nearby comment.


# d3d029bd 14-Feb-2007 Alan Cox <alc@FreeBSD.org>

Relax the page queue lock assertions in vm_page_remove() and
vm_page_free_toq() to account for recent changes that allow
vm_page_free_toq() to be called on some pages without the page queues lock
being held, specifically, pages that are not contained in a vm object and
not a member of a page queue. (Examples of such pages include page table
pages, pv entry pages, and uma small alloc pages.)


# 7d60988b 14-Feb-2007 Alan Cox <alc@FreeBSD.org>

Avoid the unnecessary acquisition of the free page queues lock when a page
is actually being added to the hold queue, not the free queue. At the same
time, avoid unnecessary tests to wake up threads waiting for free memory
and the idle thread that zeroes free pages. (These tests will be performed
later when the page finally moves from the hold queue to the free queue.)


# 5351a248 10-Feb-2007 Alan Cox <alc@FreeBSD.org>

Use the free page queue mutex instead of the page queue mutex to
synchronize sleeping and waking of the zero idle thread.


# e9f995d8 06-Feb-2007 Alan Cox <alc@FreeBSD.org>

Change the pagedaemon, vm_wait(), and vm_waitpfault() to sleep on the
vm page queue free mutex instead of the vm page queue mutex.


# 3ae3919d 04-Feb-2007 Alan Cox <alc@FreeBSD.org>

Change the free page queue lock from a spin mutex to a default (blocking)
mutex. With the demise of Alpha support, there is no longer a reason for
it to be a spin mutex.


# 35d10226 08-Dec-2006 Kip Macy <kmacy@FreeBSD.org>

Remove the requirement that phys_avail be sorted in ascending order
by explicitly finding the lowest and highest addresses when calculating
the size of the vm_pages array

Reviewed by: alc
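
Finding the bounds of an unsorted range list, as described above, amounts to a
single pass that tracks a running minimum and maximum. The following is a
minimal sketch, not the code from the commit. It assumes the list is an array
of (start, end) pairs terminated by a pair whose end is zero; the type and
function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sketch_paddr_t;

/*
 * Scan an array of (start, end) pairs, terminated by a pair whose end
 * is 0, and report the lowest start and the highest end.  No ordering
 * of the pairs is assumed.
 */
static void
sketch_phys_span(const sketch_paddr_t *avail, sketch_paddr_t *lowp,
    sketch_paddr_t *highp)
{
	sketch_paddr_t low, high;
	int i;

	assert(avail[1] != 0);		/* need at least one range */
	low = avail[0];
	high = avail[1];
	for (i = 2; avail[i + 1] != 0; i += 2) {
		if (avail[i] < low)
			low = avail[i];
		if (avail[i + 1] > high)
			high = avail[i + 1];
	}
	*lowp = low;
	*highp = high;
}
```

The difference between the two results bounds the number of page structures
needed, regardless of the order in which the ranges were reported.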


# 49c3b925 08-Nov-2006 Alan Cox <alc@FreeBSD.org>

I misplaced the assertion that was added to vm_page_startup() in the
previous change. Correct its placement.


# 9ad3296a 08-Nov-2006 Alan Cox <alc@FreeBSD.org>

Simplify the construction of the free queues in vm_page_startup(). Add
an assertion to test a hypothesis concerning other redundant computation
in vm_page_startup().


# 2a53696f 22-Oct-2006 Alan Cox <alc@FreeBSD.org>

The page queues lock is no longer required by vm_page_busy() or
vm_page_wakeup(). Reduce or eliminate its use accordingly.


# 9af80719 21-Oct-2006 Alan Cox <alc@FreeBSD.org>

Replace PG_BUSY with VPO_BUSY. In other words, changes to the page's
busy flag, i.e., VPO_BUSY, are now synchronized by the per-vm object
lock instead of the global page queues lock.


# a9a5d47c 28-Sep-2006 Ken Smith <kensmith@FreeBSD.org>

Fix two minor style(9) nits in v1.313 which were noticed during an
MFC review. alc@ will be MFCing v1.313 plus the style fix to RELENG_6.


# eb4bbba8 27-Aug-2006 Alan Cox <alc@FreeBSD.org>

Refactor vm_page_sleep_if_busy() so that the test for a busy page is
inlined and a procedure call is made in the rare case, i.e., when it is
necessary to sleep. In this case, inlining the test actually makes the
kernel smaller.


# 4f9d17d8 20-Aug-2006 Alan Cox <alc@FreeBSD.org>

Page flags are reset on (re)allocation. There is no need to clear any
flags except for PG_ZERO in vm_page_free_toq().


# b146f9e5 12-Aug-2006 Alan Cox <alc@FreeBSD.org>

Reimplement the page's NOSYNC flag as an object-synchronized instead of a
page queues-synchronized flag. Reduce the scope of the page queues lock in
vm_fault() accordingly.

Move vm_fault()'s call to vm_object_set_writeable_dirty() outside of the
scope of the page queues lock. (Reviewed by: tegge)
Additionally, eliminate an unnecessary dereference in computing the
argument that is passed to vm_object_set_writeable_dirty().


# 25017df4 11-Aug-2006 Alan Cox <alc@FreeBSD.org>

Ensure that the page's new field for object-synchronized flags is always
initialized to zero.

Call vm_page_sleep_if_busy() instead of duplicating its implementation in
vm_page_grab().


# 75db2abb 09-Aug-2006 Alan Cox <alc@FreeBSD.org>

Change vm_page_cowfault() so that it doesn't allocate a pre-busied page.


# 5786be7c 09-Aug-2006 Alan Cox <alc@FreeBSD.org>

Introduce a field to struct vm_page for storing flags that are
synchronized by the lock on the object containing the page.

Transition PG_WANTED and PG_SWAPINPROG to use the new field,
eliminating the need for holding the page queues lock when setting
or clearing these flags. Rename PG_WANTED and PG_SWAPINPROG to
VPO_WANTED and VPO_SWAPINPROG, respectively.

Eliminate the assertion that the page queues lock is held in
vm_page_io_finish().

Eliminate the acquisition and release of the page queues lock
around calls to vm_page_io_finish() in kern_sendfile() and
vfs_unbusy_pages().


# e74814b6 05-Aug-2006 Alan Cox <alc@FreeBSD.org>

Change vm_page_sleep_if_busy() so that it no longer requires the caller to
hold the page queues lock.


# 91449ce9 03-Aug-2006 Alan Cox <alc@FreeBSD.org>

When sleeping on a busy page, use the lock from the containing object
rather than the global page queues lock.


# 78985e42 01-Aug-2006 Alan Cox <alc@FreeBSD.org>

Complete the transition from pmap_page_protect() to pmap_remove_write().
Originally, I had adopted sparc64's name, pmap_clear_write(), for the
function that is now pmap_remove_write(). However, this function is more
like pmap_remove_all() than like pmap_clear_modify() or
pmap_clear_reference(), hence, the name change.

The higher-level rationale behind this change is described in
src/sys/amd64/amd64/pmap.c revision 1.567. The short version is that I'm
trying to clean up and fix our support for execute access.

Reviewed by: marcel@ (ia64)


# af51d7bf 21-Jul-2006 Alan Cox <alc@FreeBSD.org>

Eliminate OBJ_WRITEABLE. It hasn't been used in a long time.


# 9bdaa433 23-Jun-2006 John Baldwin <jhb@FreeBSD.org>

Move the code to handle the vm.blacklist tunable up a layer into
vm_page_startup(). As a result, we now look up the tunable only once
instead of looking it up once for every physical page of memory in the
system. This cuts a delay of about 1 second out of boot on x86
systems. The delay is apparently much larger and more noticeable on sun4v.

Reported by: kmacy
MFC after: 1 week


# 4cbb1c1a 31-May-2006 Paul Saab <ps@FreeBSD.org>

Fix minidumps to include pages allocated via pmap_map on amd64.
These pages are allocated from the direct map, and were not previously
tracked. This included the vm_page_array and the early UMA bootstrap
pages.

Reviewed by: peter


# c0345a84 20-Apr-2006 Peter Wemm <peter@FreeBSD.org>

Introduce minidumps. Full physical memory crash dumps are still available
via the debug.minidump sysctl and tunable.

Traditional dumps store all physical memory. This was once a good thing
when machines had a maximum of 64M of ram and 1GB of kvm. These days,
machines often have many gigabytes of ram and a smaller amount of kvm.
libkvm+kgdb don't have a way to access physical ram that is not mapped
into kvm at the time of the crash dump, so the extra ram being dumped
is mostly wasted.

Minidumps invert the process. Instead of dumping physical memory
in order to guarantee that all of kvm's backing is dumped, minidumps
instead dump only memory that is actively mapped into kvm.

amd64 has a direct map region that things like UMA use. Obviously we
cannot dump all of the direct map region because that is effectively
an old style all-physical-memory dump. Instead, introduce a bitmap
and two helper routines (dump_add_page(pa) and dump_drop_page(pa)) that
allow certain critical direct map pages to be included in the dump.
uma_machdep.c's allocator is the intended consumer.
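
A page-presence bitmap of this kind can be sketched as follows. This
is a userspace illustration under stated assumptions, not the kernel
implementation: it assumes 4K pages and a toy memory size, performs no
bounds checking, and every sk_ name is hypothetical.

```c
#include <limits.h>
#include <stdint.h>

#define SK_PAGE_SHIFT	12	/* assumed 4K pages */
#define SK_NPAGES	1024	/* toy physical memory size, in pages */
#define SK_WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)

/* One bit per physical page; a set bit means "include in the dump". */
static unsigned long sk_dump_bitmap[(SK_NPAGES + SK_WORD_BITS - 1) /
    SK_WORD_BITS];

/* Mark the page containing physical address pa for inclusion. */
static void
sk_dump_add_page(uint64_t pa)
{
	uint64_t pfn = pa >> SK_PAGE_SHIFT;

	sk_dump_bitmap[pfn / SK_WORD_BITS] |= 1UL << (pfn % SK_WORD_BITS);
}

/* Remove the page containing physical address pa from the dump set. */
static void
sk_dump_drop_page(uint64_t pa)
{
	uint64_t pfn = pa >> SK_PAGE_SHIFT;

	sk_dump_bitmap[pfn / SK_WORD_BITS] &= ~(1UL << (pfn % SK_WORD_BITS));
}

/* Report whether the page containing pa is currently in the dump set. */
static int
sk_dump_has_page(uint64_t pa)
{
	uint64_t pfn = pa >> SK_PAGE_SHIFT;

	return ((sk_dump_bitmap[pfn / SK_WORD_BITS] >>
	    (pfn % SK_WORD_BITS)) & 1UL) != 0;
}
```

Callers of this sketch must pass addresses below SK_NPAGES pages. One
bit per page keeps the bookkeeping small, so a dump routine can walk
the bitmap and write out only the pages whose bits are set.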

Dumps are a custom format. At the very beginning of the file is a header,
then a copy of the message buffer, then the bitmap of pages present in
the dump, then the final level of the kvm page table trees (2MB mappings
are expanded into 4K page mappings), then the sparse physical pages
according to the bitmap. libkvm can now conveniently access the kvm
page table entries.

Booting my test 8GB machine, forcing it into ddb and forcing a dump
leads to a 48MB minidump. While this is a best case, I expect minidumps
to be in the 100MB-500MB range. They are never larger than physical
memory, of course.

minidumps are on by default. It would only be necessary to turn them off
in order to debug corrupt kernel page table management, as that
would mess up minidumps as well.

Both minidumps and regular dumps are supported on the same machine.


# 62a59e8f 07-Mar-2006 Warner Losh <imp@FreeBSD.org>

Remove leading __ from __(inline|const|signed|volatile). They are
obsolete. This should reduce diffs to NetBSD as well.


# 22440959 15-Feb-2006 Stephan Uphoff <ups@FreeBSD.org>

When the VM needs to allocate physical memory pages (for non-interrupt use)
and it does not have plenty of free pages, it tries to free pages in the cache queue.
Unfortunately, freeing a cached page requires locking the object that
owns the page. However, in the context of allocating pages we may not be able
to lock the object and thus can only TRY to lock the object. If the lock attempt
fails, the cache page cannot be freed and is activated to move it out of the way
so that we may try to free other cache pages.

If all pages in the cache belong to objects that are currently locked the
cache queue can be emptied without freeing a single page. This scenario caused
two problems:

1) vm_page_alloc always failed allocation when it tried freeing pages from
the cache queue and failed to do so. However if there are more than
cnt.v_interrupt_free_min pages on the free list it should return pages
when requested with priority VM_ALLOC_SYSTEM. Failure to do so can cause
resource exhaustion deadlocks.

2) Threads that need to allocate pages spend a lot of time cleaning up the
page queue without really getting anything done while the pagedaemon
needs to work overtime to refill the cache.

This change fixes the first problem. (1)

Reviewed by: tegge@


# 6c237adc 31-Jan-2006 Alan Cox <alc@FreeBSD.org>

Change #if defined(DIAGNOSTIC) to KASSERT.


# fc3c1bc4 24-Jan-2006 Alan Cox <alc@FreeBSD.org>

In vm_page_set_invalid() invalidate all of the page's mappings as soon as
any part of the page's contents is invalidated.

Submitted by: tegge


# ef39c05b 31-Dec-2005 Alexander Leidinger <netchild@FreeBSD.org>

MI changes:
- provide an interface (macros) to the page coloring part of the VM system,
this allows trying different coloring algorithms without the need to
touch every file [1]
- make the page queue tuning values readable: sysctl vm.stats.pagequeue
- autotuning of the page coloring values based upon the cache size instead
of options in the kernel config (disabling of the page coloring as a
kernel option is still possible)

MD changes:
- detection of the cache size: only IA32 and AMD64 (untested) contain
cache size detection code, every other arch just comes with a dummy
function (this results in the use of default values like it was the
case without the autotuning of the page coloring)
- print some more info on Intel CPUs (like we do on AMD and Transmeta
CPUs)

Note to AMD owners (IA32 and AMD64): please run "sysctl vm.stats.pagequeue"
and report if the cache* values are zero (= bug in the cache detection code)
or not.

Based upon work by: Chad David <davidc@acns.ab.ca> [1]
Reviewed by: alc, arch (in 2004)
Discussed with: alc, Chad David, arch (in 2004)


# 984922d7 13-Dec-2005 Alan Cox <alc@FreeBSD.org>

Assert that the page that is given to vm_page_free_toq() does not have any
managed mappings.


# 7e9d9442 07-Nov-2005 Alan Cox <alc@FreeBSD.org>

If a physical page is mapped by two or more virtual addresses, transmitted
by the zero-copy sockets method, and written to before the transmission
completes, we need to destroy all of the existing mappings to the page,
not just the one that we fault on. Otherwise, the mappings will no longer
be to the same page and changes made through one of the mappings will not
be visible through the others.

Observed by: tegge


# 674b706e 31-Oct-2005 Alan Cox <alc@FreeBSD.org>

Consider the zero-copy transmission of a page that was wired by mlock(2).
If a copy-on-write fault occurs on the page, the new copy should inherit
a part of the original page's wire count.

Submitted by: tegge
MFC after: 1 week


# 3803b26b 08-Oct-2005 Dag-Erling Smørgrav <des@FreeBSD.org>

As alc pointed out to me, vm_page.c 1.305 was incomplete: uma_startup()
still uses the constant UMA_BOOT_PAGES. Change it to accept boot_pages
as an additional argument.

MFC after: 2 weeks


# cfa22bcc 11-Aug-2005 Dag-Erling Smørgrav <des@FreeBSD.org>

Introduce the vm.boot_pages tunable and sysctl, which controls the number
of pages reserved to bootstrap the kernel memory allocator.

MFC after: 2 weeks


# 761dbeb6 15-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- In vm_page_insert() hold the backing vnode when the first page
is inserted.
- In vm_page_remove() drop the backing vnode when the last page
is removed.
- Don't check the vnode to see if it must be reclaimed on every
call to vm_page_free_toq() as we only check it now when it is
actually required. This saves us two lock operations per call.

Sponsored by: Isilon Systems, Inc.


# 46fbc582 06-Jan-2005 Alan Cox <alc@FreeBSD.org>

Transfer responsibility for freeing the page taken from the cache
queue and (possibly) unlocking the containing object from
vm_page_alloc() to vm_page_select_cache(). Recent optimizations to
vm_map_pmap_enter() (see vm_map.c revisions 1.362 and 1.363) and
pmap_enter_quick() have resulted in panic()s because vm_page_alloc()
mistakenly unlocked objects that had not been locked by
vm_page_select_cache().

Reported by: Peter Holm and Kris Kennaway


# 60727d8b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# 0869d38b 31-Dec-2004 Alan Cox <alc@FreeBSD.org>

Assert that page allocations during an interrupt specify
VM_ALLOC_INTERRUPT.

Assert that pages removed from the cache queue are not busy.


# 7aa2190c 28-Dec-2004 Alan Cox <alc@FreeBSD.org>

Access to the page's busy field is (now) synchronized by the containing
object's lock. Therefore, the assertion that the page queues lock is held
can be removed from vm_page_io_start().


# 40198b3c 26-Dec-2004 Alan Cox <alc@FreeBSD.org>

Assert that the vm object is locked on entry to vm_page_sleep_if_busy();
remove some unneeded code.


# d19ef814 03-Nov-2004 Alan Cox <alc@FreeBSD.org>

The synchronization provided by vm object locking has eliminated the
need for most calls to vm_page_busy(). Specifically, most calls to
vm_page_busy() occur immediately prior to a call to vm_page_remove().
In such cases, the containing vm object is locked across both calls.
Consequently, the setting of the vm page's PG_BUSY flag is not even
visible to other threads that are following the synchronization
protocol.

This change (1) eliminates the calls to vm_page_busy() that
immediately precede a call to vm_page_remove() or functions, such as
vm_page_free() and vm_page_rename(), that call it and (2) relaxes the
requirement in vm_page_remove() that the vm page's PG_BUSY flag is
set. Now, the vm page's PG_BUSY flag is set only when the vm object
lock is released while the vm page is still in transition. Typically,
this is when it is undergoing I/O.


# f4d49654 27-Oct-2004 Alan Cox <alc@FreeBSD.org>

Assert that the containing vm object is locked in vm_page_cache() and
vm_page_try_to_cache().


# 63bb7041 25-Oct-2004 Alan Cox <alc@FreeBSD.org>

Assert that the containing vm object is locked in vm_page_flash().


# 75d05338 24-Oct-2004 Alan Cox <alc@FreeBSD.org>

Assert that the containing vm object is locked in vm_page_busy() and
vm_page_wakeup().


# 0f9f9bcb 24-Oct-2004 Alan Cox <alc@FreeBSD.org>

Introduce VM_ALLOC_NOBUSY, an option to vm_page_alloc() and vm_page_grab()
that indicates that the caller does not want a page with its busy flag set.
In many places, the global page queues lock is acquired and released just
to clear the busy flag on a just allocated page. Both the allocation of
the page and the clearing of the busy flag occur while the containing vm
object is locked. So, the busy flag might as well never be set.


# 1e96d2a2 18-Oct-2004 Alan Cox <alc@FreeBSD.org>

Correct two errors in PG_BUSY management by vm_page_cowfault(). Both
errors are in rarely executed paths.
1. Each time the retry_alloc path is taken, the PG_BUSY must be set again.
Otherwise vm_page_remove() panics.
2. There is no need to set PG_BUSY on the newly allocated page before
freeing it. The page already has PG_BUSY set by vm_page_alloc().
Setting it again could cause an assertion failure.

MFC after: 2 weeks


# 36aeb90e 17-Oct-2004 Alan Cox <alc@FreeBSD.org>

Assert that the containing object is locked in vm_page_io_start() and
vm_page_io_finish(). The motivation being to transition synchronization of
the vm_page's busy field from the global page queues lock to the per-object
lock.


# 7ce1979b 14-Sep-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Add a new function isa_dma_init() which returns an errno when it fails
and which takes a M_WAITOK/M_NOWAIT flag argument.

Add compatibility isa_dmainit() macro which whines loudly if
isa_dma_init() fails.

Problem uncovered by: tegge


# a0879143 29-Jul-2004 Alan Cox <alc@FreeBSD.org>

Advance the state of pmap locking on alpha, amd64, and i386.

- Enable recursion on the page queues lock. This allows calls to
vm_page_alloc(VM_ALLOC_NORMAL) and UMA's obj_alloc() with the page
queues lock held. Such calls are made to allocate page table pages
and pv entries.
- The previous change enables a partial reversion of vm/vm_page.c
revision 1.216, i.e., the call to vm_page_alloc() by vm_page_cowfault()
now specifies VM_ALLOC_NORMAL rather than VM_ALLOC_INTERRUPT.
- Add partial locking to pmap_copy(). (As a side-effect, pmap_copy()
should now be faster on i386 SMP because it no longer generates IPIs
for TLB shootdown on the other processors.)
- Complete the locking of pmap_enter() and pmap_enter_quick(). (As of now,
all changes to a user-level pmap on alpha, amd64, and i386 are performed
with appropriate locking.)


# d951b752 21-Jul-2004 Brian Feldman <green@FreeBSD.org>

Fix a race in vm_page_sleep_if_busy(). Due to vm_object locking
being incomplete, it currently has to know how to drop and pick back
up the vm_object's mutex if it has to sleep and drop the page queue
mutex. The problem with this is that if the page is busy, while we
are sleeping, the page can be freed and the object can disappear. When
trying to lock m->object, we'd get a stale or NULL pointer and crash.

The object is now cached, but this makes the assumption that
the object is referenced in some manner and will not itself
disappear while it is unlocked. Since this only happens if
the object is locked, I had to remove an assumption earlier in
contigmalloc() that reversed the order of locking the object and
doing vm_page_sleep_if_busy(), not the normal order.


# e832aafc 19-Jul-2004 Alan Cox <alc@FreeBSD.org>

- Eliminate the pte object from the pmap. Instead, page table pages are
allocated as "no object" pages. Similar changes were made to the amd64
and i386 pmap last year. The primary reason being that maintaining
a pte object leads to lock order violations. A secondary reason being
that the pte object is redundant, i.e., the page table itself can be
used to lookup page table pages. (Historical note: The pte object
predates our ability to allocate "no object" pages. Thus, the pte
object was a necessary evil.)
- Unconditionally check the vm object lock's status in vm_page_remove().
Previously, this assertion could not be made on Alpha due to its use
of a pte object.


# 790bdd0f 10-Jul-2004 Alan Cox <alc@FreeBSD.org>

Increase the scope of the page queues lock in vm_page_alloc() to cover
a diagnostic check that accesses the cache queue count.


# 0a2df477 18-Jun-2004 Alan Cox <alc@FreeBSD.org>

Remove spl() calls. Update comments to reflect the removal of spl() calls.
Remove '\n' from panic() format strings. Remove some blank lines.


# d45f21f3 17-Jun-2004 Alan Cox <alc@FreeBSD.org>

Do not preset PG_BUSY on VM_ALLOC_NOOBJ pages. Such pages are not
accessible through an object. Thus, PG_BUSY serves no purpose.


# 4be14af9 21-May-2004 Alan Cox <alc@FreeBSD.org>

To date, unwiring a fictitious page has produced a panic. The reason
being that PHYS_TO_VM_PAGE() returns the wrong vm_page for fictitious
pages but unwiring uses PHYS_TO_VM_PAGE(). The resulting panic
reported an unexpected wired count. Rather than attempting to fix
PHYS_TO_VM_PAGE(), this fix takes advantage of the properties of
fictitious pages. Specifically, fictitious pages will never be
completely unwired. Therefore, we can keep a fictitious page's wired
count forever set to one and thereby avoid the use of
PHYS_TO_VM_PAGE() when we know that we're working with a fictitious
page, just not which one.

In collaboration with: green@, tegge@
PR: kern/29915


# 1bb816d3 11-May-2004 Alan Cox <alc@FreeBSD.org>

Restructure vm_page_select_cache() so that adding assertions is easy.

Some of the conditions that caused vm_page_select_cache() to deactivate a
page were wrong. For example, deactivating an unmanaged or wired page is a
nop. Thus, if vm_page_select_cache() had ever encountered an unmanaged or
wired page, it would have looped forever. Now, we assert that the page is
neither unmanaged nor wired.


# 3f39cca9 08-May-2004 Alan Cox <alc@FreeBSD.org>

Cache queue pages are not mapped. Thus, the pmap_remove_all() by
vm_page_alloc() is unnecessary.


# 2ec91846 24-Apr-2004 Alan Cox <alc@FreeBSD.org>

Update the comment describing vm_page_grab() to reflect the previous
revision and correct some of its style errors.


# 7ef6ba5d 24-Apr-2004 Alan Cox <alc@FreeBSD.org>

Push down the responsibility for zeroing a physical page from the
caller to vm_page_grab(). Although this gives VM_ALLOC_ZERO a
different meaning for vm_page_grab() than for vm_page_alloc(), I feel
such change is necessary to accomplish other goals. Specifically, I
want to make the PG_ZERO flag immutable between the time it is
allocated by vm_page_alloc() and freed by vm_page_free() or
vm_page_free_zero() to avoid locking overheads. Once we gave up on
the ability to automatically recognize a zeroed page upon entry to
vm_page_free(), the ability to mutate the PG_ZERO flag became useless.
Instead, I would like to say that "Once a page becomes valid, its
PG_ZERO flag must be ignored."


# 05eb3785 06-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core


# 889eb0fc 04-Apr-2004 Alan Cox <alc@FreeBSD.org>

Eliminate unused arguments from vm_page_startup().


# ca3b4477 02-Mar-2004 Alan Cox <alc@FreeBSD.org>

Modify contigmalloc1() so that the free page queues lock is not held when
vm_page_free() is called. The problem with holding this lock is that it is
a spin lock and vm_page_free() may attempt the acquisition of a different
default-type lock.


# 0f75a977 19-Feb-2004 Alan Cox <alc@FreeBSD.org>

- Correct a long-standing race condition in vm_page_try_to_free() that
could result in a dirty page being unintentionally freed.
- Simplify the dirty page check in vm_page_dontneed().

Reviewed by: tegge
MFC after: 7 days


# 84d98bf6 14-Feb-2004 Alan Cox <alc@FreeBSD.org>

- Correct a long-standing race condition in vm_page_try_to_cache() that
could result in a panic "vm_page_cache: caching a dirty page, ...":
Access to the page must be restricted or removed before calling
vm_page_cache(). This race condition is identical in nature to that
which was addressed by vm_pageout.c's revision 1.251.
- Simplify the code surrounding the fix to this same race condition
in vm_pageout.c's revision 1.251. There should be no behavioral
change. (Reviewed by: tegge)

MFC after: 7 days


# 65bae14d 08-Jan-2004 Alan Cox <alc@FreeBSD.org>

- Enable recursive acquisition of the mutex synchronizing access to the
free pages queue. This is presently needed by contigmalloc1().
- Move a sanity check against attempted double allocation of two pages
to the same vm object offset from vm_page_alloc() to vm_page_insert().
This provides better protection because double allocation could occur
through a direct call to vm_page_insert(), such as that by
vm_page_rename().
- Modify contigmalloc1() to hold the mutex synchronizing access to the
free pages queue while it scans vm_page_array in search of free pages.
- Correct a potential leak of pages by contigmalloc1() that I introduced
in revision 1.20: We must convert all cache queue pages to free pages
before we begin removing free pages from the free queue. Otherwise,
if we have to restart the scan because we are unable to acquire the
vm object lock that is necessary to convert a cache queue page to a
free page, we leak those free pages already removed from the free queue.


# 4804edb4 31-Dec-2003 Alan Cox <alc@FreeBSD.org>

In vm_page_lookup() check the root of the vm object's splay tree for the
desired page before calling vm_page_splay().


# bcdaad7f 30-Dec-2003 Alan Cox <alc@FreeBSD.org>

Simplify vm_page_grab(): Don't bother with the generation check. If the
vm object hasn't changed, the desired page will be at or near the root
of the vm object's splay tree, making vm_page_lookup() cheap. (The only
lock required for vm_page_lookup() is already held.) If, however, the
vm object has changed and retry was requested, eliminating the generation
check also eliminates a pointless acquisition and release of the page
queues lock.


# 9582cd94 21-Dec-2003 Alan Cox <alc@FreeBSD.org>

- Create an unmapped guard page to trap access to vm_page_array[-1].
This guard page would have trapped the problems with the MFC of the PAE
support to RELENG_4 at an earlier point in the sequence of events.

Submitted by: tegge


# de33bedd 31-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Additional vm object locking in vm_object_split()
- New vm object locking assertions in vm_page_insert() and
vm_object_set_writeable_dirty()


# ab42316c 22-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Retire vm_pageout_page_free(). Instead, use vm_page_select_cache() from
vm_pageout_scan(). Rationale: I don't like leaving a busy page in the
cache queue with neither the vm object nor the vm page queues lock held.
- Assert that the page is active in vm_pageout_page_stats().


# 0d42c05f 21-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Assert that the containing vm object is locked in
vm_page_set_validclean(). (This function reads and modifies the
vm page's valid field, which is synchronized by the lock on the
containing vm object.)


# fee181a6 20-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Remove some long unused code.


# 669890ea 07-Oct-2003 Alan Cox <alc@FreeBSD.org>

Retire vm_page_copy(). Its reason for being ended when peter@ modified
pmap_copy_page() et al. to accept a vm_page_t rather than a physical
address. Also, this change will facilitate locking access to the vm page's
valid field.


# 5a3970fe 05-Oct-2003 Alan Cox <alc@FreeBSD.org>

Assert that the containing vm object's lock is held in
vm_page_set_invalid().


# 874f526d 04-Oct-2003 Alan Cox <alc@FreeBSD.org>

Assert that the containing vm object's lock is held in
vm_page_zero_invalid().


# bf0da100 04-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Extend the scope of the vm object lock to cover calls to
vm_page_is_valid().
- Assert that the lock on the containing vm object is held in
vm_page_is_valid().


# 50028aa7 27-Sep-2003 Alan Cox <alc@FreeBSD.org>

In vm_page_remove(), assert that the vm object is locked, except on Alpha.
(The Alpha still requires updates to its pmap.)


# 95aad59a 21-Sep-2003 Alan Cox <alc@FreeBSD.org>

Initialize the page's pindex field even for VM_ALLOC_NOOBJ allocations.
(This field is useful for implementing sanity checks even if the page does
not belong to an object.)


# 2370c6d4 28-Aug-2003 Alan Cox <alc@FreeBSD.org>

Recent pmap changes permit the use of a more precise locking assertion
in vm_page_lookup().


# 529e15ed 23-Aug-2003 Alan Cox <alc@FreeBSD.org>

Held pages, just like wired pages, should not be added to the cache queues.

Submitted by: tegge


# b7ad744d 23-Aug-2003 Alan Cox <alc@FreeBSD.org>

Hold the page queues lock when performing vm_page_clear_dirty() and
vm_page_set_invalid().


# 0f132ba6 21-Aug-2003 Alan Cox <alc@FreeBSD.org>

Assert that the vm object's lock is held on entry to vm_page_grab(); remove
code from this function that was needed when vm object locking was
incomplete.


# 891c1d4b 20-Aug-2003 Alan Cox <alc@FreeBSD.org>

Assert that the vm object lock is held in vm_page_alloc().


# c53e8c56 01-Jul-2003 Alan Cox <alc@FreeBSD.org>

Modify vm_page_alloc() and vm_page_select_cache() to allow the page that
is returned by vm_page_select_cache() to belong to the object that is
already locked by the caller to vm_page_alloc().


# baaaadf1 28-Jun-2003 Alan Cox <alc@FreeBSD.org>

- Use an int rather than a vm_pindex_t to represent the desired page
color in vm_page_alloc(). (This also has small performance benefits.)
- Eliminate vm_page_select_free(); vm_page_alloc() might as well
call vm_pageq_find() directly.


# 9f2b1758 26-Jun-2003 Alan Cox <alc@FreeBSD.org>

vm_page_select_cache() enforces a number of conditions on the returned
page. Add the ability to lock the containing object to those conditions.


# f29ba63e 22-Jun-2003 Alan Cox <alc@FreeBSD.org>

Maintain a lock on the vm object of interest throughout vm_fault(),
releasing the lock only if we are about to sleep (e.g., vm_pager_get_pages()
or vm_pager_has_pages()). If we sleep, we have marked the vm object with
the paging-in-progress flag.


# 37681d86 18-Jun-2003 Alan Cox <alc@FreeBSD.org>

Assert that the vm object is locked in vm_page_try_to_free().


# 874651b1 11-Jun-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().


# 36d1fdf5 07-Jun-2003 Alan Cox <alc@FreeBSD.org>

Teach vm_page_grab() how to handle the vm object's lock.


# 5299887d 25-Apr-2003 Alan Cox <alc@FreeBSD.org>

- Relax the Giant required in vm_page_remove().
- Remove the Giant required from vm_page_free_toq(). (Any locking
errors will be caught by vm_page_remove().)

This remedies a panic that occurred when kmem_malloc(NOWAIT), performed
without Giant, failed to allocate the necessary pages.

Reported by: phk


# 2e9d00a1 22-Apr-2003 Alan Cox <alc@FreeBSD.org>

Revision 1.246 should have also included

- Weaken the assertion in vm_page_insert() to require Giant only if the
vm_object isn't locked.

Reported by: "Ilmar S. Habibulin" <ilmar@watson.org>


# 03d4c1e6 21-Apr-2003 Alan Cox <alc@FreeBSD.org>

Revision 1.52 of vm/uma_core.c has led to UMA's obj_alloc() being
called without Giant; and obj_alloc() in turn calls vm_page_alloc()
without Giant. This causes an assertion failure in vm_page_alloc().
Fortunately, obj_alloc() is now MPSAFE. So, we need only clean up
some assertions.

- Weaken the assertion in vm_page_lookup() to require Giant only
if the vm_object isn't locked.
- Remove an assertion from vm_page_alloc() that duplicates a check
performed in vm_page_lookup().

In collaboration with: gallatin, jake, jeff


# d8fed0f0 10-Apr-2003 John Baldwin <jhb@FreeBSD.org>

- Kill the pv_flags member of the alpha mdpage since it stopped being used
in rev 1.61 of pmap.c.
- Now that pmap_page_is_free() is empty and since it is just a hack for
the Alpha pmap, remove it.


# 227f9a1c 24-Mar-2003 Jake Burkholder <jake@FreeBSD.org>

- Add vm_paddr_t, a physical address type. This is required for systems
where physical addresses are larger than virtual addresses, such as i386s
with PAE.
- Use this to represent physical addresses in the MI vm system and in the
i386 pmap code. This also changes the paddr parameter to d_mmap_t.
- Fix printf formats to handle physical addresses >4G in the i386 memory
detection code, and due to kvtop returning vm_paddr_t instead of u_long.

Note that this is a name change only; vm_paddr_t is still the same as
vm_offset_t on all currently supported platforms.

Sponsored by: DARPA, Network Associates Laboratories
Discussed with: re, phk (cdevsw change)


# dab392a4 18-Mar-2003 Maxime Henrion <mux@FreeBSD.org>

Remove an empty comment.


# 9f77ba59 16-Mar-2003 Jake Burkholder <jake@FreeBSD.org>

Subtract the memory that backs the vm_page structures from phys_avail
after mapping it. This makes it possible to determine if a physical
page has a backing vm_page or not.


# 1a1e9f41 01-Mar-2003 Alan Cox <alc@FreeBSD.org>

Teach vm_page_sleep_if_busy() to release the vm_object lock before sleeping.


# 3fa24ec9 24-Feb-2003 Alan Cox <alc@FreeBSD.org>

In vm_page_dirty(), assert that the page is not in the free queue(s).


# e6f2748c 01-Feb-2003 Alan Cox <alc@FreeBSD.org>

- Convert the tsleep()s in vm_wait() and vm_waitpfault() to msleep()s
with the page queue lock.
- Assert that the page queue lock is held in vm_page_free_wakeup().


# 28ec30cd 20-Jan-2003 Alan Cox <alc@FreeBSD.org>

- Hold the page queues lock around vm_page_hold().
- Assert that the page queues lock rather than Giant is held in
vm_page_hold().


# b0ef8c5f 13-Jan-2003 Alan Cox <alc@FreeBSD.org>

- Update vm_pageout_deficit using atomic operations. It's a simple
counter outside the scope of existing locks.
- Eliminate a redundant clearing of vm_pageout_deficit.


# a15700fe 12-Jan-2003 Alan Cox <alc@FreeBSD.org>

Make vm_page_alloc() return PG_ZERO only if VM_ALLOC_ZERO is specified.
The objective being to eliminate some cases of page queues locking.
(See, for example, vm/vm_fault.c revision 1.160.)

Reviewed by: tegge

(Also, pointed out by tegge that I changed vm_fault.c before changing
vm_page.c. Oops.)


# b5dc8305 11-Jan-2003 Alan Cox <alc@FreeBSD.org>

In vm_page_alloc(), fuse two if statements that are conditioned on the same
expression.


# 9a032278 08-Jan-2003 Alan Cox <alc@FreeBSD.org>

In vm_page_alloc(), honor VM_ALLOC_ZERO for system and interrupt class
requests when the number of free pages is below the reserved threshold.
Previously, VM_ALLOC_ZERO was only honored when the number of free pages
was above the reserved threshold. Honoring it in all cases generally
makes sense, does no harm, and simplifies the code.


# 6c4952c7 04-Jan-2003 Alan Cox <alc@FreeBSD.org>

Use atomic add and subtract to update the global wired page count,
cnt.v_wire_count.
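
The general technique, a shared counter updated without a lock, can be
sketched in portable C11. This is an illustration only: it uses
<stdatomic.h> rather than the kernel's own atomic primitives, and the
sk_ names are hypothetical.

```c
#include <stdatomic.h>

/* Global count of wired pages, shared by all threads. */
static atomic_uint sk_wire_count;

/* Account for one newly wired page; no lock is needed. */
static void
sk_page_wire(void)
{
	atomic_fetch_add(&sk_wire_count, 1);
}

/* Account for one page that is no longer wired. */
static void
sk_page_unwire(void)
{
	atomic_fetch_sub(&sk_wire_count, 1);
}
```

Each update is a single indivisible read-modify-write, so concurrent
callers cannot lose increments or decrements even though no mutex
protects the counter.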


# 009f3e7a 04-Jan-2003 Alan Cox <alc@FreeBSD.org>

Refine the assertions in vm_page_alloc().


# d61e1287 01-Jan-2003 Alan Cox <alc@FreeBSD.org>

Update the assertions in vm_page_insert() and vm_page_lookup() to reflect
locking of the kmem_object.


# a28cc55e 29-Dec-2002 Alan Cox <alc@FreeBSD.org>

Reduce the number of times that we acquire and release the page queues
lock by making vm_page_rename()'s caller, rather than vm_page_rename(),
responsible for acquiring it.


# 2ee5fea7 28-Dec-2002 Alan Cox <alc@FreeBSD.org>

Assert that the page queues lock rather than Giant is held in
vm_page_flag_clear().


# 24c9ad6b 19-Dec-2002 Alan Cox <alc@FreeBSD.org>

- Remove vm_page_sleep_busy(). The transition to vm_page_sleep_if_busy(),
which incorporates page queue and field locking, is complete.
- Assert that the page queue lock rather than Giant is held in
vm_page_flag_set().


# 495bedfb 14-Dec-2002 Alan Cox <alc@FreeBSD.org>

Assert that the page queues lock is held in vm_page_unhold(),
vm_page_remove(), and vm_page_free_toq().


# 178949e0 23-Nov-2002 Alan Cox <alc@FreeBSD.org>

Hold the page queues/flags lock when calling vm_page_set_validclean().

Approved by: re


# a12cc0e4 17-Nov-2002 Alan Cox <alc@FreeBSD.org>

Remove vm_page_protect(). Instead, use pmap_page_protect() directly.


# 4fec79be 16-Nov-2002 Alan Cox <alc@FreeBSD.org>

Now that pmap_remove_all() is exported by our pmap implementations
use it directly.


# d154fb4f 10-Nov-2002 Alan Cox <alc@FreeBSD.org>

When prot is VM_PROT_NONE, call pmap_page_protect() directly rather than
indirectly through vm_page_protect(). The one remaining page flag that
is updated by vm_page_protect() is already being updated by our various
pmap implementations.

Note: A later commit will similarly change the VM_PROT_READ case and
eliminate vm_page_protect().


# 1f7c5f98 09-Nov-2002 Alan Cox <alc@FreeBSD.org>

In vm_page_remove(), avoid calling vm_page_splay() if the object's memq
is empty.


# ada2a050 04-Nov-2002 Alan Cox <alc@FreeBSD.org>

Export the function vm_page_splay().


# c71f01af 03-Nov-2002 Alan Cox <alc@FreeBSD.org>

- Remove the memory allocation for the object/offset hash table
because it's no longer used. (See revision 1.215.)
- Fix a harmless bug: the number of vm_page structures allocated wasn't
properly adjusted when uma_bootstrap() was introduced. Consequently,
we were allocating 30 unused vm_page structures.
- Wrap a long line.


# 02af9de6 02-Nov-2002 Alan Cox <alc@FreeBSD.org>

Remove the vm page buckets mutex. As of revision 1.215 of vm/vm_page.c,
it is unused.


# 026aa839 31-Oct-2002 Jeff Roberson <jeff@FreeBSD.org>

- Add a new flag to vm_page_alloc, VM_ALLOC_NOOBJ. This tells
vm_page_alloc not to insert this page into an object. The pindex is
still used for colorization.
- Rework vm_page_select_* to accept a color instead of an object and
pindex to work with VM_ALLOC_NOOBJ.
- Document other VM_ALLOC_ flags.

Reviewed by: peter, jake


# f3b676f0 20-Oct-2002 Alan Cox <alc@FreeBSD.org>

o Reinline vm_page_undirty(), reducing the kernel size. (This reverts
a part of vm_page.h revision 1.87 and vm_page.c revision 1.167.)


# f4ecdf05 19-Oct-2002 Alan Cox <alc@FreeBSD.org>

Complete the page queues locking needed for the page-based copy-
on-write (COW) mechanism. (This mechanism is used by the zero-copy
TCP/IP implementation.)
- Extend the scope of the page queues lock in vm_fault()
to cover vm_page_cowfault().
- Modify vm_page_cowfault() to release the page queues lock
if it sleeps.


# b86ec922 18-Oct-2002 Matthew Dillon <dillon@FreeBSD.org>

Replace the vm_page hash table with a per-vmobject splay tree. There should
be no major change in performance from this change at this time but this
will allow other work to progress: Giant lock removal around VM system
in favor of per-object mutexes, ranged fsyncs, more optimal COMMIT rpc's for
NFS, partial filesystem syncs by the syncer, more optimal object flushing,
etc. Note that the buffer cache is already using a similar splay tree
mechanism.

Note that a good chunk of the old hash table code is still in the tree.
Alan or I will remove it prior to the release if the new code does not
introduce unsolvable bugs, else we can revert more easily.

Submitted by: alc (this is Alan's code)
Approved by: re


# 8a59b15c 01-Sep-2002 Alan Cox <alc@FreeBSD.org>

o Synchronize updates to struct vm_page::cow with the page queues lock.


# fff6062a 24-Aug-2002 Alan Cox <alc@FreeBSD.org>

o Retire vm_page_zero_fill() and vm_page_zero_fill_area(). Ever since
pmap_zero_page() and pmap_zero_page_area() were modified to accept
a struct vm_page * instead of a physical address, vm_page_zero_fill()
and vm_page_zero_fill_area() have served no purpose.


# 60582cbe 10-Aug-2002 Alan Cox <alc@FreeBSD.org>

o Assert that the page queues lock is held in vm_page_activate().


# db44450b 10-Aug-2002 Alan Cox <alc@FreeBSD.org>

o Remove the setting and clearing of the PG_MAPPED flag. (This flag is
obsolete.)


# 06ec58b7 08-Aug-2002 Alan Cox <alc@FreeBSD.org>

o Use pmap_page_is_mapped() in vm_page_protect() rather than the PG_MAPPED
flag. (This is the only place in the entire kernel where the PG_MAPPED
flag is tested. It will be removed soon.)


# 24c28f1a 04-Aug-2002 Alan Cox <alc@FreeBSD.org>

o Acquire the page queues lock before checking the page's busy status
in vm_page_grab(). Also, replace the nearby tsleep() with an msleep()
on the page queues lock.


# e6e370a7 04-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Replace v_flag with v_iflag and v_vflag
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
clear.
- Many functions in vfs_subr.c were restructured to provide for stronger
locking.

Idea stolen from: BSD/OS


# aa9b1d94 02-Aug-2002 Alan Cox <alc@FreeBSD.org>

o Remove the setting of PG_MAPPED from vm_page_wire() and
vm_page_alloc(VM_ALLOC_WIRED).


# 1e7ce68f 01-Aug-2002 Alan Cox <alc@FreeBSD.org>

o Lock page queue accesses in nwfs and smbfs.
o Assert that the page queues lock is held in vm_page_deactivate().


# 46086ddf 01-Aug-2002 Alan Cox <alc@FreeBSD.org>

o Acquire the page queues lock before calling vm_page_io_finish().
o Assert that the page queues lock is held in vm_page_io_finish().


# 67c1fae9 31-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Lock page accesses by vm_page_io_start() with the page queues lock.
o Assert that the page queues lock is held in vm_page_io_start().


# e5f8bd94 29-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Introduce vm_page_sleep_if_busy() as an eventual replacement for
vm_page_sleep_busy(). vm_page_sleep_if_busy() uses the page
queues lock.


# 2c071f61 28-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Modify vm_page_grab() to accept VM_ALLOC_WIRED.


# 2999e9fa 22-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Lock page queue accesses by vm_page_dontneed().
o Assert that the page queue lock is held in vm_page_dontneed().


# 40eab1e9 20-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Lock page queue accesses by vm_page_try_to_cache(). (The accesses
in kern/vfs_bio.c are already locked.)
o Assert that the page queues lock is held in vm_page_try_to_cache().


# d82efd29 20-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Assert that the page queues lock is held in vm_page_try_to_free().


# 15a5d210 20-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Lock page queue accesses by vm_page_cache() in vm_fault() and
vm_pageout_scan(). (The others are already locked.)
o Assert that the page queues lock is held in vm_page_cache().


# eeeaf0fd 18-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Duplicate an odd side-effect of vm_page_wire() in vm_page_allocate()
when VM_ALLOC_WIRED is specified: set the PG_MAPPED bit in flags.
o In both vm_page_wire() and vm_page_allocate() add a comment saying
that setting PG_MAPPED does not belong there.


# 827b2fa0 17-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Introduce an argument, VM_ALLOC_WIRED, that requests vm_page_alloc()
to return a wired page.
o Use VM_ALLOC_WIRED within Alpha's pmap_growkernel(). Also, because
Alpha's pmap_growkernel() calls vm_page_alloc() from within a critical
section, specify VM_ALLOC_INTERRUPT instead of VM_ALLOC_SYSTEM. (Only
VM_ALLOC_INTERRUPT is implemented entirely with a spin mutex.)
o Assert that the page queues mutex is held in vm_page_wire()
on Alpha, just like the other platforms.


# 8b8b8202 14-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Lock page queue accesses by vm_page_wire() that aren't
within a critical section.
o Assert that the page queues lock is held in vm_page_wire()
unless on Alpha.


# eed6f3fd 13-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Lock page queue accesses by vm_page_unmanage().
o Assert that the page queues lock is held in vm_page_unmanage().


# 1f545269 13-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Complete the locking of page queue accesses by vm_page_unwire().
o Assert that the page queues lock is held in vm_page_unwire().
o Make vm_page_lock_queues() and vm_page_unlock_queues() visible
to kernel loadable modules.


# f784043a 05-Jul-2002 Andrew Gallatin <gallatin@FreeBSD.org>

Remove bogus vm_page_wakeup() in vm_page_cowfault() that will cause panics
in the zero-copy send path if a process attempts to write to a page
which is still in flight.

Reviewed by: ken


# 70c17636 04-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Resurrect vm_page_lock_queues(), vm_page_unlock_queues(), and the free
queue lock (revision 1.33 of vm/vm_page.c removed them).
o Make the free queue lock a spin lock because it's sometimes acquired
inside of a critical section.


# 98cb733c 25-Jun-2002 Kenneth D. Merry <ken@FreeBSD.org>

At long last, commit the zero copy sockets code.

MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes.

ti.4: Update the ti(4) man page to include information on the
TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options,
and also include information about the new character
device interface and the associated ioctls.

man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated
links.

jumbo.9: New man page describing the jumbo buffer allocator
interface and operation.

zero_copy.9: New man page describing the general characteristics of
the zero copy send and receive code, and what an
application author should do to take advantage of the
zero copy functionality.

NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS,
TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.

conf/files: Add uipc_jumbo.c and uipc_cow.c.

conf/options: Add the 5 options mentioned above.

kern_subr.c: Receive side zero copy implementation. This takes
"disposable" pages attached to an mbuf, gives them to
a user process, and then recycles the user's page.
This is only active when ZERO_COPY_SOCKETS is turned on
and the kern.ipc.zero_copy.receive sysctl variable is
set to 1.

uipc_cow.c: Send side zero copy functions. Takes a page written
by the user and maps it copy on write and assigns it
kernel virtual address space. Removes copy on write
mapping once the buffer has been freed by the network
stack.

uipc_jumbo.c: Jumbo disposable page allocator code. This allocates
(optionally) disposable pages for network drivers that
want to give the user the option of doing zero copy
receive.

uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are
enabled if ZERO_COPY_SOCKETS is turned on.

Add zero copy send support to sosend() -- pages get
mapped into the kernel instead of getting copied if
they meet size and alignment restrictions.

uipc_syscalls.c: Un-staticize some of the sf* functions so that they
can be used elsewhere. (uipc_cow.c)

if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid
calling malloc() with M_WAITOK. Return an error if
the M_NOWAIT malloc fails.

The ti(4) driver and the wi(4) driver, at least, call
this with a mutex held. This causes witness warnings
for 'ifconfig -a' with a wi(4) or ti(4) board in the
system. (I've only verified for ti(4)).

ip_output.c: Fragment large datagrams so that each segment contains
a multiple of PAGE_SIZE amount of data plus headers.
This allows the receiver to potentially do page
flipping on receives.

if_ti.c: Add zero copy receive support to the ti(4) driver. If
TI_PRIVATE_JUMBOS is not defined, it now uses the
jumbo(9) buffer allocator for jumbo receive buffers.

Add a new character device interface for the ti(4)
driver for the new debugging interface. This allows
(a patched version of) gdb to talk to the Tigon board
and debug the firmware. There are also a few additional
debugging ioctls available through this interface.

Add header splitting support to the ti(4) driver.

Tweak some of the default interrupt coalescing
parameters to more useful defaults.

Add hooks for supporting transmit flow control, but
leave it turned off with a comment describing why it
is turned off.

if_tireg.h: Change the firmware rev to 12.4.11, since we're really
at 12.4.11 plus fixes from 12.4.13.

Add defines needed for debugging.

Remove the ti_stats structure, it is now defined in
sys/tiio.h.

ti_fw.h: 12.4.11 firmware.

ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13,
and my header splitting patches. Revision 12.4.13
doesn't handle 10/100 negotiation properly. (This
firmware is the same as what was in the tree previously,
with the addition of header splitting support.)

sys/jumbo.h: Jumbo buffer allocator interface.

sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to
indicate that the payload buffer can be thrown away /
flipped to a userland process.

socketvar.h: Add prototype for socow_setup.

tiio.h: ioctl interface to the character portion of the ti(4)
driver, plus associated structure/type definitions.

uio.h: Change prototype for uiomoveco() so that we'll know
whether the source page is disposable.

ufs_readwrite.c: Update for new prototype of uiomoveco().

vm_fault.c: In vm_fault(), check to see whether we need to do a page
based copy on write fault.

vm_object.c: Add a new function, vm_object_allocate_wait(). This
does the same thing that vm_object_allocate() does, except
that it gives the caller the opportunity to specify whether
it should wait on the uma_zalloc() of the object structure.

This allows vm objects to be allocated while holding a
mutex. (Without generating WITNESS warnings.)

vm_object_allocate() is implemented as a call to
vm_object_allocate_wait() with the malloc flag set to
M_WAITOK.

vm_object.h: Add prototype for vm_object_allocate_wait().

vm_page.c: Add page-based copy on write setup, clear and fault
routines.

vm_page.h: Add page based COW function prototypes and variable in
the vm_page structure.

Many thanks to Drew Gallatin, who wrote the zero copy send and receive
code, and to all the other folks who have tested and reviewed this code
over the years.


# e78f35b3 25-Jun-2002 Jeff Roberson <jeff@FreeBSD.org>

Turn VM_ALLOC_ZERO into a flag.

Submitted by: tegge
Reviewed by: dillon


# ea0f50bc 30-Apr-2002 Alan Cox <alc@FreeBSD.org>

o Convert the vm_page buckets mutex to a spin lock. (This resolves
an issue on the Alpha platform found by jeff@.)
o Simplify vm_page_lookup().

Reviewed by: jhb


# 44e74ba6 27-Apr-2002 Peter Wemm <peter@FreeBSD.org>

We do not necessarily need to map/unmap pages to zero parts of them.
On systems where physical memory is also direct mapped (alpha, sparc,
ia64 etc) this is slightly harmful.


# cbd53e95 26-Apr-2002 Alan Cox <alc@FreeBSD.org>

o Control access to the vm_page_buckets with a mutex.
o Fix some style(9) bugs.


# 1a87a0da 15-Apr-2002 Peter Wemm <peter@FreeBSD.org>

Pass vm_page_t instead of physical addresses to pmap_zero_page[_area]()
and pmap_copy_page(). This gets rid of a couple more physical addresses
in upper layers, with the eventual aim of supporting PAE and dealing with
the physical addressing mostly within pmap. (We will need either 64 bit
physical addresses or page indexes, possibly both depending on the
circumstances. Leaving this to pmap itself gives more flexibility.)

Reviewed by: jake
Tested on: i386, ia64 and (I believe) sparc64. (my alpha was hosed)


# 6008862b 04-Apr-2002 John Baldwin <jhb@FreeBSD.org>

Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on: i386, alpha, sparc64


# 48f9a594 02-Apr-2002 Jake Burkholder <jake@FreeBSD.org>

Fix a long standing 32bit-ism. Don't assume that the size of a chunk of
memory in phys_avail will fit in 'int', use vm_size_t. This fixes booting
on sparc64 machines with more than 2 gigs of ram.

Thanks to Jan Chrillesen for providing me with access to a 4 gig machine.
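
The failure mode described above can be sketched as follows: the byte length
of a physical memory chunk is computed from its start and end addresses, and
a chunk larger than 2 GB no longer fits in a 32-bit int. The code below is a
hedged illustration only. sketch_vm_size_t stands in for the kernel's
vm_size_t on a 64-bit platform (an assumption), and both function names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for vm_size_t on a 64-bit platform (assumption). */
typedef uint64_t sketch_vm_size_t;

/* Size of the chunk [start, end), kept in a type wide enough to hold it. */
sketch_vm_size_t
chunk_size_wide(uint64_t start, uint64_t end)
{
	assert(end >= start);
	return (end - start);
}

/* Nonzero if the chunk size would not survive storage in a 32-bit int. */
int
chunk_size_overflows_int(uint64_t start, uint64_t end)
{
	return (chunk_size_wide(start, end) > (uint64_t)INT32_MAX);
}
```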


# 8355f576 19-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

This is the first part of the new kernel memory allocator. This replaces
malloc(9) and vm_zone with a slab like allocator.

Reviewed by: arch@


# a1287949 10-Mar-2002 Eivind Eklund <eivind@FreeBSD.org>

- Remove a number of extra newlines that do not belong here according to
style(9)
- Minor space adjustment in cases where we have "( ", " )", if(), return(),
while(), for(), etc.
- Add /* SYMBOL */ after a few #endifs.

Reviewed by: alc


# 2be21c5e 04-Mar-2002 Alan Cox <alc@FreeBSD.org>

o Create vm_pageq_enqueue() to encapsulate code that is duplicated time
and again in vm_page.c and vm_pageq.c.
o Delete unused prototypes. (Mainly a result of the earlier renaming
of various functions from vm_page_*() to vm_pageq_*().)


# d2760948 19-Feb-2002 Tor Egge <tegge@FreeBSD.org>

Add a page queue, PQ_HOLD, that temporarily owns pages with nonzero hold
count that would otherwise be on one of the free queues. This eliminates a
panic when broken programs unmap memory that still has pending IO from raw
devices.

Reviewed by: dillon, alc


# 0c9e4723 19-Feb-2002 Mike Silbersack <silby@FreeBSD.org>

Add one more comment to the OOM changes so that future readers of
the code may better understand the code.

Suggested by: dillon
MFC after: 1 week


# ef6020d1 19-Feb-2002 Mike Silbersack <silby@FreeBSD.org>

Changes to make the OOM killer much more effective:

- Allow the OOM killer to target processes currently locked in
memory. These very often are the ones doing the memory hogging.
- Drop the wakeup priority of processes currently sleeping while
waiting for their page fault to complete. In order for the OOM
killer to work well, the killed process and other system processes
waiting on memory must be allowed to wakeup first.

Reviewed by: dillon
MFC after: 1 week


# 3ebeaf59 13-Dec-2001 Matthew Dillon <dillon@FreeBSD.org>

This fixes a large number of bugs in our NFS client side code. A recent
commit by Kirk also fixed a softupdates bug that could easily be triggered
by server side NFS.

* An edge case with shared R+W mmap()'s and truncate whereby
the system would inappropriately clear the dirty bits on
still-dirty data. (applicable to all filesystems)

THIS FIX TEMPORARILY DISABLED PENDING FURTHER TESTING.
see vm/vm_page.c line 1641

* The straddle case for VM pages and buffer cache buffers when
truncating. (applicable to NFS client side)

* Possible SMP database corruption due to vm_pager_unmap_page()
not clearing the TLB for the other cpu's. (applicable to NFS
client side but could affect all filesystems). Note: not
considered serious since the corruption occurs beyond the file
EOF.

* When flushing a dirty buffer due to B_CACHE getting cleared,
we were accidentally setting B_CACHE again (that is, bwrite() sets
B_CACHE), when we really want it to stay clear after the write
is complete. This resulted in a corrupt buffer. (applicable
to all filesystems but probably only triggered by NFS)

* We have to call vtruncbuf() when ftruncate()ing to remove
any buffer cache buffers. This is still tentative, I may
be able to remove it due to the second bug fix. (applicable
to NFS client side)

* vnode_pager_setsize() race against nfs_vinvalbuf()... we have
to set n_size before calling nfs_vinvalbuf or the NFS code
may recursively vnode_pager_setsize() to the original value
before the truncate. This is what was causing the user mmap
bus faults in the nfs tester program. (applicable to NFS
client side)

* Fix to softupdates (see ufs/ffs/ffs_inode.c 1.73, commit made
by Kirk).

Testing program written by: Avadis Tevanian, Jr.
Testing program supplied by: jkh / Apple (see Dec2001 posting to freebsd-hackers with Subject 'NFS: How to make FreeBSD fall on its face in one easy step')
MFC after: 1 week


# 245df27c 25-Oct-2001 Matthew Dillon <dillon@FreeBSD.org>

Implement kern.maxvnodes. Adjusting kern.maxvnodes now actually has a
real effect.

Optimize vfs_msync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. Improves looping case by 500%.

Optimize ffs_sync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. This makes a couple of assumptions,
which I believe are ok, in regards to vnode stability when the mount list
mutex is held. Improves looping case by 500%.

(more optimization work is needed on top of these fixes)

MFC after: 1 week


# 3516c025 24-Aug-2001 Peter Wemm <peter@FreeBSD.org>

Implement idle zeroing of pages. I've been tinkering with this
on and off since John Dyson left his work-in-progress.

It is off by default for now. sysctl vm.zeroidle_enable=1 to turn it on.

There are some hacks here to deal with the present lack of preemption - we
yield after doing a small number of pages since we won't preempt otherwise.

This is basically Matt's algorithm [with hysteresis] with an idle process
to call it in a similar way it used to be called from the idle loop.

I cleaned up the includes a fair bit here too.


# 0b76df71 21-Aug-2001 Matthew Dillon <dillon@FreeBSD.org>

KASSERT if vm_page_t->wire_count overflows.


# 8ec48c6d 10-Aug-2001 John Baldwin <jhb@FreeBSD.org>

- Remove asleep(), await(), and M_ASLEEP.
- Callers of asleep() and await() have been converted to calling tsleep().
The only caller outside of M_ASLEEP was the ata driver, which called both
asleep() and await() with spl-raised, so there was no need for the
asleep() and await() pair. M_ASLEEP was unused.

Reviewed by: jasone, peter


# 3a9b5daf 30-Jul-2001 Jake Burkholder <jake@FreeBSD.org>

Oops. Last commit to vm_object.c should have got these files too.

Remove the use of atomic ops to manipulate vm_object and vm_page flags.
Giant is required here, so they are superfluous.

Discussed with: dillon


# d3e5863f 22-Jul-2001 Assar Westerlund <assar@FreeBSD.org>

make vm_page_select_cache static

Requested by: bde


# 6d03d577 04-Jul-2001 Matthew Dillon <dillon@FreeBSD.org>

Reorg vm_page.c into vm_page.c, vm_pageq.c, and vm_contig.c (for contigmalloc).
Also removed some spl's and added some VM mutexes, but they are not actually
used yet, so this commit does not really make any operational changes
to the system.

vm_page.c relates to vm_page_t manipulation, including high level deactivation,
activation, etc... vm_pageq.c relates to finding free pages and acquiring
exclusive access to a page queue (exclusivity part not yet implemented).
And the world still builds... :-)


# 1b40f8c0 04-Jul-2001 Matthew Dillon <dillon@FreeBSD.org>

Change inlines back into mainline code in preparation for mutexing. Also,
most of these inlines had been bloated in -current far beyond their
original intent. Normalize prototypes and function declarations to be ANSI
only (half already were). And do some general cleanup.

(kernel size also reduced by 50-100K, but that isn't the prime intent)


# 54d92145 04-Jul-2001 Matthew Dillon <dillon@FreeBSD.org>

whitespace / register cleanup


# 0cddd8f0 04-Jul-2001 Matthew Dillon <dillon@FreeBSD.org>

With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.


# ac8f990b 24-May-2001 Matthew Dillon <dillon@FreeBSD.org>

This patch implements O_DIRECT about 80% of the way. It takes a patchset
Tor created a while ago, removes the raw I/O piece (that has cache coherency
problems), and adds a buffer cache / VM freeing piece.

Essentially this patch causes O_DIRECT I/O to not be left in the cache, but
does not prevent it from going through the cache, hence the 80%. For
the last 20% we need a method by which the I/O can be issued directly to
buffer supplied by the user process and bypass the buffer cache entirely,
but still maintain cache coherency.

I also have the code working under -stable but the changes made to sys/file.h
may not be MFCable, so an MFC is not on the table yet.

Submitted by: tegge, dillon


# 7d4ad42d 22-May-2001 John Baldwin <jhb@FreeBSD.org>

Sort includes.


# 23955314 18-May-2001 Alfred Perlstein <alfred@FreeBSD.org>

Introduce a global lock for the vm subsystem (vm_mtx).

vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb


# fb919e4d 01-May-2001 Mark Murray <markm@FreeBSD.org>

Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h from kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by: bde (with reservations)


# 136d8f42 06-Mar-2001 John Baldwin <jhb@FreeBSD.org>

Unrevert the pmap_map() changes. They weren't broken on x86.

Sense beaten into me by: peter


# 4a01ebd4 06-Mar-2001 John Baldwin <jhb@FreeBSD.org>

Back out the pmap_map() change for now, it isn't completely stable on the
i386.


# 968950e5 05-Mar-2001 John Baldwin <jhb@FreeBSD.org>

- Rework pmap_map() to take advantage of direct-mapped segments on
supported architectures such as the alpha. This allows us to save
on kernel virtual address space, TLB entries, and (on the ia64) VHPT
entries. pmap_map() now modifies the passed in virtual address on
architectures that do not support direct-mapped segments to point to
the next available virtual address. It also returns the actual
address that the request was mapped to.
- On the IA64 don't use a special zone of PV entries needed for early
calls to pmap_kenter() during pmap_init(). This gets us in trouble
because we end up trying to use the zone allocator before it is
initialized. Instead, with the pmap_map() change, the number of needed
PV entries is small enough that we can get by with a static pool that is
used until pmap_init() is complete.

Submitted by: dfr
Debugging help: peter
Tested by: me


# c909b971 01-Mar-2001 Andrew Gallatin <gallatin@FreeBSD.org>

Allocate vm_page_array and vm_page_buckets from the end of the biggest chunk
of memory, rather than from the start.

This fixes problems allocating bouncebuffers on alphas where there is only
1 chunk of memory (unlike PCs where there is generally at least one small
chunk and a large chunk). Having 1 chunk had been fatal, because these
structures take over 13MB on a machine with 1GB of ram. This doesn't leave
much room for other structures and bounce buffers if they're at the front.

Reviewed by: dfr, anderson@cs.duke.edu, silence on -arch
Tested by: Yoriaki FUJIMORI <fujimori@grafin.fujimori.cache.waseda.ac.jp>
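
The placement policy described above can be sketched roughly as: find the
largest chunk of available physical memory, then carve the boot-time
structures off its end so the low addresses at the front stay free. The code
below is an illustration under stated assumptions, not the committed code. It
assumes a phys_avail-style table of [start, end) address pairs terminated by
a pair whose end is 0, and the names biggest_chunk() and carve_from_end() are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * avail holds [start, end) pairs and is terminated by a pair whose end
 * is 0 (assumed layout).  Returns the index of the start address of the
 * largest pair, or 0 if the table is empty.
 */
size_t
biggest_chunk(const uint64_t *avail)
{
	size_t i, best = 0;
	uint64_t size, best_size = 0;

	for (i = 0; avail[i + 1] != 0; i += 2) {
		size = avail[i + 1] - avail[i];
		if (size > best_size) {
			best = i;
			best_size = size;
		}
	}
	return (best);
}

/*
 * Carve nbytes off the end of the chunk at index idx, shrinking the
 * chunk in place.  Returns the base address of the carved region.
 */
uint64_t
carve_from_end(uint64_t *avail, size_t idx, uint64_t nbytes)
{
	assert(avail[idx + 1] - avail[idx] >= nbytes);
	avail[idx + 1] -= nbytes;
	return (avail[idx + 1]);
}
```

Because only the end address moves, the start of the chunk, where bounce
buffers and other low-memory consumers look first, is left untouched.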


# 2b6b0df7 26-Dec-2000 Matthew Dillon <dillon@FreeBSD.org>

This implements a better launder limiting solution. There was a solution
in 4.2-REL which I ripped out in -stable and -current when implementing the
low-memory handling solution. However, maxlaunder turns out to be the saving
grace in certain very heavily loaded systems (e.g. newsreader box). The new
algorithm limits the number of pages laundered in the first pageout daemon
pass. If that is not sufficient then successive passes will be run without any
limit.

Write I/O is now pipelined using two sysctls, vfs.lorunningspace and
vfs.hirunningspace. This prevents excessive buffered writes in the
disk queues which cause long (multi-second) delays for reads. It leads
to more stable (less jerky) and generally faster I/O streaming to disk
by allowing required read ops (e.g. for indirect blocks and such) to occur
without interrupting the write stream, among other things.

NOTE: eventually, filesystem write I/O pipelining needs to be done on a
per-device basis. At the moment it is globalized.


# 065b2580 18-Dec-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Fix floppy drives on machines with lots of RAM.

The fix works by reverting the ordering of free memory so that the
chances of contig_malloc() succeeding increases.

PR: 23291
Submitted by: Andrew Atrens <atrens@nortel.ca>


# 936524aa 18-Nov-2000 Matthew Dillon <dillon@FreeBSD.org>

Implement a low-memory deadlock solution.

Removed most of the hacks that were trying to deal with low-memory
situations prior to now.

The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather than block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.

Code has been added to stall in a low-memory situation prior to a vnode
being locked.

Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.

Implement a number of VFS/BIO fixes

(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.

In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.

Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmetic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.

In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.

There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather than continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.

Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
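
The ~PAGE_MASK cast fix in this commit guards against a general C hazard:
inverting a narrow mask and only then widening it for a 64-bit masking
operation. With a signed int mask the inverted value sign-extends and the
result happens to be right, which matches the commit's remark about signed
arithmetic. With an unsigned mask the inverted value zero-extends and the
upper 32 bits of the offset are lost. The sketch below uses an unsigned mask
to make the hazard visible; SKETCH_PAGE_SIZE, SKETCH_PAGE_MASK and both
function names are hypothetical, and a 32-bit unsigned int is assumed.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(unsigned int) == 4, "sketch assumes 32-bit unsigned int");

#define SKETCH_PAGE_SIZE 4096u			/* unsigned int */
#define SKETCH_PAGE_MASK (SKETCH_PAGE_SIZE - 1u)

/* Inverts in 32 bits, then zero-extends: upper 32 mask bits are 0. */
uint64_t
trunc_page_narrow(uint64_t off)
{
	return (off & ~SKETCH_PAGE_MASK);
}

/* Widens the mask to 64 bits first, then inverts it. */
uint64_t
trunc_page_wide(uint64_t off)
{
	return (off & ~(uint64_t)SKETCH_PAGE_MASK);
}
```

For offsets below 4 GB the two functions agree, which is why this kind of bug
can stay hidden until large files or large address ranges are involved.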


# c904bbbd 03-Jul-2000 Kirk McKusick <mckusick@FreeBSD.org>

Simplify and rationalise the management of the vnode free list
(preparing the code to add snapshots).


# 8b03c8ed 29-May-2000 Matthew Dillon <dillon@FreeBSD.org>

This is a cleanup patch to Peter's new OBJT_PHYS VM object type
and sysv shared memory support for it. It implements a new
PG_UNMANAGED flag that has slightly different characteristics
from PG_FICTITIOUS.

A new sysctl, kern.ipc.shm_use_phys has been added to enable the
use of physically-backed sysv shared memory rather than swap-backed.
Physically backed shm segments are not tracked with PV entries,
allowing programs which use a large shm segment as a rendezvous
point to operate without eating an insane amount of KVM in the
PV entry management. Read: Oracle.

Peter's OBJT_PHYS object will also allow us to eventually implement
page-table sharing and/or 4MB physical page support for such segments.
We're half way there.


# 0385347c 20-May-2000 Peter Wemm <peter@FreeBSD.org>

Implement an optimization of the VM<->pmap API. Pass vm_page_t's directly
to various pmap_*() functions instead of looking up the physical address
and passing that. In many cases, the first thing the pmap code was doing
was going to a lot of trouble to get back the original vm_page_t, or
its shadow pv_table entry.

Inspired by: John Dyson's 1998 patches.

Also:
Eliminate pv_table as a separate thing and build it into a machine
dependent part of vm_page_t. This eliminates having a separate set of
structures that shadow each other in a 1:1 fashion that we often went to
a lot of trouble to translate from one to the other. (see above)
This happens to save 4 bytes of physical memory for each page in the
system. (8 bytes on the Alpha).

Eliminate the use of the phys_avail[] array to determine if a page is
managed (ie: it has pv_entries etc). Store this information in a flag.
Things like device_pager set it because they create vm_page_t's on the
fly that do not have pv_entries. This makes it easier to "unmanage" a
page of physical memory (this will be taken advantage of in subsequent
commits).

Add a function to add a new page to the freelist. This could be used
for reclaiming the previously wasted pages left over from preloaded
loader(8) files.

Reviewed by: dillon


# 5929bcfa 27-Mar-2000 Philippe Charnier <charnier@FreeBSD.org>

Revert spelling mistake I made in the previous commit
Requested by: Alan and Bruce


# 956f3135 26-Mar-2000 Philippe Charnier <charnier@FreeBSD.org>

Spelling


# db5f635a 16-Mar-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Eliminate the undocumented, experimental, non-delivering and highly
dangerous MAX_PERF option.


# 4f79d873 11-Dec-1999 Matthew Dillon <dillon@FreeBSD.org>

Add MAP_NOSYNC feature to mmap(), and MADV_NOSYNC and MADV_AUTOSYNC to
madvise().

This feature prevents the update daemon from gratuitously flushing
dirty pages associated with a mapped file-backed region of memory. The
system pager will still page the memory as necessary and the VM system
will still be fully coherent with the filesystem. Modifications made
by other means to the same area of memory, for example by write(), are
unaffected. The feature works on a page-granularity basis.

MAP_NOSYNC allows one to use mmap() to share memory between processes
without incurring any significant filesystem overhead, putting it in
the same performance category as SysV Shared memory and anonymous memory.

Reviewed by: julian, alc, dg


# e6ce5295 09-Nov-1999 Alan Cox <alc@FreeBSD.org>

Two changes: (1) Use vm_page_unqueue_nowakeup in vm_page_alloc
instead of duplicating the code. (2) If a wired page is passed
to vm_page_free_toq, panic instead of printing a friendly warning.
(If we don't panic here, we'll just panic later in vm_page_unwire
obscuring the problem.)


# be72f788 30-Oct-1999 Alan Cox <alc@FreeBSD.org>

The core of this patch is to vm/vm_page.h. The effects are two-fold: (1) to
eliminate an extra (useless) level of indirection in half of the page
queue accesses and (2) to use a single name for each queue throughout,
instead of, e.g., "vm_page_queue_active" in some places and
"vm_page_queues[PQ_ACTIVE]" in others.

Reviewed by: dillon


# 923502ff 29-Oct-1999 Poul-Henning Kamp <phk@FreeBSD.org>

useracc() the prequel:

Merge the contents (less some trivial bordering on the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.


# 90ecac61 16-Sep-1999 Matthew Dillon <dillon@FreeBSD.org>

Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>

Replace various VM related page count calculations strewn over the
VM code with inlines to aid in readability and to reduce fragility
in the code where modules depend on the same test being performed
to properly sleep and wakeup.

Split out a portion of the page deactivation code into an inline
in vm_page.c to support vm_page_dontneed().

add vm_page_dontneed(), which handles the madvise MADV_DONTNEED
feature in a related commit coming up for vm_map.c/vm_object.c. This
code prevents degenerate cases where an essentially active page may
be rotated through a subset of the paging lists, resulting in premature
disposal.


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# 14068cfe 20-Aug-1999 Alan Cox <alc@FreeBSD.org>

vm_page_alloc and contigmalloc1:
Verify that free pages are not dirty.

Submitted by: dillon


# 0e709935 17-Aug-1999 Alan Cox <alc@FreeBSD.org>

vm_page_free_toq:
Update the comment to reflect the demise of PQ_ZERO and
remove a (now) useless test.


# 3fc3fec6 16-Aug-1999 Alan Cox <alc@FreeBSD.org>

vm_page_free_toq:
Clear the dirty bit mask (vm_page_undirty) before adding the page
to the free page queue.

Submitted by: dillon


# 6c91c1dc 10-Aug-1999 Alan Cox <alc@FreeBSD.org>

contigmalloc1:
If a page is found in the wrong queue, panic instead
of silently ignoring the problem.


# ed6d0b65 10-Aug-1999 Peter Wemm <peter@FreeBSD.org>

Add a contigfree() as a corollary to contigmalloc() as it's not clear
which free routine to use and people are tempted to use free() (which
doesn't work)


# 5d2aec89 31-Jul-1999 Alan Cox <alc@FreeBSD.org>

Change the type of vpgqueues::lcnt from "int *" to "int". The indirection
served no purpose.


# 755292ac 30-Jul-1999 Alan Cox <alc@FreeBSD.org>

vm_page_queue_init:
Remove the initialization of PQ_NONE's cnt and lcnt. They aren't
used.

vm_page_insert:
Remove an unnecessary dereference.

vm_page_wire:
Remove the one and only (and thus pointless) reference
to PQ_NONE's lcnt.


# 3efc015b 01-Jul-1999 Peter Wemm <peter@FreeBSD.org>

Fix some int/long printf problems for the Alpha


# 9c89c228 22-Jun-1999 Alan Cox <alc@FreeBSD.org>

Remove (1) "extern" declarations for variables that were previously
made "static" and (2) initialized but unused variables.


# 60ff97b0 20-Jun-1999 Alan Cox <alc@FreeBSD.org>

Remove vm_object::cache_count and vm_object::wired_count. They are
not used. (Nor is there any planned use by John who introduced them.)

Reviewed by: "John S. Dyson" <toor@dyson.iquest.net>


# c2077034 19-Jun-1999 Alan Cox <alc@FreeBSD.org>

Set cnt.v_page_size to PAGE_SIZE rather than DEFAULT_PAGE_SIZE so that
"vmstat -s" reports the correct value on the Alpha.

Submitted by: Hidetoshi Shimokawa <simokawa@sat.t.u-tokyo.ac.jp>


# 4221e284 02-May-1999 Alan Cox <alc@FreeBSD.org>

The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.

The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most notably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistency issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.

getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vice versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.

There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.

Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.

Submitted by: Matthew Dillon <dillon@apollo.backplane.com>


# 8d17e694 05-Apr-1999 Julian Elischer <julian@FreeBSD.org>

Catch a case spotted by Tor where files mmapped could leave garbage in the
unallocated parts of the last page when the file ended on a frag
but not a page boundary.
Delimited by tags PRE_MATT_MMAP_EOF and POST_MATT_MMAP_EOF,
in files alpha/alpha/pmap.c i386/i386/pmap.c nfs/nfs_bio.c vm/pmap.h
vm/vm_page.c vm/vm_page.h vm/vnode_pager.c miscfs/specfs/spec_vnops.c
ufs/ufs/ufs_readwrite.c kern/vfs_bio.c

Submitted by: Matt Dillon <dillon@freebsd.org>
Reviewed by: Alan Cox <alc@freebsd.org>


# 61fc5ee6 18-Mar-1999 Alan Cox <alc@FreeBSD.org>

Construct the free queue(s) in descending order (by physical
address) so that the first 16MB of physical memory is allocated
last rather than first. On large-memory machines, this avoids
the exhaustion of low physical memory before isa_dmainit has run.
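
The ordering idea can be sketched as a small user-space program. All names here are hypothetical and the list is a plain singly linked stack rather than the kernel's queues. Pushing pages in ascending address order onto the head leaves the highest address at the head, so low memory is handed out last:

```c
#include <stddef.h>

#define DEMO_NPAGES	8
#define DEMO_PAGE_SIZE	4096UL

/* Hypothetical page descriptor for illustration only. */
struct demo_page {
	unsigned long phys_addr;
	struct demo_page *next;
};

static struct demo_page demo_pages[DEMO_NPAGES];
static struct demo_page *demo_free_head;

/*
 * Walk physical memory from low to high and insert each page at the
 * head, so the finished list is in descending address order.
 */
static void
demo_freeq_init(void)
{
	demo_free_head = NULL;
	for (int i = 0; i < DEMO_NPAGES; i++) {
		demo_pages[i].phys_addr = (unsigned long)i * DEMO_PAGE_SIZE;
		demo_pages[i].next = demo_free_head;
		demo_free_head = &demo_pages[i];
	}
}

/* Allocate from the head; returns NULL once the list is empty. */
static struct demo_page *
demo_alloc(void)
{
	struct demo_page *m = demo_free_head;

	if (m != NULL) {
		demo_free_head = m->next;
		m->next = NULL;
	}
	return (m);
}
```

In this sketch the page at physical address 0 is the last one returned by demo_alloc(), which mirrors the intent of keeping low memory available for later consumers.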


# d1bf5d56 24-Feb-1999 Matthew Dillon <dillon@FreeBSD.org>

Remove unnecessary page protects on map_split and collapse operations.
Fix bug where an object's OBJ_WRITEABLE/OBJ_MIGHTBEDIRTY flags do
not get set under certain circumstances ( page rename case ).

Reviewed by: Alan Cox <alc@cs.rice.edu>, John Dyson


# efcae3d3 14-Feb-1999 Matthew Dillon <dillon@FreeBSD.org>

Minor reorganization of vm_page_alloc(). No functional changes have
been made but the code has been reorganized and documented to make
it more readable, reduce the size of the code, and optimize the branch
path caching capabilities that most modern processors have.


# faa273d5 07-Feb-1999 Matthew Dillon <dillon@FreeBSD.org>

Rip out PQ_ZERO queue. PQ_ZERO functionality is now combined in with
PQ_FREE. There is little operational difference other than the kernel
being a few kilobytes smaller and the code being more readable.

* vm_page_select_free() has been *greatly* simplified.
* The PQ_ZERO page queue and supporting structures have been removed
* vm_page_zero_idle() revamped (see below)

PG_ZERO setting and clearing has been migrated from vm_page_alloc()
to vm_page_free[_zero]() and will eventually be guaranteed to remain
tracked throughout a page's life ( if it isn't already ).

When a page is freed, PG_ZERO pages are appended to the appropriate
tailq in the PQ_FREE queue while non-PG_ZERO pages are prepended.
When locating a new free page, PG_ZERO selection operates from within
vm_page_list_find() ( get page from end of queue instead of beginning
of queue ) and then only occurs in the nominal critical path case. If
the nominal case misses, both normal and zero-page allocation devolves
into the same _vm_page_list_find() select code without any specific
zero-page optimizations.
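
The append/prepend policy described above can be sketched with a hand-rolled doubly linked queue. Every name below is hypothetical, and the sketch ignores page coloring and the fallback path; it only shows how one queue can serve both kinds of request by picking from opposite ends:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page descriptor; 'zeroed' stands in for PG_ZERO. */
struct demo_page {
	bool zeroed;
	struct demo_page *prev, *next;
};

struct demo_freeq {
	struct demo_page *head, *tail;
};

/* Free a page: zeroed pages go to the tail, all others to the head. */
static void
demo_free_page(struct demo_freeq *q, struct demo_page *m)
{
	m->prev = m->next = NULL;
	if (q->head == NULL) {
		q->head = q->tail = m;
	} else if (m->zeroed) {
		m->prev = q->tail;
		q->tail->next = m;
		q->tail = m;
	} else {
		m->next = q->head;
		q->head->prev = m;
		q->head = m;
	}
}

/*
 * Allocate a page: requests that prefer a zeroed page take from the
 * tail, others take from the head.  The caller must still check
 * m->zeroed, since the tail page is not zeroed when none are available.
 */
static struct demo_page *
demo_alloc_page(struct demo_freeq *q, bool want_zero)
{
	struct demo_page *m = want_zero ? q->tail : q->head;

	if (m == NULL)
		return (NULL);
	if (m->prev != NULL)
		m->prev->next = m->next;
	else
		q->head = m->next;
	if (m->next != NULL)
		m->next->prev = m->prev;
	else
		q->tail = m->prev;
	m->prev = m->next = NULL;
	return (m);
}
```

The design choice is that neither kind of allocation needs to scan: zeroed pages cluster at one end and dirty pages at the other.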

Additionally, vm_page_zero_idle() has been revamped. Hysteresis has been
added and zero-page tracking adjusted to conform with the other changes.
Currently hysteresis is set at 1/3 (lo) and 1/2 (hi) the number of free
pages. We may wish to increase both parameters as time permits. The
hysteresis is designed to avoid silly zeroing in borderline allocation/free
situations.
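
A generic hysteresis check with the thresholds quoted above might look like the following. This is a sketch under assumptions, not the kernel's vm_page_zero_idle() logic: the structure, function name, and exact start/stop rules are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical state carried between idle-loop invocations. */
struct demo_zero_state {
	bool zeroing;	/* true while idle zeroing should continue */
};

/*
 * Decide whether the idle loop should zero another page.  Zeroing
 * starts when the zeroed-page count drops below 1/3 of the free pages
 * and stops once it reaches 1/2; in between, the previous decision is
 * kept, which avoids flapping near a single threshold.
 */
static bool
demo_should_zero(struct demo_zero_state *st, int free_count, int zero_count)
{
	int lo = free_count / 3;
	int hi = free_count / 2;

	if (zero_count >= hi)
		st->zeroing = false;
	else if (zero_count < lo)
		st->zeroing = true;
	return (st->zeroing);
}
```

With 300 free pages, for example, a count of 120 zeroed pages yields whichever answer the previous call gave, because 120 lies between the low mark of 100 and the high mark of 150.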


# a0e7b3e5 07-Feb-1999 Matthew Dillon <dillon@FreeBSD.org>

Remove L1 cache coloring optimization ( leave L2 cache coloring opt ).

Rewrite vm_page_list_find() and vm_page_select_free() - make inline out
of nominal case.


# 8aef1712 27-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# 2f586e1b 24-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Undo last commit - not a bug, just duplicate code. PG_MAPPED and
PG_WRITEABLE are already cleared by vm_page_protect().


# 68af6d16 23-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

vm_map_split() used to dirty the page manually after calling
vm_page_rename(), but never pulled the page off PQ_CACHE if it was on
PQ_CACHE. Dirty pages in PQ_CACHE are not allowed and a KASSERT was
added in -4.x to test for this... and got hit.

In -4.x, vm_page_rename() automatically dirties the page. This commit
also has it deal with the PQ_CACHE case, deactivating the page in that
case.


# e1a4feaf 23-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Clear PG_MAPPED as well as PG_WRITEABLE when a page is moved to the
cache.


# c9fa34cf 23-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Clear PG_WRITEABLE in vm_page_cache(). This may or may not be a bug,
but the bit should definitely be cleared.


# 060282de 21-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

The hash table used to be a table of doubly-linked list headers ( two
pointers per entry ). The table has been changed to a singly linked
list of vm_page_t pointers. The table has been doubled in size, but
the entries only take half the space so a net-zero change in memory use.

The hash function has been changed, hopefully for the better. The
combination of the larger hash table size and changed function should
keep the chain length down to a reasonable number (0-3, average 1).

vm_object->page_hint has been removed. This 'optimization' was not
only never needed, but costs as much as a hash chain link to implement.
While having page_hint in vm_object might result in better locality
of reference, the cost is not worth the space in vm_object or the
extra instructions in my view.

vm_page_alloc*() functions have been inlined and call a generalized
non-inlined vm_page_alloc_toq() which combines the standard alloc
and zero-page alloc functions together, reducing code size and the L1
cache footprint. Some reordering has been done... not much. The
delinking code should be faster ( because unlinking a doubly-linked list
requires four memory ops and unlinking a singly linked list only requires
two ), and we get a hash consistency check for free.
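
The singly linked unlink and its built-in consistency check can be sketched as follows. The table size, hash function, and names are hypothetical and much simpler than the kernel's. Removal walks the chain with a pointer-to-pointer, so failing to find the page shows that the table and the page disagree:

```c
#include <stddef.h>

#define DEMO_HASH_SIZE	16	/* power of two, so the mask below works */

/* Hypothetical page descriptor with a single hash-chain link. */
struct demo_page {
	unsigned long pindex;
	struct demo_page *hnext;
};

/* Each bucket is one pointer, not a two-pointer list header. */
static struct demo_page *demo_buckets[DEMO_HASH_SIZE];

static struct demo_page **
demo_bucket(unsigned long pindex)
{
	return (&demo_buckets[pindex & (DEMO_HASH_SIZE - 1)]);
}

static void
demo_hash_insert(struct demo_page *m)
{
	struct demo_page **bucket = demo_bucket(m->pindex);

	m->hnext = *bucket;
	*bucket = m;
}

/*
 * Unlink m from its chain.  Returns 1 on success and 0 if m was not on
 * the chain it hashes to, which a kernel would treat as an
 * inconsistency.
 */
static int
demo_hash_remove(struct demo_page *m)
{
	struct demo_page **pp;

	for (pp = demo_bucket(m->pindex); *pp != NULL; pp = &(*pp)->hnext) {
		if (*pp == m) {
			*pp = m->hnext;		/* first store */
			m->hnext = NULL;	/* second store */
			return (1);
		}
	}
	return (0);
}
```

The cost of the walk is bounded by the chain length, which is why keeping chains short matters for this layout.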

vm_page_rename() now automatically sets the page's dirty bits.

vm_page_alloc() does not try to manually inline freeing a cache page.
Instead, it now properly calls vm_page_free(m) ... vm_page_free() is
really too complex to manually inline.

vm_await(), supporting asleep(), has been added.


# 1c7c3c6a 21-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

This is a rather large commit that encompasses the new swapper,
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough in regards to code
cleanup issues.

Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>


# 219cbf59 09-Jan-1999 Eivind Eklund <eivind@FreeBSD.org>

KNFize, by bde.


# 5526d2d9 08-Jan-1999 Eivind Eklund <eivind@FreeBSD.org>

Split DIAGNOSTIC -> DIAGNOSTIC, INVARIANTS, and INVARIANT_SUPPORT as
discussed on -hackers.

Introduce 'KASSERT(assertion, ("panic message", args))' for simple
check + panic.
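
The double-parenthesis calling convention can be illustrated with a user-space sketch. DEMO_KASSERT, DEMO_INVARIANTS, demo_panic(), and demo_unwire() are hypothetical names, and abort() stands in for a kernel panic. The second macro argument is a complete parenthesized argument list, so pasting it after the function name forms the call:

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

#define DEMO_INVARIANTS	1	/* remove to compile the checks out */

/* User-space stand-in for panic(): print the message and abort. */
static void
demo_panic(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	vfprintf(stderr, fmt, ap);
	va_end(ap);
	fputc('\n', stderr);
	abort();
}

#ifdef DEMO_INVARIANTS
/* 'msg' is a parenthesized printf-style list: ("fmt", args...). */
#define DEMO_KASSERT(exp, msg) do {					\
	if (!(exp))							\
		demo_panic msg;						\
} while (0)
#else
#define DEMO_KASSERT(exp, msg) do { } while (0)
#endif

/* Example caller: refuse to drop a wire count that is already zero. */
static int
demo_unwire(int wire_count)
{
	DEMO_KASSERT(wire_count > 0,
	    ("demo_unwire: bad wire_count %d", wire_count));
	return (wire_count - 1);
}
```

Wrapping the message in its own parentheses lets a macro with two fixed parameters carry a variable number of format arguments, which mattered before variadic macros were widely available.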

Reviewed by: msmith


# 9858fcda 22-Dec-1998 Matthew Dillon <dillon@FreeBSD.org>

Update comments to routines in vm_page.c, most especially whether a
routine can block or not as part of a general effort to carefully
document blocking/non-blocking calls in the kernel.


# 4f6e1f8b 11-Nov-1998 David Greenman <dg@FreeBSD.org>

Closed a small race condition between wiring/unwiring pages that involved
the page's wire_count.


# c8d14c76 28-Oct-1998 David Greenman <dg@FreeBSD.org>

Fixed wrong comments in and about vm_page_deactivate().


# 73007561 28-Oct-1998 David Greenman <dg@FreeBSD.org>

Added a second argument, "activate" to the vm_page_unwire() call so that
the caller can select either inactive or active queue to put the page on.


# f5ef029e 25-Oct-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Nitpicking and dusting performed on a train. Removes trivial warnings
about unused variables, labels and other lint.


# 300ee824 21-Oct-1998 David Greenman <dg@FreeBSD.org>

Nuked PG_TABLED flag. Replaced with m->object != NULL.


# 12d534d2 21-Oct-1998 David Greenman <dg@FreeBSD.org>

Add a diagnostic printf for freeing a wired page. This will eventually
be turned into a panic, but I want to make sure that all cases of freeing
pages with wire_count==1 (which is/was allowed) have first been fixed.


# e69763a3 04-Sep-1998 Doug Rabson <dfr@FreeBSD.org>

Cosmetic changes to the PAGE_XXX macros to make them consistent with
the other objects in vm.


# 069e9bc1 24-Aug-1998 Doug Rabson <dfr@FreeBSD.org>

Change various syscalls to use size_t arguments instead of u_int.

Add some overflow checks to read/write (from bde).

Change all modifications to vm_page::flags, vm_page::busy, vm_object::flags
and vm_object::paging_in_progress to use operations which are not
interruptible.
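
As a modern user-space analogue only, the idea of updating flag words with operations that cannot be interrupted midway can be sketched with C11 atomics. The kernel of that era used its own machine-dependent primitives, so the names and the use of <stdatomic.h> here are assumptions for illustration:

```c
#include <stdatomic.h>

/* Hypothetical flag bits for illustration. */
#define DEMO_PG_BUSY	0x0001u
#define DEMO_PG_WANTED	0x0002u

struct demo_page {
	atomic_uint flags;
};

/* Set bits with a single atomic read-modify-write. */
static void
demo_flag_set(struct demo_page *m, unsigned int bits)
{
	atomic_fetch_or(&m->flags, bits);
}

/* Clear bits with a single atomic read-modify-write. */
static void
demo_flag_clear(struct demo_page *m, unsigned int bits)
{
	atomic_fetch_and(&m->flags, ~bits);
}

static unsigned int
demo_flag_get(struct demo_page *m)
{
	return (atomic_load(&m->flags));
}
```

A plain `m->flags |= bits` may compile to a separate load, OR, and store; an interrupt handler that touches the same word between those steps could have its update overwritten.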

Reviewed by: Bruce Evans <bde@zeta.org.au>


# 56e7ede1 26-Jul-1998 Doug Rabson <dfr@FreeBSD.org>

Notify pmap when a page is freed on the alpha to allow it to clean up
its emulated modified/referenced bits.


# 15c73825 14-Jul-1998 Bruce Evans <bde@FreeBSD.org>

Cast pointers to [u]intptr_t instead of to [unsigned] long.


# ac1e407b 11-Jul-1998 Bruce Evans <bde@FreeBSD.org>

Fixed printf format errors.


# be160d60 21-Jun-1998 Bruce Evans <bde@FreeBSD.org>

Removed unused includes.


# ecbb00a2 07-Jun-1998 Doug Rabson <dfr@FreeBSD.org>

This commit fixes various 64bit portability problems required for
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
in line with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.

The prototype FreeBSD/alpha machdep will follow in a couple of days
time.


# 976f208b 01-Jun-1998 John Dyson <dyson@FreeBSD.org>

Cleanup and remove some dead code from the initialization.


# 3c336467 01-May-1998 Peter Wemm <peter@FreeBSD.org>

Seatbelts for vm_page_bits() in case a file offset is passed in rather than
the page offset. If a large file offset was passed in, a large negative
array index could be generated which could cause page faults etc at worst
and file corruption at the least. (Pages are allocated within file
space on page alignment boundaries, so a file offset being passed in here
is harmless to DTRT. The case where this was happening has already been
fixed though, this is in case it happens again).
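
The seatbelt can be sketched as below. This is a simplified, hypothetical version rather than the historical vm_page_bits(): it assumes 4096-byte pages tracked in 512-byte chunks and builds the mask arithmetically instead of using a lookup table. The key step is reducing the base to an offset within the page before it is used to form a shift count or index:

```c
/* Hypothetical constants: 4 KB pages tracked in 512-byte chunks. */
#define DEMO_PAGE_SIZE	4096L
#define DEMO_PAGE_MASK	(DEMO_PAGE_SIZE - 1)
#define DEMO_DEV_BSIZE	512L
#define DEMO_BITS_ALL	0xffu

/*
 * Return the valid-bit mask covering [base, base + size) within one
 * page.  'base' may mistakenly be a file offset; masking it first keeps
 * the chunk index in the range 0..7.
 */
static unsigned int
demo_page_bits(long base, long size)
{
	long first, last;

	if (size <= 0)
		return (0);
	base &= DEMO_PAGE_MASK;			/* the seatbelt */
	if (size > DEMO_PAGE_SIZE - base)
		size = DEMO_PAGE_SIZE - base;	/* clip to the page end */
	first = base / DEMO_DEV_BSIZE;
	last = (base + size + DEMO_DEV_BSIZE - 1) / DEMO_DEV_BSIZE;
	return ((((1u << (last - first)) - 1) << first) & DEMO_BITS_ALL);
}
```

Because pages sit on page-aligned file boundaries, a file offset and the corresponding page offset select the same chunk after masking, so the mistaken caller still gets the right answer.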

Reviewed by: dyson


# c1087c13 15-Apr-1998 Bruce Evans <bde@FreeBSD.org>

Support compiling with `gcc -ansi'.


# bef608bd 15-Mar-1998 John Dyson <dyson@FreeBSD.org>

Some VM improvements, including elimination of a lot of Sig-11
problems. Tor Egge and others have helped with various VM bugs
lately, but don't blame him -- blame me!!!

pmap.c:
1) Create an object for kernel page table allocations. This
fixes a bogus allocation method previously used for such, by
grabbing pages from the kernel object, using bogus pindexes.
(This was a code cleanup, and perhaps a minor system stability
issue.)

pmap.c:
2) Pre-set the modify and accessed bits when prudent. This will
decrease bus traffic under certain circumstances.

vfs_bio.c, vfs_cluster.c:
3) Rather than calculating the beginning virtual byte offset
multiple times, stick the offset into the buffer header, so
that the calculated offset can be reused. (Long long multiplies
are often expensive, and this is a probably unmeasurable performance
improvement, and code cleanup.)

vfs_bio.c:
4) Handle write recursion more intelligently (but not perfectly) so
that it is less likely to cause a system panic, and is also
much more robust.

vfs_bio.c:
5) getblk incorrectly wrote out blocks that are incorrectly sized.
The problem is fixed, and writes blocks out ONLY when B_DELWRI
is true.

vfs_bio.c:
6) Check that already constituted buffers have fully valid pages. If
not, then make sure that the B_CACHE bit is not set. (This was
a major source of Sig-11 type problems.)

vfs_bio.c:
7) Fix a potential system deadlock due to an incorrectly specified
sleep priority while waiting for a buffer write operation. The
change that I made opens the system up to serious problems, and
we need to examine the issue of process sleep priorities.

vfs_cluster.c, vfs_bio.c:
8) Make clustered reads work more correctly (and more completely)
when buffers are already constituted, but not fully valid.
(This was another system reliability issue.)

vfs_subr.c, ffs_inode.c:
9) Create a vtruncbuf function, which is used by filesystems that
can truncate files. The vinvalbuf forced a file sync type operation,
while vtruncbuf only invalidates the buffers past the new end of file,
and also invalidates the appropriate pages. (This was a system reliability
and performance issue.)

10) Modify FFS to use vtruncbuf.

vm_object.c:
11) Make the object rundown mechanism for OBJT_VNODE type objects work
more correctly. Included in that fix, create pager entries for
the OBJT_DEAD pager type, so that paging requests that might slip
in during race conditions are properly handled. (This was a system
reliability issue.)

vm_page.c:
12) Make some of the page validation routines be a little less picky
about arguments passed to them. Also, support page invalidation
change the object generation count so that we handle generation
counts a little more robustly.

vm_pageout.c:
13) Further reduce pageout daemon activity when the system doesn't
need help from it. There should be no additional performance
decrease even when the pageout daemon is running. (This was
a significant performance issue.)

vnode_pager.c:
14) Teach the vnode pager to handle race conditions during vnode
deallocations.


# e163e201 07-Mar-1998 John Dyson <dyson@FreeBSD.org>

Some cruft left over from my megacommit. A page rotation optimization
was a good idea, but can cause instability. That optimization is
now removed.


# 8f9110f6 07-Mar-1998 John Dyson <dyson@FreeBSD.org>

This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifested themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistency, and as a side-effect broke
the vfs.ioopt code. This code might have been committed separately, but
almost everything is interrelated.

1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitous buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups associated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistency problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify its status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and its descendants to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.


# ffc82b0a 28-Feb-1998 John Dyson <dyson@FreeBSD.org>

1) Use a more consistent page wait methodology.
2) Do not unnecessarily force page blocking when paging
pages out.
3) Further improve swap pager performance and correctness,
including fixing the paging in progress deadlock (except
in severe I/O error conditions.)
4) Enable vfs_ioopt=1 as a default.
5) Fix and enable the page prezeroing in SMP mode.

All in all, SMP systems especially should show a significant
improvement in "snappiness."


# 303b270b 08-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Staticize.


# 0b08f5f7 05-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Back out DIAGNOSTIC changes.


# 95461b45 04-Feb-1998 John Dyson <dyson@FreeBSD.org>

1) Start using a cleaner and more consistent page allocator instead
of the various ad-hoc schemes.
2) When bringing in UPAGES, the pmap code needs to do another vm_page_lookup.
3) When appropriate, set the PG_A or PG_M bits a-priori to both avoid some
processor errata, and to minimize redundant processor updating of page
tables.
4) Modify pmap_protect so that it can only remove permissions (as it
originally supported.) The additional capability is not needed.
5) Streamline read-only to read-write page mappings.
6) For pmap_copy_page, don't enable write mapping for source page.
7) Correct and clean-up pmap_incore.
8) Cluster initial kern_exec pagein.
9) Removal of some minor lint from kern_malloc.
10) Correct some ioopt code.
11) Remove some dead code from the MI swapout routine.
12) Correct vm_object_deallocate (to remove backing_object ref.)
13) Fix dead object handling, that had problems under heavy memory load.
14) Add minor vm_page_lookup improvements.
15) Some pages are not in objects, and make sure that the vm_page.c can
properly support such pages.
16) Add some more page deficit handling.
17) Some minor code readability improvements.


# 47cfdb16 04-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Turn DIAGNOSTIC into a new-style option.


# c15541e7 31-Jan-1998 John Dyson <dyson@FreeBSD.org>

contigalloc doesn't place the allocated page(s) into an object, and
now this breaks vm_page_wire (due to wired page accounting per object.)

This should fix a problem as described by Donald Maddox.


# eaf13dd7 31-Jan-1998 John Dyson <dyson@FreeBSD.org>

Change the busy page mgmt, so that when pages are freed, they
MUST be PG_BUSY. It is bogus to free a page that isn't busy,
because it is in a state of being "unavailable" when being
freed. The additional advantage is that the page_remove code
has a better cross-check that the page should be busy and
unavailable for other use. There were some minor problems
with the collapse code, and this plugs those subtle "holes."

Also, the vfs_bio code wasn't checking correctly for PG_BUSY
pages. I am going to develop a more consistent scheme for
grabbing pages, busy or otherwise. For now, we are stuck
with the current morass.


# 2d8acc0f 22-Jan-1998 John Dyson <dyson@FreeBSD.org>

VM level code cleanups.

1) Start using TSM.
Struct procs continue to point to upages structure, after being freed.
Struct vmspace continues to point to pte object and kva space for kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where
it allows us to avoid unneeded restart overhead during traversals, where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c
code. Use generation counts, get rid of unneeded collapse operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.

This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)

This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)


# 47221757 17-Jan-1998 John Dyson <dyson@FreeBSD.org>

Tie up some loose ends in vnode/object management. Remove an unneeded
config option in pmap. Fix a problem with faulting in pages. Clean-up
some loose ends in swap pager memory management.

The system should be much more stable, but all subtle bugs aren't fixed yet.


# 925a3a41 11-Jan-1998 John Dyson <dyson@FreeBSD.org>

Fix some vnode management problems, and better mgmt of vnode free list.
Fix the UIO optimization code.
Fix an assumption in vm_map_insert regarding allocation of swap pagers.
Fix an spl problem in the collapse handling in vm_object_deallocate.
When pages are freed from vnode objects, and the criteria for putting
the associated vnode onto the free list is reached, either put the
vnode onto the list, or put it onto an interrupt safe version of the
list, for further transfer onto the actual free list.
Some minor syntax changes changing pre-decs, pre-incs to post versions.
Remove a bogus timeout (that I added for debugging) from vn_lock.

PHK will likely still have problems with the vnode list management, and
so do I, but it is better than it was.


# 2be70f79 28-Dec-1997 John Dyson <dyson@FreeBSD.org>

Lots of improvements, including restructuring the caching and management
of vnodes and objects. There are some metadata performance improvements
that come along with this. There are also a few prototypes added when
the need is noticed. Changes include:

1) Cleaning up vref, vget.
2) Removal of the object cache.
3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore.
4) Correct some missing LK_RETRY's in vn_lock.
5) Correct the page range in the code for msync.

Be gentle, and please give me feedback asap.


# 0aa89185 06-Nov-1997 John Dyson <dyson@FreeBSD.org>

Fix the "missing page" problem. Also, improve the performance of page
allocation in common cases.


# f0d45e6a 10-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Fix contigmalloc() and contigmalloc1() arguments.


# f8ddc1e2 13-Sep-1997 Peter Wemm <peter@FreeBSD.org>

Print correct function name in panics


# 79624e21 31-Aug-1997 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# 3075778b 04-Aug-1997 John Dyson <dyson@FreeBSD.org>

Get rid of the ad-hoc memory allocator for vm_map_entries, in lieu of
a simple, clean zone type allocator. This new allocator will also be
used for machine dependent pmap PV entries.


# 61600997 01-May-1997 John Dyson <dyson@FreeBSD.org>

Check the correct queue for waking up the pageout daemon. Specifically,
the pageout daemon wasn't always being woken up appropriately when the
(cache + free) queues were depleted.
Submitted by: David S. Miller <davem@jenolan.rutgers.edu>


# c5d593ae 22-Mar-1997 John Dyson <dyson@FreeBSD.org>

Fix a significant error in the accounting for pre-zeroed pages. This
is a candidate for RELENG_2_2...


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 10825343 13-Feb-1997 Garrett Wollman <wollman@FreeBSD.org>

Provide an alternative interface to contigmalloc() which allows a specific
map to be used when allocating the kernel va (e.g., mb_map). The VM
gurus may want to look this over.


# 996c772f 09-Feb-1997 John Dyson <dyson@FreeBSD.org>

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# e0c5a895 28-Nov-1996 John Dyson <dyson@FreeBSD.org>

Make the kernel smaller with at worst a neutral effect on perf by
de-inlining some VM calls. (Actually, I measured a small improvement.)


# 5b0a7408 16-Nov-1996 John Dyson <dyson@FreeBSD.org>

Improve the locality of reference for variables in vm_page and
vm_kern by moving them from .bss to .data. With this change,
there is a measurable perf improvement in fork/exec.


# db2c0faa 04-Nov-1996 John Dyson <dyson@FreeBSD.org>

Vastly improved contigmalloc routine. It does not solve the
problem of allocating contiguous buffer memory in general, but
makes it much more likely to work at boot-up time. The best
chance for an LKM-type load of a sound driver is immediately
after the mount of the root filesystem.

This appears to work for a 64K allocation on an 8MB system.


# 675878e7 14-Oct-1996 John Dyson <dyson@FreeBSD.org>

Move much of the machine dependent code from vm_glue.c into
pmap.c. Along with the improved organization, small proc fork
performance is now about 5%-10% faster.


# 8ba0c490 12-Oct-1996 Bruce Evans <bde@FreeBSD.org>

Removed __pure's and __pure2's. __pure is a no-op for recent versions
of gcc by definition, and __pure2 is a no-op in effect (presumably the
compiler can see when an inline function has no side effects).


# f7d6dab2 06-Oct-1996 John Dyson <dyson@FreeBSD.org>

Fix a problem with the page coloring code that the system will not always
be able to use all of the free pages. This can manifest as a panic
using DIAGNOSTIC, or as a panic on an indirect memory reference.


# 322dfc2b 28-Sep-1996 Bruce Evans <bde@FreeBSD.org>

Fixed undeclared variables for the !(PQ_L2_SIZE > 1) case.

Removed redundant #include.


# a2f4a846 27-Sep-1996 John Dyson <dyson@FreeBSD.org>

Reviewed by:
Submitted by:
Obtained from:


# c7c34a24 14-Sep-1996 Bruce Evans <bde@FreeBSD.org>

Attached vm ddb commands `show map', `show vmochk', `show object',
`show vmopag', `show page' and `show pageq'. Moved all vm ddb stuff
to the ends of the vm source files.

Changed printf() to db_printf(), `indent' to db_indent, and iprintf()
to db_iprintf() in ddb commands. Moved db_indent and db_iprintf()
from vm to ddb.

vm_page.c:
Don't use __pure. Staticized.

db_output.c:
Reduced page width from 80 to 79 to inhibit double spacing for long
lines (there are still some problems if words are printed across
column 79).


# 5070c7f8 08-Sep-1996 John Dyson <dyson@FreeBSD.org>

Addition of page coloring support. Various levels of coloring are afforded.
The default level works with minimal overhead, but one can also enable
full, efficient use of a 512K cache. (Parameters can be generated
to support arbitrary cache sizes also.)
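
As a rough illustration of the idea behind page coloring (not the code
from this commit), the sketch below assigns each physical page a "color"
so that pages of the same color compete for the same cache lines. It
assumes a direct-mapped cache and 4K pages; SK_PAGE_SHIFT and
sk_page_color() are hypothetical names invented for this example.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SHIFT 12                        /* assumption: 4K pages */
#define SK_PAGE_SIZE  (UINT64_C(1) << SK_PAGE_SHIFT)

/*
 * Return the cache color of the page containing physical address pa,
 * for a direct-mapped cache of cache_bytes bytes. The number of colors
 * is the number of pages that fit in the cache.
 */
static unsigned
sk_page_color(uint64_t pa, uint64_t cache_bytes)
{
    uint64_t ncolors;

    /* The cache must hold a whole number of pages (at least one). */
    assert(cache_bytes >= SK_PAGE_SIZE && cache_bytes % SK_PAGE_SIZE == 0);
    ncolors = cache_bytes / SK_PAGE_SIZE;
    return ((unsigned)((pa >> SK_PAGE_SHIFT) % ncolors));
}
```

Under these assumptions a 512K cache yields 128 colors, and an allocator
that hands out consecutive colors to consecutive pages of an object
keeps those pages from evicting each other.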


# 67bf6868 29-Jul-1996 John Dyson <dyson@FreeBSD.org>

Backed out the recent changes/enhancements to the VM code. The
problem with the 'shell scripts' was found, but there was a 'strange'
problem found with a 486 laptop that we could not find. This commit
backs the code back to 25-jul, and will be re-entered after the snapshot
in smaller (more easily tested) chunks.


# 4f4d35ed 26-Jul-1996 John Dyson <dyson@FreeBSD.org>

This commit is meant to solve a couple of VM system problems or
performance issues.

1) The pmap module has had too many inlines, and so the
object file is simply bigger than it needs to be.
Some common code is also merged into subroutines.
2) Removal of some *evil* PHYS_TO_VM_PAGE macro calls.
Unfortunately, a few have needed to be added also.
The removal caused the need for more vm_page_lookups.
I added lookup hints to minimize the need for the
page table lookup operations.
3) Removal of some bogus performance improvements, that
mostly made the code more complex (tracking individual
page table page updates unnecessarily). Those improvements
actually hurt 386 processors perf (not that people who
worry about perf use 386 processors anymore :-)).
4) Changed pv queue manipulations/structures to be TAILQ's.
5) The pv queue code has had some performance problems since
day one. Some significant scalability issues are resolved
by threading the pv entries from the pmap AND the physical
address instead of just the physical address. This makes
certain pmap operations run much faster. This does
not affect most micro-benchmarks, but should help loaded system
performance *significantly*. DG helped and came up with most
of the solution for this one.
6) Most if not all pmap bit operations follow the pattern:
pmap_test_bit();
pmap_clear_bit();
That made for twice the necessary pv list traversal. The
pmap interface now supports only pmap_tc_bit type operations:
pmap_[test/clear]_modified, pmap_[test/clear]_referenced.
Additionally, the modified routine now takes a vm_page_t arg
instead of a phys address. This eliminates a PHYS_TO_VM_PAGE
operation.
7) Several rewrites of routines that contain redundant code to
use common routines, so that there is a greater likelihood of
keeping the cache footprint smaller.
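
The point of item 6 can be sketched as follows: a combined
test-and-clear walks the mapping list once, where a separate test
followed by a clear walks it twice. This is a minimal sketch with
hypothetical types and names (sk_mapping, SK_REFERENCED,
sk_test_and_clear); it is not the pmap code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_REFERENCED 0x1u          /* hypothetical per-mapping bit */

/* Hypothetical stand-in for one entry on a page's pv list. */
struct sk_mapping {
    unsigned flags;
    struct sk_mapping *next;
};

/*
 * Report whether bit was set in any mapping on the list, clearing it
 * everywhere during the same traversal.
 */
static bool
sk_test_and_clear(struct sk_mapping *head, unsigned bit)
{
    bool was_set = false;

    assert(bit != 0);
    for (struct sk_mapping *m = head; m != NULL; m = m->next) {
        if ((m->flags & bit) != 0) {
            was_set = true;
            m->flags &= ~bit;
        }
    }
    return (was_set);
}
```

A caller that previously tested and then cleared the bit would call
sk_test_and_clear() once and act on the return value.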


# 38efa82b 25-Jun-1996 John Dyson <dyson@FreeBSD.org>

This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much safer (eliminates the possibility of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.

Added some sysctls that help with VM system tuning.

New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.

vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.

Please give me feedback on the performance results
associated with these.

2) Enable/disable swapping.
vm.swapping_enabled = 1, default.

vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.

The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".

Each of these can be changed "on the fly."


# 2a4eb04b 20-Jun-1996 John Dyson <dyson@FreeBSD.org>

Improve algorithm for page hash queue. It was previously about
as bad as it could be. This algorithm appears to improve fork
performance (barely) measurably.


# ef743ce6 16-Jun-1996 John Dyson <dyson@FreeBSD.org>

Several bugfixes/improvements:
1) Make it much less likely to miss a wakeup in vm_page_free_wakeup
2) Create a new entry point into pmap: pmap_ts_referenced, eliminates
the need to scan the pv lists twice in many cases. Perhaps there
is a lot more to do here to work on minimizing pv list manipulation.
3) Minor improvements to vm_pageout including the use of pmap_ts_ref.
4) Major changes and code improvement to pmap. This code has had
several serious bugs in page table page manipulation. In order
to simplify the problem, and hopefully solve it once and for all,
page table pages are no longer "managed" with the pv list stuff.
Page table pages are only (mapped and held/wired) or
(free and unused) now. Page table pages are never inactive,
active or cached. These changes have probably fixed the
hold count problems, but if they haven't, then the code is
simpler anyway for future bugfixing.
5) The pmap code has been sorely in need of re-organization, and I
have taken a first (of probably many) steps. Please tell me
if you have any ideas.


# b5b40fa6 16-Jun-1996 John Dyson <dyson@FreeBSD.org>

Various bugfixes/cleanups from me and others:
1) Remove potential race conditions on waking up in vm_page_free_wakeup
by making sure that it is at splvm().
2) Fix another bug in vm_map_simplify_entry.
3) Be more complete about converting from default to swap pager
when an object grows to be large enough that there can be
a problem with data structure allocation under low memory
conditions.
4) Make some madvise code more efficient.
5) Added some comments.


# 419702a4 12-Jun-1996 John Dyson <dyson@FreeBSD.org>

Fix a very significant cnt.v_wire_count leak in vm_page.c, and some
minor leaks in pmap.c. Bruce Evans made me aware of this problem.


# 886d3e11 08-Jun-1996 John Dyson <dyson@FreeBSD.org>

Adjust the threshold for blocking on movement of pages from the cache
queue in vm_fault.

Move the PG_BUSY in vm_fault to the correct place.

Remove redundant/unnecessary code in pmap.c.

Properly block on rundown of page table pages, if they are busy.

I think that the VM system is in pretty good shape now, and the following
individuals (among others, in no particular order) have helped with this
recent bunch of bugs, thanks! If I left anyone out, I apologize!

Stephen McKay, Stephen Hocking, Eric J. Chet, Dan O'Brien, James Raynard,
Marc Fournier.


# 6b6f0008 04-Jun-1996 John Dyson <dyson@FreeBSD.org>

Keep page-table pages from ever being sensed as dirty. This should fix
some problems with the page-table page management code, since it can't
deal with the notion of page-table pages being paged out or in transit.
Also, clean up some stylistic issues per some suggestions from
Stephen McKay.


# f35329ac 30-May-1996 John Dyson <dyson@FreeBSD.org>

This commit is dual-purpose, to fix more of the pageout daemon
queue corruption problems, and to apply Gary Palmer's code cleanups.
David Greenman helped with these problems also. There is still
a hang problem using X in small memory machines.


# f777ab7b 23-May-1996 John Dyson <dyson@FreeBSD.org>

Add an assert to vm_page_cache. We should never cache a dirty page.


# b18bfc3d 17-May-1996 John Dyson <dyson@FreeBSD.org>

This set of commits to the VM system does the following, and contains
contributions or ideas from Stephen McKay <syssgm@devetir.qld.gov.au>,
Alan Cox <alc@cs.rice.edu>, David Greenman <davidg@freebsd.org> and me:

More usage of the TAILQ macros. Additional minor fix to queue.h.
Performance enhancements to the pageout daemon.
Addition of a wait in the case that the pageout daemon
has to run immediately.
Slightly modify the pageout algorithm.
Significant revamp of the pmap/fork code:
1) PTE's and UPAGES's are NO LONGER in the process's map.
2) PTE's and UPAGES's reside in their own objects.
3) TOTAL elimination of recursive page table pagefaults.
4) The page directory now resides in the PTE object.
5) Implemented pmap_copy, thereby speeding up fork time.
6) Changed the pv entries so that the head is a pointer
and not an entire entry.
7) Significant cleanup of pmap_protect, and pmap_remove.
8) Removed significant amounts of machine dependent
fork code from vm_glue. Pushed much of that code into
the machine dependent pmap module.
9) Support more completely the reuse of already zeroed
pages (Page table pages and page directories) as being
already zeroed.
Performance and code cleanups in vm_map:
1) Improved and simplified allocation of map entries.
2) Improved vm_map_copy code.
3) Corrected some minor problems in the simplify code.
Implemented splvm (combo of splbio and splimp.) The VM code now
seldom uses splhigh.
Improved the speed of and simplified kmem_malloc.
Minor mod to vm_fault to avoid using pre-zeroed pages in the case
of objects with backing objects along with the already
existing condition of having a vnode. (If there is a backing
object, there will likely be a COW... With a COW, it isn't
necessary to start with a pre-zeroed page.)
Minor reorg of source to perhaps improve locality of ref.


# 30dcfc09 27-Mar-1996 John Dyson <dyson@FreeBSD.org>

VM performance improvements, and reorder some operations in VM fault
in anticipation of a fix in pmap that will allow the mlock system call to work
without panicking the system.


# 8169788f 11-Mar-1996 Peter Wemm <peter@FreeBSD.org>

Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all
files are off the vendor branch, so this should not change anything.

A "U" marker generally means that the file was not changed in between
the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally
means that there was a change.


# 9ee58740 08-Mar-1996 John Dyson <dyson@FreeBSD.org>

Modify a threshold for waking up the pageout daemon. Also, add a consistency
check for making sure that held pages aren't freed (DG).


# de5f6a77 01-Mar-1996 John Dyson <dyson@FreeBSD.org>

1) Eliminate unnecessary bzero of UPAGES.
2) Eliminate unnecessary copying of pages during/after forks.
3) Add user map simplification.


# 324e9ed2 26-Jan-1996 Bruce Evans <bde@FreeBSD.org>

Added a `boundary' arg to vm_page_alloc_contig(). Previously the only
way to avoid crossing a 64K DMA boundary was to specify an alignment
greater than the size even when the alignment didn't matter, and for
sizes larger than a page, this reduced the chance of finding enough
contiguous pages. E.g., allocations of 8K not crossing a 64K boundary
previously had to be allocated on 8K boundaries; now they can be
allocated on any 4K boundary except (64 * n + 60)K.
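
The arithmetic above can be checked with a small helper. This is only
a sketch of the boundary test, assuming the boundary is a power of two
given in bytes; sk_crosses_boundary() is a hypothetical name, not the
function changed by this commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if the range [start, start + size) crosses a multiple of
 * boundary. Two addresses lie in the same boundary-sized window exactly
 * when they agree in all bits above the boundary mask.
 */
static bool
sk_crosses_boundary(uint64_t start, uint64_t size, uint64_t boundary)
{
    uint64_t last = start + size - 1;

    /* Assumption: non-empty range and power-of-two boundary. */
    assert(size > 0 && boundary != 0 && (boundary & (boundary - 1)) == 0);
    return (((start ^ last) & ~(boundary - 1)) != 0);
}
```

With a 64K boundary, an 8K range starting at 56K fits, while one
starting at 60K or at 124K (that is, 64 * 1 + 60) crosses into the next
64K window and is rejected.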

Fixed bugs in vm_page_alloc_contig():
- the last page wasn't allocated for sizes smaller than a page.
- failures of kmem_alloc_pageable() weren't handled.

Mutated vm_page_alloc_contig() to create a more convenient interface
named contigmalloc(). This is the same as the one in 1.1.5 except
it has `low' and `high' args, and the `alignment' and `boundary'
args are multipliers instead of masks.


# bd7e5f99 18-Jan-1996 John Dyson <dyson@FreeBSD.org>

Eliminated many redundant vm_map_lookup operations for vm_mmap.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do a lot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30 seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.


# 0e41ee30 04-Jan-1996 Garrett Wollman <wollman@FreeBSD.org>

Convert DDB to new-style option.


# f2c6b65b 17-Dec-1995 Bruce Evans <bde@FreeBSD.org>

Fixed 1TB filesize changes. Some pindexes had bogus names and types
but worked because vm_pindex_t is indistinguishable from vm_offset_t.


# f708ef1b 14-Dec-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Another mega commit to staticize things.


# ec07c60c 11-Dec-1995 John Dyson <dyson@FreeBSD.org>

Some DIAGNOSTIC code was enabled all of the time in error. The
diagnostic code is now conditional on #ifdef DIAGNOSTIC again.


# a316d390 10-Dec-1995 John Dyson <dyson@FreeBSD.org>

Changes to support 1Tb filesizes. Pages are now named by an
(object,index) pair instead of (object,offset) pair.
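
To illustrate why naming pages by index extends the reachable file
size, the sketch below converts between byte offsets and page indices.
The 32-bit index type and 4K page size are assumptions made for the
example, and the sk_ names are hypothetical; the commit does not state
the types it used.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SHIFT 12            /* assumption: 4K pages */

typedef uint32_t sk_pindex_t;       /* assumption: 32-bit page index */

/* Convert a byte offset within an object to the index of its page. */
static sk_pindex_t
sk_off_to_idx(uint64_t off)
{
    /* The resulting index must fit in the index type. */
    assert((off >> SK_PAGE_SHIFT) <= UINT32_MAX);
    return ((sk_pindex_t)(off >> SK_PAGE_SHIFT));
}

/* Convert a page index back to the byte offset of the page's start. */
static uint64_t
sk_idx_to_off(sk_pindex_t idx)
{
    return ((uint64_t)idx << SK_PAGE_SHIFT);
}
```

Under these assumptions a 32-bit byte offset tops out at 4 GB, whereas
a 32-bit page index can name a page at the 1 TB mark (index 1 << 28)
with room to spare.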


# efeaf95a 06-Dec-1995 David Greenman <dg@FreeBSD.org>

Untangled the vm.h include file spaghetti.


# cac597e4 02-Dec-1995 Bruce Evans <bde@FreeBSD.org>

Completed function declarations and/or added prototypes.

Staticized some functions.

__purified some functions. Some functions were bogusly declared as
returning `const'. This hasn't done anything since gcc-2.5. For
later versions of gcc, the equivalent is __attribute__((const)) at
the end of function declarations.


# 3af76890 19-Nov-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unused vars & funcs, make things static, protoize a little bit.


# a91c5a7e 22-Oct-1995 John Dyson <dyson@FreeBSD.org>

Get rid of machine-dependent NBPG and replace with PAGE_SIZE.


# f70f05f2 03-Sep-1995 John Dyson <dyson@FreeBSD.org>

Machine independent changes to support pre-zeroed free pages. This
significantly improves demand-zero performance.


# 4589a4b5 03-Sep-1995 John Dyson <dyson@FreeBSD.org>

New subroutine "vm_page_set_validclean" for a vfs_bio improvement.


# b367ddb1 19-Jul-1995 David Greenman <dg@FreeBSD.org>

#if 0'd one of the DIAGNOSTIC checks in vm_page_alloc(). It was too
expensive for "normal" use.


# 24a1cce3 13-Jul-1995 David Greenman <dg@FreeBSD.org>

NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!

Much needed overhaul of the VM system. Included in this first round of
changes:

1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".

2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
essentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.

3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.

4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.

5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.

6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.

7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.

8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.

9) Some almost useless debugging code removed.

10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure essentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.

11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.

12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).

13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.

14) (dyson) Sharing maps have been removed. Their marginal usefulness in a
threads design can be worked around in other ways. Both #13 and #14
were done to simplify the code and improve readability and
maintainability. (As were most all of these changes)

TODO:

1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.

2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.

3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.

4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.

5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).


# 9b2e5354 30-May-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Remove trailing whitespace.


# c3cb3e12 15-Apr-1995 David Greenman <dg@FreeBSD.org>

Moved some zero-initialized variables into .bss. Made code intended to be
called only from DDB #ifdef DDB. Removed some completely unused globals.


# 8c3d9c40 16-Apr-1995 David Greenman <dg@FreeBSD.org>

Removed gratuitous m->blah=0 assignments when initializing the vm_page
structs in vm_page_startup(). The vm_page structs are already completely
zeroed.


# 2fdccd5e 16-Apr-1995 David Greenman <dg@FreeBSD.org>

Make "print_page_info" #ifdef DDB.


# f6b04d2b 09-Apr-1995 David Greenman <dg@FreeBSD.org>

Changes from John Dyson and myself:

Fixed remaining known bugs in the buffer IO and VM system.

vfs_bio.c:
Fixed some race conditions and locking bugs. Improved performance
by removing some (now) unnecessary code and fixing some broken
logic.
Fixed process accounting of # of FS outputs.
Properly handle NFS interrupts (B_EINTR).

(various)
Replaced calls to clrbuf() with calls to an optimized routine
called vfs_bio_clrbuf().

(various FS sync)
Sync out modified vnode_pager backed pages.

ffs_vnops.c:
Do two passes: Sync out file data first, then indirect blocks.

vm_fault.c:
Fixed deadly embrace caused by acquiring locks in the wrong order.

vnode_pager.c:
Changed to use buffer I/O system for writing out modified pages. This
should fix the problem with the modification date previously not getting
updated. Also dramatically simplifies the code. Note that this is
going to change in the future and be implemented via VOP_PUTPAGES().

vm_object.c:
Fixed a pile of bugs related to cleaning (vnode) objects. The performance
of vm_object_page_clean() is terrible when dealing with huge objects,
but this will change when we implement a binary tree to keep the object
pages sorted.

vm_pageout.c:
Fixed broken clustering of pageouts. Fixed race conditions and other
lockup style bugs in the scanning of pages. Improved performance.


# 7fd9a8b1 25-Mar-1995 David Greenman <dg@FreeBSD.org>

Implemented cnt.v_reactivated and moved vm_page_activate() routine to
before vm_page_deactivate().


# edf8a815 19-Mar-1995 David Greenman <dg@FreeBSD.org>

Removed redundant newlines that were in some panic strings.


# 806e3860 17-Mar-1995 David Greenman <dg@FreeBSD.org>

In vm_page_alloc_contig: Removed a redundant semicolon and used 'm' instead
of &pga[i] in one place.


# b5e8ce9f 16-Mar-1995 Bruce Evans <bde@FreeBSD.org>

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# f919ebde 01-Mar-1995 David Greenman <dg@FreeBSD.org>

Various changes from John and myself that do the following:

New functions created - vm_object_pip_wakeup and pagedaemon_wakeup that
are used to reduce the actual number of wakeups.
New function vm_page_protect which is used in conjunction with some new
page flags to reduce the number of calls to pmap_page_protect.
Minor changes to reduce unnecessary spl nesting.
Rewrote vm_page_alloc() to improve readability.
Various other mostly cosmetic changes.


# 6f2b142e 22-Feb-1995 David Greenman <dg@FreeBSD.org>

vm_page.c:
Use request==VM_ALLOC_NORMAL rather than object!=kmem_object in deciding
if the caller is "important" in vm_page_alloc(). Also established a new
low threshold for non-interrupt allocations via cnt.v_interrupt_free_min.

vm_pageout.c:
Various algorithmic cleanup. Some calculations simplified. Initialize
cnt.v_interrupt_free_min to 2 pages.

Submitted by: John Dyson


# 5e716206 22-Feb-1995 David Greenman <dg@FreeBSD.org>

Just return in the case of a page not on any queue in vm_page_unqueue().
Return VM_PAGE_BITS_ALL even if size > PAGE_SIZE in vm_page_bits().

Submitted by: John Dyson


# d0686727 20-Feb-1995 David Greenman <dg@FreeBSD.org>

Don't allow act_count to exceed ACT_MAX when bumping it up.
Small optimization to vm_page_bits().

Submitted by: John Dyson
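
The act_count change amounts to a saturating increment. A minimal
sketch, with a hypothetical limit SK_ACT_MAX standing in for ACT_MAX
(its value here is an assumption chosen for the example):

```c
#include <assert.h>

#define SK_ACT_MAX 64               /* hypothetical stand-in for ACT_MAX */

/*
 * Bump an activity counter by advance, clamping the result so that it
 * never exceeds SK_ACT_MAX.
 */
static int
sk_bump_act_count(int act_count, int advance)
{
    assert(act_count >= 0 && advance >= 0);
    act_count += advance;
    if (act_count > SK_ACT_MAX)
        act_count = SK_ACT_MAX;
    return (act_count);
}
```

Clamping keeps a heavily referenced page from accumulating a count so
large that it would take an unreasonable number of scans to age out.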


# d89ced81 20-Feb-1995 David Greenman <dg@FreeBSD.org>

Fully initialize pages returned via vm_page_alloc_contig() so that the
memory can be later freed.


# a1f6d91c 02-Feb-1995 David Greenman <dg@FreeBSD.org>

swap_pager.c:
Fixed long standing bug in freeing swap space during object collapses.
Fixed 'out of space' messages from printing out too often.
Modified to use new kmem_malloc() calling convention.
Implemented an additional stat in the swap pager struct to count the
amount of space allocated to that pager. This may be removed at some
point in the future.
Minimized unnecessary wakeups.

vm_fault.c:
Don't try to collect fault stats on 'swapped' processes - there aren't
any upages to store the stats in.
Changed read-ahead policy (again!).

vm_glue.c:
Be sure to gain a reference to the process's map before swapping.
Be sure to lose it when done.

kern_malloc.c:
Added the ability to specify if allocations are at interrupt time or
are 'safe'; this affects what types of pages can be allocated.

vm_map.c:
Fixed a variety of map lock problems; there's still a lurking bug that
will eventually bite.

vm_object.c:
Explicitly initialize the object fields rather than bzeroing the struct.
Eliminated the 'rcollapse' code and folded its functionality into the
"real" collapse routine.
Moved an object_unlock() so that the backing_object is protected in
the qcollapse routine.
Make sure nobody fools with the backing_object when we're destroying it.
Added some diagnostic code which can be called from the debugger that
looks through all the internal objects and makes certain that they
all belong to someone.

vm_page.c:
Fixed a rather serious logic bug that would result in random system
crashes. Changed pagedaemon wakeup policy (again!).

vm_pageout.c:
Removed unnecessary page rotations on the inactive queue.
Changed the number of pages to explicitly free to just free_reserved
level.

Submitted by: John Dyson


# 6d40c3d3 24-Jan-1995 David Greenman <dg@FreeBSD.org>

Added ability to detect sequential faults and DTRT. (swap_pager.c)
Added hook for pmap_prefault() and use symbolic constant for new third
argument to vm_page_alloc() (vm_fault.c, various)
Changed the way that upages and page tables are held. (vm_glue.c)
Fixed architectural flaw in allocating pages at interrupt time that was
introduced with the merged cache changes. (vm_page.c, various)
Adjusted some algorithms to achieve better paging performance and to
accommodate the fix for the architectural flaw mentioned above. (vm_pageout.c)
Fixed pbuf handling problem, changed policy on handling read-behind page.
(vnode_pager.c)

Submitted by: John Dyson


# edfab85b 15-Jan-1995 David Greenman <dg@FreeBSD.org>

Moved some splx's down a few lines in vm_page_insert and vm_page_remove
to make the locking a bit more clear - this change is currently a NOP
as the calls to those routines are already at splhigh().


# a776a317 10-Jan-1995 David Greenman <dg@FreeBSD.org>

Kill VM_PAGE_INIT macro as it is only used once and makes the code more
difficult to understand. Got rid of unused vm_page flags.


# 480dff54 10-Jan-1995 David Greenman <dg@FreeBSD.org>

Fixed some formatting weirdness that I overlooked in the previous commit.


# 0d94caff 09-Jan-1995 David Greenman <dg@FreeBSD.org>

These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.

The majority of the merged VM/cache work is by John Dyson.

The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.

vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.

vfs_cluster.c, vfs_subr.c:
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.

vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.

vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground up.

vm_fault.c, vm_page.c, pmap.c, vm_object.c:
Support for small-block filesystems with merged VM/buffer cache scheme.

pmap.c, vm_map.c:
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.

vm_glue.c:
Much simpler and more effective swapping code. No more gratuitous swapping.

proc.h:
Fixed the problem that the p_lock flag was not being cleared on a fork.

swap_pager.c, vnode_pager.c:
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.

machdep.c:
Changes to better support the parameter values for the merged VM/buffer cache
scheme.

machdep.c, kern_exec.c, vm_glue.c:
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.

ffs_inode.c, ufs_inode.c, ufs_readwrite.c:
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.

Submitted by: John Dyson and David Greenman


# 47c9acfd 23-Oct-1994 David Greenman <dg@FreeBSD.org>

Changed a thread_sleep into an spl protected tsleep. A deadlock can occur
otherwise. Minor efficiency improvement in vm_page_free().

Submitted by: John Dyson


# a58d1fa1 18-Oct-1994 David Greenman <dg@FreeBSD.org>

Fix the remaining vmmeter counters. They all now work correctly.


# 05f0fdd2 08-Oct-1994 Poul-Henning Kamp <phk@FreeBSD.org>

Cosmetics: unused vars, ()'s, #include's &c &c to silence gcc.
Reviewed by: davidg


# 1ffd2a2c 27-Sep-1994 David Greenman <dg@FreeBSD.org>

Previous commit should have read ...in vm_page_alloc_contig().
...(this commit): moved initialization of 'start' to make it more clear
that it is initialized properly (also in vm_page_alloc_contig).


# 5992708a 27-Sep-1994 David Greenman <dg@FreeBSD.org>

Fixed another bug, and cleaned up the code.


# 0d040c7e 27-Sep-1994 David Greenman <dg@FreeBSD.org>

Fixed multiple bugs in previous version of vm_page_alloc_contig.


# d3c2cf7a 27-Sep-1994 David Greenman <dg@FreeBSD.org>

1) New "vm_page_alloc_contig" routine by me.
2) Created a new vm_page flag "PG_FREE" to help track free pages.
3) Use PG_FREE flag to detect inconsistencies in a few places.


# 28b5c68f 09-Aug-1994 David Greenman <dg@FreeBSD.org>

Fixed vm_page_deactivate to deal with getting called with a page that's
not on any queue. This is an old patchkit days fix.

Reviewed by: John Dyson and David Greenman
Submitted by: originally by Paul Mackerras


# a481f200 07-Aug-1994 David Greenman <dg@FreeBSD.org>

Provide support for upcoming merged VM/buffer cache, and fixed a few bugs
that haven't appeared to manifest themselves (yet).

Submitted by: John Dyson


# 03e6c253 01-Aug-1994 David Greenman <dg@FreeBSD.org>

Removed all code related to the pagescan daemon, and changed 'act_count'
adjustments to compensate for a world without the pagescan daemon.


# 26f9a767 25-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources