Cross Reference: /freebsd-current/sys/vm/vm

History log of /freebsd-current/sys/vm/vm_page.c
Revision	Date	Author	Comments
# 10f2e94a	02-Jan-2024	Jason A. Harmening <jah@FreeBSD.org>	vm_page_reclaim_contig(): update comment to chase recent changes Commit 2619c5ccfe ("Avoid waiting on physical allocations that can't possibly be satisfied") changed the return value from bool to errno. Adjust the function description to match reality.
# 2619c5cc	20-Nov-2023	Jason A. Harmening <jah@FreeBSD.org>	Avoid waiting on physical allocations that can't possibly be satisfied - Change vm_page_reclaim_contig[_domain] to return an errno instead of a boolean. 0 indicates a successful reclaim, ENOMEM indicates lack of available memory to reclaim, with any other error (currently only ERANGE) indicating that reclamation is impossible for the specified address range. Change all callers to only follow up with vm_page_wait* in the ENOMEM case. - Introduce vm_domainset_iter_ignore(), which marks the specified domain as unavailable for further use by the iterator. Use this function to ignore domains that can't possibly satisfy a physical allocation request. Since WAITOK allocations run the iterators repeatedly, this avoids the possibility of infinitely spinning in domain iteration if no available domain can satisfy the allocation request. PR: 274252 Reported by: kevans Tested by: kevans Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D42706
# 29363fb4	23-Nov-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove ancient SCCS tags. Remove ancient SCCS tags from the tree, automated scripting, with two minor fixup to keep things compiling. All the common forms in the tree were removed with a perl script. Sponsored by: Netflix
# a55fbda8	12-Oct-2023	Zhenlei Huang <zlei@FreeBSD.org>	vm_page: Add corresponding sysctl knob for loader tunable The loader tunable 'vm.pgcache_zone_max_pcpu' does not have corresponding sysctl MIB entry. Add it so that it can be retrieved, and `sysctl -T` will also report it correctly. Reviewed by: markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D42138
# 685dc743	16-Aug-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove $FreeBSD$: one-line .c pattern Remove /^[\s]__FBSDID$"\$FreeBSD\$"$;?\s*\n/
# 9e817428	16-Jun-2023	Doug Moore <dougm@FreeBSD.org>	vm_phys: add binary segment search Replace several sequential searches for a segment that contains a phyiscal address with a call to a function that does it by binary search. In vm_page_reclaim_contig_domain_ext, find the first segment to reclaim from, and reclaim from each subsequent appropriate segment. Eliminate vm_phys_scan_contig. Reviewed by: alc, markj Differential Revision: https://reviews.freebsd.org/D40058
# 6062d9fa	05-Jun-2023	Mark Johnston <markj@FreeBSD.org>	vm_phys: Change the return type of vm_phys_unfree_page() to bool This is in keeping with the trend of removing uses of boolean_t, and the sole caller was implicitly converting it to a "bool". No functional change intended. Reviewed by: dougm, alc, imp, kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D40401
# 8b0dafdb	08-May-2023	Andrew Gallatin <gallatin@FreeBSD.org>	vm: implement vm_page_reclaim_contig_domain_ext() Implement vm_page_reclaim_contig_domain_ext() to reclaim multiple contiguous regions at once. This makes it more efficient for users that need multiple contiguous regions to reclaim those regions efficiently. This is needed because callers like ktls may need to reclaim many contiguous regions, and each scan of physical memory can take multiple seconds on a large memory machine (order of 100GB of RMA). Rather than modifying the core algorithm, I extended vm_page_reclaim_contig_domain() to take a "desired_runs" argument to allow the caller to request that it reclaim more than just a single run. There is no functional change intended for all existing callers. The first user for this interface is the ktls code (https://reviews.freebsd.org/D39421). By reclaiming multiple runs, ktls goes from consuming hours of CPU to refill its buffer zone to just seconds or minutes. Differential Revision: https://reviews.freebsd.org/D39739 Sponsored by: Netflix Reviewed by: alc, jhb, markj
# 32494491	16-Dec-2022	Konstantin Belousov <kib@FreeBSD.org>	vm_page_grab_valid(): clear *mp in case of pager denying page allocation Same as it is done in other error return cases. Callers depend on error case returning NULL, e.g. vm_imgact_hold_page(). Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D37719
# 1cac76c9	14-Dec-2022	Andrew Gallatin <gallatin@FreeBSD.org>	vm: reduce lock contention when processing vm batchqueues Rather than waiting until the batchqueue is full to acquire the lock & process the queue, we now start trying to acquire the lock using trylocks when the batchqueue is 1/2 full. This removes almost all contention on the vm pagequeue mutex for for our busy sendfile() based web workload. It also greadly reduces the amount of time a network driver ithread remains blocked on a mutex, and eliminates some packet drops under heavy load. So that the system does not loose the benefit of processing large batchqueues, I've doubled the size of the batchqueues. This way, when there is no contention, we process the same batch size as before. This has been run for several months on a busy Netflix server, as well as on my personal desktop. Reviewed by: markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D37305
# ec201ddd	20-Oct-2022	Konstantin Belousov <kib@FreeBSD.org>	vm_pager: add method to veto page allocation Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D37097
# d537d1f1	20-Oct-2022	Konstantin Belousov <kib@FreeBSD.org>	vm_pager: add methods for page insertion and removal notifications Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D37097
# cfbf1da0	09-Nov-2022	Anton Rang <rang@acm.org>	vm_page_unswappable: remove wrong assertion markj says: ...the assertion is incorrect and should simply be removed. It has been racy since we removed the use of the page hash lock to synchronize wiring of pages. PR: 267621 Reviewed by: markj, Anton Rang <rang@acm.org> MFC after: 1 week Sponsored by: Dell Inc. Differential Revision: https://reviews.freebsd.org/D37320
# 934bfc12	17-Oct-2022	Konstantin Belousov <kib@FreeBSD.org>	Add vm_page_any_valid() Use it and several other vm_page_*_valid() functions in more places. Suggested and reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D37024
# 2c9dc238	05-Oct-2022	Mark Johnston <markj@FreeBSD.org>	vm_page: Fix a logic error in the handling of PQ_ACTIVE operations As an optimization, vm_page_activate() avoids requeuing a page that's already in the active queue. A page's location in the active queue is mostly unimportant. When a page is unwired and placed back in the page queues, vm_page_unwire() avoids moving pages out of PQ_ACTIVE to honour the request, the idea being that they're likely mapped and so will simply get bounced back in to PQ_ACTIVE during a queue scan. In both cases, if the page was logically in PQ_ACTIVE but had not yet been physically enqueued (i.e., the page is in a per-CPU batch), we would end up clearing PGA_REQUEUE from the page. Then, batch processing would ignore the page, so it would end up unwired and not in any queues. This can arise, for example, when a page is allocated and then vm_page_activate() is called multiple times in quick succession. The result is that the page is hidden from the page daemon, so while it will be freed when its VM object is destroyed, it cannot be reclaimed under memory pressure. Fix the bug: when checking if a page is in PQ_ACTIVE, only perform the optimization if the page is physically enqueued. PR: 256507 Fixes: f3f38e2580f1 ("Start implementing queue state updates using fcmpset loops.") Reviewed by: alc, kib MFC after: 1 week Sponsored by: E-CARD Ltd. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D36839
# 1b89d40f	17-Aug-2022	Mateusz Guzik <mjg@FreeBSD.org>	Revert "vm: use atomic fetchadd in vm_page_sunbusy" This reverts commit f6ffed44a8eb5d1ab89a18e60fb056aab2105be7. fetchadd will fail the waiters flag, which can cause other code to wait when it should not with nothing clear it Revert until I sort this out. Reported by: markj
# f6ffed44	05-Aug-2022	Mateusz Guzik <mjg@FreeBSD.org>	vm: use atomic fetchadd in vm_page_sunbusy This also fixes a bug where not-last unbusy failed to post a release fence. Reviewed by: markj (previous version), kib (previous version) Differential Revision: https://reviews.freebsd.org/D36084
# c84c5e00	18-Jul-2022	Mitchell Horne <mhorne@FreeBSD.org>	ddb: annotate some commands with DB_CMD_MEMSAFE This is not completely exhaustive, but covers a large majority of commands in the tree. Reviewed by: markj Sponsored by: Juniper Networks, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D35583
# 0cb2610e	16-Jul-2022	Mark Johnston <markj@FreeBSD.org>	vm: Remove handling for OBJT_DEFAULT objects Now that OBJT_DEFAULT objects can't be instantiated, we can simplify checks of the form object->type == OBJT_DEFAULT \|\| (object->flags & OBJ_SWAP) != 0. No functional change intended. Reviewed by: alc, kib Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D35788
# f77a88c8	03-Jun-2022	Gordon Bergling <gbe@FreeBSD.org>	vm_page: Fix a typo in a source code comment - s/consistancy/consistency/ MFC after: 3 days
# 6fb7c42d	14-Apr-2022	Mark Johnston <markj@FreeBSD.org>	vm: Move the "vm_wait in early boot" assertion to the proper place The assertion was added in commit 1771e987ca6a. After that, vm_wait() and friends were refactored such that the actual sleep happens elsewhere. Now the assertion condition is not checked when vm_wait_doms() is called directly, and it is checked even if we are not going to sleep (because vm_page_count_min_set(wdoms) is false). Reviewed by: alc, kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D34909
# b8ebd99a	13-Apr-2022	John Baldwin <jhb@FreeBSD.org>	vm: Use __diagused for variables only used in KASSERT().
# c25a30e2	05-Jan-2022	Konstantin Belousov <kib@FreeBSD.org>	Dump page tracking no longer needed on mips Reviewed by: imp Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D33763
# c606ab59	30-Dec-2021	Doug Moore <dougm@FreeBSD.org>	vm_extern: use standard address checkers everywhere Define simple functions for alignment and boundary checks and use them everywhere instead of having slightly different implementations scattered about. Define them in vm_extern.h and use them where possible where vm_extern.h is included. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D33685
# 4bae154f	24-Dec-2021	Doug Moore <dougm@FreeBSD.org>	vm_page: Move a comment fb38b29b5609 (page_alloc_br) vm_page: Remove extra test, dup code from page alloc should have moved a comment block when it moved the function call that followed it. Move the comment block now.
# 0d5fac28	23-Dec-2021	Doug Moore <dougm@FreeBSD.org>	vm: alloc pages from reserv before breaking it Function vm_reserv_reclaim_contig breaks a reservation with enough free space to satisfy an allocation request and returns the free space to the buddy allocator. Change the function to allocate the request memory from the reservation before breaking it, and return that memory to the caller. That avoids a second call to the buddy allocator and guarantees successful allocation after breaking the reservation, where that success is not currently guaranteed. Reviewed by: alc, kib (previous version) Differential Revision: https://reviews.freebsd.org/D33644
# 184c63db	24-Dec-2021	Doug Moore <dougm@FreeBSD.org>	Fix clerical error in page alloc Fix a very recent change that introduced a page accounting error in case of a reserveration being broken. Reviewed by: alc Fixes: fb38b29b5609 (page_alloc_br) vm_page: Remove extra test, dup code from page alloc Differential Revision: https://reviews.freebsd.org/D33645
# fb38b29b	23-Dec-2021	Doug Moore <dougm@FreeBSD.org>	vm_page: Remove extra test, dup code from page alloc Extract code common to functions vm_page_alloc_contig_domain and vm_page_alloc_noobj_contig_domain into a new function. Do so in a way that eliminates a bound-to-fail reservation test after a reservation is broken by a call from vm_page_alloc_contig_domain. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D33551
# 39a7396f	05-Dec-2021	Mark Johnston <markj@FreeBSD.org>	vm_page: Tighten the object lock assertion in vm_page_invalid() A page must not become invalid while vm_fault_soft_fast() is attempting to map unbusied pages for reading. Note that all callers hold the object write lock already, and vm_page_set_invalid() asserts the object write lock. Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D33250
# 87b64663	15-Nov-2021	Mark Johnston <markj@FreeBSD.org>	vm_page: Consolidate page busy sleep mechanisms - Modify vm_page_busy_sleep() and vm_page_busy_sleep_unlocked() to take a VM_ALLOC_* flag indicating whether to sleep on shared-busy, and fix up callers. - Modify vm_page_busy_sleep() to return a status indicating whether the object lock was dropped, and fix up callers. - Convert callers of vm_page_sleep_if_busy() to use vm_page_busy_sleep() instead. - Remove vm_page_sleep_if_(x)busy(). No functional change intended. Obtained from: jeff (object_concurrency patches) Reviewed by: kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D32947
# e4bdb685	11-Nov-2021	Mark Johnston <markj@FreeBSD.org>	vm_page: Handle VM_ALLOC_NORECLAIM in the contiguous page allocator We added _NORECLAIM to request that kmem_alloc_contig_pages() not spend time scanning physical memory for candidates to reclaim. In some situations the scanning can induce large amounts of undesirable latency, and it's less important that the request be satisfied than it is that we not spend many milliseconds scanning. The problem extends to vm_reserv_reclaim_contig(), which unlike vm_reserv_reclaim() may have to scan the entire list of partially populated reservations. Use VM_ALLOC_NORECLAIM to request that this scan not be executed.[1] As a side effect, this fixes a regression in 02fb0585e7b3 ("vm_page: Drop handling of VM_ALLOC_NOOBJ in vm_page_alloc_contig_domain()") where VM_ALLOC_CONTIG was not included in VPAC_FLAGS or VPANC_FLAGS even though it is not masked by kmem_alloc_contig_pages().[2] Reported by: gallatin [1], glebius [2] Reviewed by: alc, glebius, kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D32899
# d7acbe48	21-Oct-2021	Mark Johnston <markj@FreeBSD.org>	vm_page: Break reservations to handle noobj allocations vm_reserv_reclaim_*() will release pages to the default freepool, not the direct freepool from which noobj allocations are drawn. But if both pools are empty, the noobj allocator variants must break reservations to make progress. Reported by: cy Reviewed by: kib (previous version) Fixes: b498f71bc56a ("vm_page: Add a new page allocator interface for unnamed pages") Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D32592
# 02fb0585	19-Oct-2021	Mark Johnston <markj@FreeBSD.org>	vm_page: Drop handling of VM_ALLOC_NOOBJ in vm_page_alloc_contig_domain() As in vm_page_alloc_domain_after(), unconditionally preserve PG_ZERO. Implement vm_page_alloc_noobj_contig_domain(). Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D32034
# c40cf9bc	19-Oct-2021	Mark Johnston <markj@FreeBSD.org>	vm_page: Stop handling VM_ALLOC_NOOBJ in vm_page_alloc_domain_after() This makes the allocator simpler since it can assume object != NULL. Also modify the function to unconditionally preserve PG_ZERO, so VM_ALLOC_ZERO is effectively ignored (and still must be implemented by the caller for now). Reviewed by: alc, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D32033
# 84c39222	19-Oct-2021	Mark Johnston <markj@FreeBSD.org>	Convert consumers to vm_page_alloc_noobj_contig() Remove now-unneeded page zeroing. No functional change intended. Reviewed by: alc, hselasky, kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D32006
# 92db9f3b	19-Oct-2021	Mark Johnston <markj@FreeBSD.org>	Introduce vm_page_alloc_noobj_contig() This is the same as vm_page_alloc_noobj(), but allocates physically contiguous runs of memory. For now it is implemented in terms of vm_page_alloc_contig(), with the difference that vm_page_alloc_noobj_contig() implements VM_ALLOC_ZERO by zeroing the page. Reviewed by: alc, kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D32005
# a4667e09	19-Oct-2021	Mark Johnston <markj@FreeBSD.org>	Convert vm_page_alloc() callers to use vm_page_alloc_noobj(). Remove page zeroing code from consumers and stop specifying VM_ALLOC_NOOBJ. In a few places, also convert an allocation loop to simply use VM_ALLOC_WAITOK. Similarly, convert vm_page_alloc_domain() callers. Note that callers are now responsible for assigning the pindex. Reviewed by: alc, hselasky, kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31986
# b498f71b	19-Oct-2021	Mark Johnston <markj@FreeBSD.org>	vm_page: Add a new page allocator interface for unnamed pages The diff adds vm_page_alloc_noobj() and vm_page_alloc_noobj_domain(). These mostly correspond to vm_page_alloc() and vm_page_alloc_domain() when no VM object is specified, with the exception that they handle VM_ALLOC_ZERO by zeroing the page, rather than by preserving PG_ZERO. This simplifies callers and will permit simplification of the vm_page_alloc_domain() definition. Since the new allocator variant is similar to vm_page_alloc_freelist(), implement both of them using a common backend allocator function. No functional change intended. Reviewed by: alc, kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31985
# a23e6a10	19-Oct-2021	Mark Johnston <markj@FreeBSD.org>	vm_page: Move vm_page_alloc_check() to after page allocator definitions This way all of the vm_page_alloc_*() allocator functions are grouped together. MFC after: 1 week Sponsored by: The FreeBSD Foundation
# bd3a6680	17-Sep-2021	Konstantin Belousov <kib@FreeBSD.org>	vm_page_startup: correct calculation of the starting page Also avoid unneded calculations when phys segment end is the phys_avail[] start. Submitted by: alc Reviewed by: markj MFC after: 1 week Fixes: 181bfb42fd01bfa9f46 Differential revision: https://reviews.freebsd.org/D32009
# 181bfb42	14-Sep-2021	Konstantin Belousov <kib@FreeBSD.org>	vm_phys: do not ignore phys_avail[] segments that do not fit completely into vm_phys segments If phys_avail[] segment only intersect with some vm_phys segment, add pages from it to the free list that belong to the given vm_phys_seg, instead of dropping them. The vm_phys segments are generally result of subdivision of phys_avail segments, for instance DMA32 or LOWMEM boundaries split them. On amd64, after UEFI in-place kernel activation (copy_staging disable) was enabled, we typically have a large phys_avail[] segment below 4G which crosses LOWMEM (1M) boundary. With the current way of requiring phys_avail[] fully fit into vm_phys_seg, this memory was ignored. Reported by: madpilot Reviewed by: markj Discussed with: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D31958
# eccb516d	22-Aug-2021	Bjoern A. Zeeb <bz@FreeBSD.org>	vm: use __func__ for the correct function name In fee2a2fa39834d8d5eaa981298fce9d2ed31546d the KASSERTs in vm_page_unwire_noq() changed from "vm_page_unwire" to "vm_page_unref". While the former no longer was part of that function the latter does not exist as a function and is highly confusing when hit when using tools to lookup the functions and not doing a full-text search. Use %s __func__ for printing the function name, as that will do the right thing as code moves around and functions get renamed. Hit: while debugging a wired page leak with linuxkpi/iwlwifi Sponsored by: The FreeBSD Foundation Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D31635
# 041b7317	10-Jul-2021	Konstantin Belousov <kib@FreeBSD.org>	Add pmap_vm_page_alloc_check() which is the place to put MD asserts about allocated pages. On amd64, verify that allocated page does not belong to the kernel (text, data) or early allocated pages. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D31121
# 5b10e79e	17-Jun-2021	Konstantin Belousov <kib@FreeBSD.org>	Un-staticise vm_page_init_page() Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30785
# 4b8365d7	30-Apr-2021	Konstantin Belousov <kib@FreeBSD.org>	Add OBJT_SWAP_TMPFS pager This is OBJT_SWAP pager, specialized for tmpfs. Right now, both swap pager and generic vm code have to explicitly handle swap objects which are tmpfs vnode v_object, in the special ways. Replace (almost) all such places with proper methods. Since VM still needs a notion of the 'swap object', regardless of its use, add yet another type-classification flag OBJ_SWAP. Set it in vm_object_allocate() where other type-class flags are set. This change almost completely eliminates the knowledge of tmpfs from VM, and opens a way to make OBJT_SWAP_TMPFS loadable from tmpfs.ko. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30070
# 04019892	02-Mar-2021	Mark Johnston <markj@FreeBSD.org>	vm: Round up npages and alignment for contig reclamation When searching for runs to reclaim, we need to ensure that the entire run will be added to the buddy allocator as a single unit. Otherwise, it will not be visible to vm_phys_alloc_contig() as it is currently implemented. This is a problem for allocation requests that are not a power of 2 in size, as with 9KB jumbo mbuf clusters. Reported by: alc Reviewed by: alc MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28924
# 14b5a3c7	24-Feb-2021	Max Laier <mlaier@FreeBSD.org>	vm pqbatch: move unmanaged page assert under pagequeue lock This KASSERT is overzealous because of the following race condition: 1) A managed page which is currently in PQ_LAUNDRY is freed. vm_page_free_prep calls vm_page_dequeue_deferred() The page state is: PQ_LAUNDRY, PGA_DEQUEUE\|PGA_ENQUEUED 2) The laundry worker comes around and pick up the page and calls vm_pageout_defer(m, PQ_LAUNDRY, true) to check if page is still in the queue. We do a vm_page_astate_load and get PQ_LAUNDRY, PGA_DEQUEUE\|PGA_ENQUEUED as per above. 3) The laundry worker is pre-empted and another thread allocates our page from the free pool. For example vm_page_alloc_domain_after calls vm_page_dequeue() and sets VPO_UNMANAGED because we are allocating for an OBJT_UNMANAGED object. The page state is: PQ_NONE, 0 - VPO_UNMANAGED 4) The laundry worker resumes, and processes vm_pageout_defer based on the stale astate which leads to a call to vm_page_pqbatch_submit, which will trip on the KASSERT. Submitted by: mlaier Reviewed by: markj, rlibby Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D28563
# 5c18744e	10-Feb-2021	Mark Johnston <markj@FreeBSD.org>	vm: Honour the "noreuse" flag to vm_page_unwire_managed() This flag indicates that the page should be enqueued near the head of the inactive queue, skipping the LRU queue. It is used when unwiring pages from the buffer cache following direct I/O or after I/O when POSIX_FADV_NOREUSE or _DONTNEED advice was specified, or when sendfile(SF_NOCACHE) completes. For the direct I/O and sendfile cases we only enqueue the page if we decide not to free it, typically because it's mapped. Pass "noreuse" through to vm_page_release_toq() so that we actually honour the desired LRU policy for these scenarios. Reported by: bdrewery Reviewed by: alc, kib MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D28555
# 81846def	27-Dec-2020	Mark Johnston <markj@FreeBSD.org>	vm: Fix some bugs in the page busying code In vm_page_busy_acquire(), load the object pointer using atomic_load_ptr() as we do elsewhere. Per the comment, the object identity must be consistent across sleeps. In vm_page_grab_sleep(), pass the correct pindex to _vm_page_busy_sleep(). The pindex is used to re-check the page's identity before going to sleep. In particular, vm_page_grab_sleep() is used in unlocked grab, so the object lock is not necessarily held when verifying the page's identity, and the pindex may change if the page is moved, or freed and re-allocated. I believe this can result in spurious VM_PAGER_FAILs from vm_page_grab_valid_unlocked() or early termination of vm_page_grab_pages_unlocked(). In vm_page_grab_pages(), pass the correct pindex to vm_page_grab_sleep(). Otherwise I believe vm_page_grab_pages() will effectively spin when attempting to busy a busy page after the first index in the range. Reviewed by: alc, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27607
# 1fea4b25	19-Nov-2020	Mark Johnston <markj@FreeBSD.org>	Wrap a long line in vm_pqbatch_process_page()
# 9e3e7376	19-Nov-2020	Mark Johnston <markj@FreeBSD.org>	Micro-optimize vm_page_pqbatch_submit() Avoid calling vm_page_domain() twice. Discussed with: alc (in D27207)
# 431fb8ab	18-Nov-2020	Mark Johnston <markj@FreeBSD.org>	vm_phys: Try to clean up NUMA KPIs It can useful for code outside the VM system to look up the NUMA domain of a page backing a virtual or physical address, specifically when creating NUMA-aware data structures. We have _vm_phys_domain() for this, but the leading underscore implies that it's an internal function, and vm_phys.h has dependencies on a number of other headers. Rename vm_phys_domain() to vm_page_domain(), and _vm_phys_domain() to vm_phys_domain(). Make the latter an inline function. Add _vm_phys.h and define struct vm_phys_seg there so that it's easier to use in other headers. Include it from vm_page.h so that vm_page_domain() can be defined there. Include machine/vmparam.h from _vm_phys.h since it depends directly on some constants defined there. Reviewed by: alc Reviewed by: dougm, kib (earlier versions) Differential Revision: https://reviews.freebsd.org/D27207
# 6f3b523c	14-Oct-2020	Konstantin Belousov <kib@FreeBSD.org>	Avoid dump_avail[] redefinition. Move dump_avail[] extern declaration and inlines into a new header vm/vm_dumpset.h. This fixes default gcc build for mips. Reviewed by: alc, scottph Tested by: kevans (previous version) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D26741
# c2c6fb90	09-Oct-2020	Bryan Drewery <bdrewery@FreeBSD.org>	Use unlocked page lookup for inmem() to avoid object lock contention Reviewed By: kib, markj Submitted by: mlaier Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D26653
# 00e66147	21-Sep-2020	D Scott Phillips <scottph@FreeBSD.org>	Sparsify the vm_page_dump bitmap On Ampere Altra systems, the sparse population of RAM within the physical address space causes the vm_page_dump bitmap to be much larger than necessary, increasing the size from ~8 Mib to > 2 Gib (and overflowing `int` for the size). Changing the page dump bitmap also changes the minidump file format, so changes are also necessary in libkvm. Reviewed by: jhb Approved by: scottl (implicit) MFC after: 1 week Sponsored by: Ampere Computing, Inc. Differential Revision: https://reviews.freebsd.org/D26131
# ab041f71	21-Sep-2020	D Scott Phillips <scottph@FreeBSD.org>	Move vm_page_dump bitset array definition to MI code These definitions were repeated by all architectures, with small variations. Consolidate the common definitons in machine independent code and use bitset(9) macros for manipulation. Many opportunities for deduplication remain in the machine dependent minidump logic. The only intended functional change is increasing the bit index type to vm_pindex_t, allowing the indexing of pages with address of 8 TiB and greater. Reviewed by: kib, markj Approved by: scottl (implicit) MFC after: 1 week Sponsored by: Ampere Computing, Inc. Differential Revision: https://reviews.freebsd.org/D26129
# 89d2fb14	08-Sep-2020	Konstantin Belousov <kib@FreeBSD.org>	Add interruptible variant of vm_wait(9), vm_wait_intr(9). Also add msleep flags argument to vm_wait_doms(9). Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24652
# a2d704d1	02-Sep-2020	Mark Johnston <markj@FreeBSD.org>	Avoid unnecessary object locking in vm_page_grab_pages_unlocked(). We were needlessly acquiring the object lock to call vm_page_grab_pages() even when all of the requested pages were looked up locklessly. Fix that, stop testing for count == 0 in vm_page_grab_pages(), and add assertions to help catch this kind of mistake. Reported by: cem Reviewed by: alc, cem, dougm, jeff Differential Revision: https://reviews.freebsd.org/D26304
# 411096d0	25-Aug-2020	Mark Johnston <markj@FreeBSD.org>	Permit vm_page_wire() to be called on pages not belonging to an object. For such pages ref_count is effectively a consumer-managed field, but there is no harm in calling vm_page_wire() on them. vm_page_unwire_noq() handles them as well. Relax the vm_page_wire() assertions to permit this case which is triggered by some out-of-tree code. [1] Also guard a conditional assertion with INVARIANTS. Otherwise the conditions are evaluated even though the result is unused. [2] Reported by: bz, cem [1], kib [2] Reviewed by: dougm, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26173
# b7883452	11-Aug-2020	Conrad Meyer <cem@FreeBSD.org>	Back out unrelated change Reported by: kib, markj X-MFC-With: r364129
# 0292c54b	11-Aug-2020	Conrad Meyer <cem@FreeBSD.org>	Add support for multithreading the inactive queue pageout within a domain. In very high throughput workloads, the inactive scan can become overwhelmed as you have many cores producing pages and a single core freeing. Since Mark's introduction of batched pagequeue operations, we can now run multiple inactive threads working on independent batches. To avoid confusing the pid and other control algorithms, I (Jeff) do this in a mpi-like fan out and collect model that is driven from the primary page daemon. It decides whether the shortfall can be overcome with a single thread and if not dispatches multiple threads and waits for their results. The heuristic is based on timing the pageout activity and averaging a pages-per-second variable which is exponentially decayed. This is visible in sysctl and may be interesting for other purposes. I (Jeff) have verified that this does indeed double our paging throughput when used with two threads. With four we tend to run into other contention problems. For now I would like to commit this infrastructure with only a single thread enabled. The number of worker threads per domain can be controlled with the 'vm.pageout_threads_per_domain' tunable. Submitted by: jeff (earlier version) Discussed with: markj Tested by: pho Sponsored by: probably Netflix (based on contemporary commits) Differential Revision: https://reviews.freebsd.org/D21629
# efec381d	04-Aug-2020	Mark Johnston <markj@FreeBSD.org>	Remove most lingering references to the page lock in comments. Finish updating comments to reflect new locking protocols introduced over the past year. In particular, vm_page_lock is now effectively unused. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25868
# 958d8f52	29-Jul-2020	Mark Johnston <markj@FreeBSD.org>	Remove the volatile qualifier from busy_lock. Use atomic(9) to load the lock state. Some places were doing this already, so it was inconsistent. In initialization code, the lock state is still initialized with plain stores. Reviewed by: alc, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25861
# 782ebde5	27-Jul-2020	Mark Johnston <markj@FreeBSD.org>	vm_page_free_invalid(): Relax the xbusy assertion. vm_page_assert_xbusied() asserts that the busying thread is the current thread. For some uses of vm_page_free_invalid() (e.g., error handling in vnode_pager_generic_getpages_done()), this condition might not hold. Reported by: Jenkins via trasz Reviewed by: chs, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25828
# 4dfa06e1	17-Jul-2020	Chuck Silvers <chs@FreeBSD.org>	Add a new function vm_page_free_invalid() for freeing invalid pages that might be wired. If the page is wired then it cannot be freed now, but the thread that eventually unwires it will free it at that point. Reviewed by: markj, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D25430
# ffc568ba	09-Jul-2020	Scott Long <scottl@FreeBSD.org>	Revert r362998, r326999 while a better compatibility strategy is devised.
# b302c2e5	07-Jul-2020	Scott Long <scottl@FreeBSD.org>	Migrate the feature of excluding RAM pages to use "excludelist" as its nomenclature. MFC after: 1 week
# ee06cffc	26-Jun-2020	Konstantin Belousov <kib@FreeBSD.org>	vm_page_free_prep(): correct description of the required page and object state. Reviewed by: markj Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D25482
# a9ea09e5	28-Apr-2020	Mark Johnston <markj@FreeBSD.org>	Re-check for wirings after busying the page in vm_page_release_locked(). A concurrent unlocked lookup can wire the page after vm_page_release_locked() releases the last wiring, in which case vm_page_release_locked() must not free the page. Once the xbusy lock is acquired, that, the object lock and the fact that the page is unmapped ensure that the wire count cannot increase, so re-check for new wirings after the page is xbusied. Update the comment above vm_page_wired() to reflect the new synchronization rules. Reported by: glebius Reviewed by: alc, jeff, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D24592
# 70e68b19	20-Apr-2020	Mark Johnston <markj@FreeBSD.org>	Handle trashed queue pointers in vm_page_acquire_unlocked(). vm_page_acquire_unlocked() relies on type-stability of vm_page structures and assumes that the listq linkage pointers always point to a vm_page or are NULL. QUEUE_MACRO_DEBUG_TRASH breaks that assumption, so add an explicit check for a trashed queue pointer before dereferencing. Reported and tested by: pho Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D24472
# adc03881	30-Mar-2020	Bryan Drewery <bdrewery@FreeBSD.org>	Remove dead code leftover from r331018. Sponsored by: Dell EMC
# a7c55b3e	27-Mar-2020	Konstantin Belousov <kib@FreeBSD.org>	ddb show pginfo: print pages reference value in hex. It is more useful this way after the VPRC_ flags were introduced. Sponsored by: The FreeBSD Foundation
# d1105e94	11-Mar-2020	Jeff Roberson <jeff@FreeBSD.org>	Check for busy or wired in vm_page_relookup(). Some callers will only keep a page wired and expect it to still be present. Reported by: delphij@FreeBSD.org Reviewed by: kib
# d869a17e	06-Mar-2020	Mark Johnston <markj@FreeBSD.org>	Use COUNTER_U64_DEFINE_EARLY() in places where it simplifies things. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23978
# 1ed42f6f	01-Mar-2020	Mark Johnston <markj@FreeBSD.org>	Avoid doubly wiring a newly allocated page in vm_page_grab_valid(). This fixes a regression from r358363. Reported by: manu, jbeich Tested by: jbeich
# 6be21eb7	28-Feb-2020	Jeff Roberson <jeff@FreeBSD.org>	Provide a lock free alternative to resolve bogus pages. This is not likely to be much of a perf win, just a nice code simplification. Reviewed by: markj, kib Differential Revision: https://reviews.freebsd.org/D23866
# 3f39f80a	28-Feb-2020	Jeff Roberson <jeff@FreeBSD.org>	Support the NOCREAT flag for grab_valid_unlocked. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D23865
# c49be4f1	26-Feb-2020	Jeff Roberson <jeff@FreeBSD.org>	Add unlocked grab* function variants that use lockless radix code to lookup pages. These variants will fall back to their locked counterparts if the page is not present. Discussed with: kib, markj Differential Revision: https://reviews.freebsd.org/D23449
# 7029da5c	26-Feb-2020	Pawel Biernacki <kaktus@FreeBSD.org>	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718
# eaa17d42	22-Feb-2020	Ryan Libby <rlibby@FreeBSD.org>	sys/vm: quiet -Wwrite-strings Discussed with: kib Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D23796
# 6c5f36ff	19-Feb-2020	Jeff Roberson <jeff@FreeBSD.org>	Eliminate some unnecessary uses of UMA_ZONE_VM. Only zones involved in virtual address or physical page allocation need to be marked with this flag. Reviewed by: markj Tested by: pho Differential Revision: https://reviews.freebsd.org/D23712
# f212367b	16-Feb-2020	Jeff Roberson <jeff@FreeBSD.org>	Refactor _vm_page_busy_sleep to reduce the delta between the various sleep routines and introduce a variant that supports lockless sleep. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23612
# 23ed568c	14-Feb-2020	Mateusz Guzik <mjg@FreeBSD.org>	vm: remove no longer needed atomic_load_ptr casts
# 06ef6052	13-Feb-2020	Mark Johnston <markj@FreeBSD.org>	Fix handling of WAITFAIL in vm_page_grab() and vm_page_grab_pages(). After sleeping through a memory shortage, we must return NULL rather than retry. Discussed with: jeff Reported by: pho Sponsored by: The FreeBSD Foundation
# ee9e43f8	04-Feb-2020	Jeff Roberson <jeff@FreeBSD.org>	Add an explicit busy state for free pages. This improves behavior with potential bugs that access freed pages as well as providing a path towards lockless page lookup. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23444
# f0a273c0	01-Feb-2020	Mark Johnston <markj@FreeBSD.org>	Remove a couple of lingering usages of the page lock. Update vm_page_scan_contig() and vm_page_reclaim_run() to stop using vm_page_change_lock(). It has no use after r356157. Remove vm_page_change_lock() now that it has no users. Remove an unncessary check for wirings in vm_page_scan_contig(), which was previously checking twice. The check is racy until vm_page_reclaim_run() ensures that the page is unmapped, so one check is sufficient. Reviewed by: jeff, kib (previous versions) Tested by: pho (previous version) Differential Revision: https://reviews.freebsd.org/D23279
# d6e13f3b	19-Jan-2020	Jeff Roberson <jeff@FreeBSD.org>	Don't hold the object lock while calling getpages. The vnode pager does not want the object lock held. Moving this out allows further object lock scope reduction in callers. While here add some missing paging in progress calls and an assert. The object handle is now protected explicitly with pip. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23033
# a81c400e	15-Jan-2020	Jeff Roberson <jeff@FreeBSD.org>	Simplify VM and UMA startup by eliminating boot pages. Instead use careful ordering to allocate early pages in the same way boot pages were but only as needed. After the KVA allocator has started up we allocate the KVA that we consumed during boot. This also makes the boot pages freeable since they have vm_page structures allocated with the rest of memory. Parts of this patch were written and tested by markj. Reviewed by: glebius, markj Differential Revision: https://reviews.freebsd.org/D23102
# 9328cbc0	10-Jan-2020	Gleb Smirnoff <glebius@FreeBSD.org>	Always multiple vm.pgcache_zone_max to number of CPUs, and rename it respectively. The tunable controls how big is the size of per-cpu vm page cache. Previously the value was split for all CPUs in system, so configuring same value on machines with different count of CPUs yielded in different cache size available to a particular CPU. Reviewed by: markj Obtained from: Netflix
# 79c9f942	05-Jan-2020	Jeff Roberson <jeff@FreeBSD.org>	Fix uma boot pages calculations on NUMA machines that also don't have MD_UMA_SMALL_ALLOC. This is unusual but not impossible. Fix the alignemnt of zones while here. This was already correct because uz_cpu strongly aligned the zone structure but the specified alignment did not match reality and involved redundant defines. Reviewed by: markj, rlibby Differential Revision: https://reviews.freebsd.org/D23046
# 758b2c02	29-Dec-2019	Mark Johnston <markj@FreeBSD.org>	Restore a vm_page_wired() check in vm_page_mvqueue() after r356156. We now set PGA_DEQUEUE on a managed page when it is wired after allocation, and vm_page_mvqueue() ignores pages with this flag set, ensuring that they do not end up in the page queues. However, this is not sufficient for managed fictitious pages or pages managed by the TTM. In particular, the TTM makes use of the plinks.q queue linkage fields for its own purposes. PR: 242961 Reported and tested by: Greg V <greg@unrelenting.technology>
# 9b888dd9	29-Dec-2019	Mark Johnston <markj@FreeBSD.org>	Clear queue op flags in vm_page_mvqueue(). This fixes a regression in r356155, introduced at the last minute. In particular, we must clear PGA_REQUEUE_HEAD before inserting into any queue besides PQ_INACTIVE since that operation is implemented only for PQ_INACTIVE. Reported by: pho, Jenkins via lwhsu
# 727150ff	28-Dec-2019	Mark Johnston <markj@FreeBSD.org>	Remove some unused functions. The previous series of patches orphaned some vm_page functions, so remove them. Reviewed by: dougm, kib Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22886
# 9f5632e6	28-Dec-2019	Mark Johnston <markj@FreeBSD.org>	Remove page locking for queue operations. With the previous reviews, the page lock is no longer required in order to perform queue operations on a page. It is also no longer needed in the page queue scans. This change effectively eliminates remaining uses of the page lock and also the false sharing caused by multiple pages sharing a page lock. Reviewed by: jeff Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22885
# b7f30bff	28-Dec-2019	Mark Johnston <markj@FreeBSD.org>	Generalize lazy dequeue logic for wired pages. Some recent work aims to remove the use of the page lock for synchronizing updates to page queue state. This change adds a mechanism to preserve the existing behaviour of lazily dequeuing wired pages, which was previously synchronized using the page lock. Handle this by setting PGA_DEQUEUE when a managed page's wire count transitions from 0 to 1. When the page daemon encounters a page with a flag in PGA_QUEUE_OP_MASK set, it creates a batch queue entry for that page, but in so doing it does not modify the page itself and thus racing with a concurrent free of the page is harmless. The flag is advisory; the page daemon still checks for wirings after acquiring the object and page xbusy locks. vm_page_unwire_managed() now clears PGA_DEQUEUE on a 1->0 transition. It must do this before dropping the reference to avoid a use-after-free but also handles races with concurrent wirings to ensure that PGA_DEQUEUE is not left unset on a wired page. Reviewed by: jeff Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22882
# f3f38e25	28-Dec-2019	Mark Johnston <markj@FreeBSD.org>	Start implementing queue state updates using fcmpset loops. This is in preparation for eliminating the use of the vm_page lock for protecting queue state operations. Introduce the vm_page_pqstate_commit_*() functions. These functions act as helpers around vm_page_astate_fcmpset() and are specialized for specific types of operations. vm_page_pqstate_commit() wraps these functions. Convert a number of routines to use these new helpers. Use vm_page_release_toq() in vm_page_unwire() and vm_page_release() to atomically release a wiring reference and release the page into a queue. This has the side effect that vm_page_unwire() will leave the page in the active queue if it is already present there. Convert the page queue scans to use the new helpers. Simplify vm_pageout_reinsert_inactive(), which requeues pages that were found to be busy during an inactive queue scan, to avoid duplicating the work of vm_pqbatch_process_page(). In particular, if PGA_REQUEUE or PGA_REQUEUE_HEAD is set, let that be handled during batch processing. Reviewed by: jeff Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22770 Differential Revision: https://reviews.freebsd.org/D22771 Differential Revision: https://reviews.freebsd.org/D22772 Differential Revision: https://reviews.freebsd.org/D22773 Differential Revision: https://reviews.freebsd.org/D22776
# 5541eb27	27-Dec-2019	Mark Johnston <markj@FreeBSD.org>	Remove some stale comments from the page allocator. Since r352110 the page lock is not required to wire pages in any context.
# ff5ce8a7	26-Dec-2019	Jeff Roberson <jeff@FreeBSD.org>	Fix a pair of bugs introduced in r356002. When we reclaim physical pages we allocate them with VM_ALLOC_NOOBJ which means they are not busy. For now move the busy assert for the new page in vm_page_replace into the public api and out of the private api used by contig reclaim. Fix another issue where we would leak busy if the page could not be removed from pmap. Reported by: pho Discussed with: markj
# 7e1b379e	24-Dec-2019	Jeff Roberson <jeff@FreeBSD.org>	Don't unnecessarily relock the vm object after sleeps. This results in a surprising amount of object contention on loop restarts in fault. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D22821
# 3cf3b4e6	21-Dec-2019	Jeff Roberson <jeff@FreeBSD.org>	Make page busy state deterministic on free. Pages must be xbusy when removed from objects including calls to free. Pages must not be xbusy when freed and not on an object. Strengthen assertions to match these expectations. In practice very little code had to change busy handling to meet these rules but we can now make stronger guarantees to busy holders and avoid conditionally dropping busy in free. Refine vm_page_remove() and vm_page_replace() semantics now that we have stronger guarantees about busy state. This removes redundant and potentially problematic code that has proliferated. Discussed with: markj Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D22822
# d07c5718	21-Dec-2019	Mark Johnston <markj@FreeBSD.org>	Fix VPO_UNMANAGED handling in vm_page_reclaim_run() after r353540. When allocating a replacement page we must clear VPO_UNMANAGED since we only ever reclaim pages from managed objects. vm_page_replace() does not handle this for us. Sprinkle some assertions to help catch this sort of issue. Reported by: pho Reviewed by: alc, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22868
# a8081778	14-Dec-2019	Jeff Roberson <jeff@FreeBSD.org>	Add a deferred free mechanism for freeing swap space that does not require an exclusive object lock. Previously swap space was freed on a best effort basis when a page that had valid swap was dirtied, thus invalidating the swap copy. This may be done inconsistently and requires the object lock which is not always convenient. Instead, track when swap space is present. The first dirty is responsible for deleting space or setting PGA_SWAP_FREE which will trigger background scans to free the swap space. Simplify the locking in vm_fault_dirty() now that we can reliably identify the first dirty. Discussed with: alc, kib, markj Differential Revision: https://reviews.freebsd.org/D22654
# af009714	14-Dec-2019	Jeff Roberson <jeff@FreeBSD.org>	Handle pagein clustering in vm_page_grab_valid() so that it can be used by exec_map_first_page(). This will also enable pagein clustering for other interested consumers (tmpfs, md, etc). Discussed with: alc Approved by: kib Differential Revision: https://reviews.freebsd.org/D22731
# 5cff1f4d	10-Dec-2019	Mark Johnston <markj@FreeBSD.org>	Introduce vm_page_astate. This is a 32-bit structure embedded in each vm_page, consisting mostly of page queue state. The use of a structure makes it easy to store a snapshot of a page's queue state in a stack variable and use cmpset loops to update that state without requiring the page lock. This change merely adds the structure and updates references to atomic state fields. No functional change intended. Reviewed by: alc, jeff, kib Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22650
# cff8481d	07-Dec-2019	Jeff Roberson <jeff@FreeBSD.org>	It is safe to wire a page while the object is busy. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D22636
# 2306558c	07-Dec-2019	Jeff Roberson <jeff@FreeBSD.org>	It is now safe to rename a page that is still on a queue. Allowing this is necessary for a forthcoming patch. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D22636
# fb1d575c	07-Dec-2019	Jeff Roberson <jeff@FreeBSD.org>	Reduce duplication in grab functions by providing allocflags based inlines. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D22635
# caef3e12	06-Dec-2019	Justin Hibbits <jhibbits@FreeBSD.org>	powerpc/pmap: NUMA-ize vm_page_array on powerpc Summary: This matches r351198 from amd64. This only applies to AIM64 and Book-E. On AIM64 it short-circuits with one domain, to behave similar to existing. Otherwise it will allocate 16MB huge pages to hold the page array, across all NUMA domains. On the first domain it will shift the page array base up, to "upper-align" the page array in that domain, so as to reduce the number of pages from the next domain appearing in this domain. After the first domain, subsequent domains will be allocated in full 16MB pages, until the final domain, which can be short. This means some inner domains may have pages accounted in earlier domains. On Book-E the page array is setup at MMU bootstrap time so that it's always mapped in TLB1, on both 32-bit and 64-bit. This reduces the TLB0 overhead for touching the vm_page_array, which reduces up to one TLB miss per array access. Since page_range (vm_page_startup()) is no longer used on Book-E but is on 32-bit AIM, mark the variable as potentially unused, rather than using a nasty #if defined() list. Reviewed by: luporl Differential Revision: https://reviews.freebsd.org/D21449
# 9b78b1f4	02-Dec-2019	Jeff Roberson <jeff@FreeBSD.org>	Use a precise bit count for the slab free items in UMA. This significantly shrinks embedded slab structures. Reviewed by: markj, rlibby (prior version) Differential Revision: https://reviews.freebsd.org/D22584
# b631c36f	24-Nov-2019	Konstantin Belousov <kib@FreeBSD.org>	Record part of the owner struct thread pointer into busy_lock. Record as much bits from curthread into busy_lock as fits. Low bits for struct thread * representation are zero due to struct and zone alignment, and they leave space for busy flags (perhaps except statically allocated thread0). Upper bits are not very interesting for assert, and in most practical situations recorded value should allow to manually identify the owner with certainity. Assert that unbusy is performed by the owner, except few places where unbusy is done in io completion handler. For this case, add _unchecked variants of asserts and unbusy primitives. Reviewed by: markj (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D22298
# bf0d60af	22-Nov-2019	Mark Johnston <markj@FreeBSD.org>	Update the checks in vm_page_zone_import(). - Remove the cnt == 1 check. UMA passes cnt == 1 when it has disabled per-CPU caching. In this case we might as well just allocate a single page and return it to the caller, since the caller is going to do exactly that anyway if the UMA cache allocation attempt fails. - Don't replenish caches if the domain is severely short on free pages. With large buckets we may otherwise quickly exacerbate a situation where the page daemon is failing to keep up. - Don't replenish caches if the calling thread belongs to the page daemon, which should avoid creating extra memory pressure when it is trying to free memory. Virtually all such allocations while occur in the context of laundering, where the laundry thread must allocate slabs for various swap and I/O-related UMA zones. Reviewed by: kib Discussed with: alc, jeff MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22394
# 003cf08b	22-Nov-2019	Mark Johnston <markj@FreeBSD.org>	Revise the page cache size policy. In r353734 the use of the page caches was limited to systems with a relatively large amount of RAM per CPU. This was to mitigate some issues reported with the system not able to keep up with memory pressure in cases where it had been able to do so prior to the addition of the direct free pool cache. This change re-enables those caches. The change modifies uma_zone_set_maxcache(), which was introduced specifically for the page cache zones. Rather than using it to limit only the full bucket cache, have it also set uz_count_max to provide an upper bound on the per-CPU cache size that is consistent with the number of items requested. Remove its return value since it has no use. Enable the page cache zones unconditionally, and limit them to 0.1% of the domain's pages. The limit can be overridden by the vm.pgcache_zone_max tunable as before. Change the item size parameter passed to uma_zcache_create() to the correct size, and stop setting UMA_ZONE_MAXBUCKET. This allows the page cache buckets to be adaptively sized, like the rest of UMA's caches. This also causes the initial bucket size to be small, so only systems which benefit from large caches will get them. Reviewed by: gallatin, jeff MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22393
# 09a65f9f	20-Nov-2019	Andrew Turner <andrew@FreeBSD.org>	As with r354905 use uint16_t to store aflags on the stack and as function arguments as the aflags size in vm_page_t has increased. Sponsored by: DARPA, AFRL
# ad216bc1	20-Nov-2019	Andrew Turner <andrew@FreeBSD.org>	Use atomic_load_16 to load aflags as it's a uint16_t after r354820. Sponsored by: DARPA, AFRL
# 7f935055	19-Nov-2019	Jeff Roberson <jeff@FreeBSD.org>	Remove unnecessary object locking from the vnode pager. Recent changes to busy/valid/dirty locking make these acquires redundant. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D22186
# 3a2ba997	18-Nov-2019	Mark Johnston <markj@FreeBSD.org>	Widen the vm_page aflags field to 16 bits. We are now out of aflags bits, whereas the "flags" field only makes use of five of its sixteen bits, so narrow "flags" to eight bits. I have no intention of adding a new aflag in the near future, but would like to combine the aflags, queue and act_count fields into a single atomically updated word. This will allow vm_page_pqstate_cmpset() to become much simpler and is a step towards eliminating the use of the page lock array in updating per-page queue state. The change modifies the layout of struct vm_page, so bump __FreeBSD_version. Reviewed by: alc, dougm, jeff, kib Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22397
# c95f8ed8	07-Nov-2019	Mark Johnston <markj@FreeBSD.org>	Drop Giant before sleeping on a busy page. Before the page busy code was converted to make direct use of sleepqueues, this was handled by _sleep(). Reported by: glebius Reviewed by: kib Sponsored by: The FreeBSD Foundation
# be801aaa	06-Nov-2019	Mark Johnston <markj@FreeBSD.org>	Fix a race in release_page(). Since r354156 we may call release_page() without the page's object lock held, specifically following the page copy during a CoW fault. release_page() must therefore unbusy the page only after scheduling the requeue, to avoid racing with a free of the page. Previously, the object lock prevented this race from occurring. Add some assertions that were helpful in tracking this down. Reported by: pho, syzkaller Tested by: pho Reviewed by: alc, jeff, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22234
# afa7e88a	30-Oct-2019	Konstantin Belousov <kib@FreeBSD.org>	vm_page_wire_mapped: explain why failure does not affect correctness. Reviewed by: markj (previous version) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D22196
# 67d0e293	29-Oct-2019	Jeff Roberson <jeff@FreeBSD.org>	Replace OBJ_MIGHTBEDIRTY with a system using atomics. Remove the TMPFS_DIRTY flag and use the same system. This enables further fault locking improvements by allowing more faults to proceed with a shared lock. Reviewed by: kib Tested by: pho Differential Revision: https://reviews.freebsd.org/D22116
# 0dc59d76	24-Oct-2019	Andrew Gallatin <gallatin@FreeBSD.org>	Add a tunable to set the pgcache zone's maxcache When it is set to 0 (the default), a heavy Netflix-style web workload suffers from heavy lock contention on the vm page free queue called from vm_page_zone_{import,release}() as the buckets are frequently drained. When setting the maxcache, this contention goes away. We should eventually try to autotune this, as well as make this zone eligable for uma_reclaim(). Reviewed by: alc, markj Not Objected to by: jeff Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D22112
# e6f1a580	23-Oct-2019	Mark Johnston <markj@FreeBSD.org>	Verify identity after checking for WAITFAIL in vm_page_busy_acquire(). A caller that does not guarantee that a page's identity won't change while sleeping for a busy lock must specify either NOWAIT or WAITFAIL. Reported by: syzkaller Reviewed by: alc, kib Discussed with: jeff Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22124
# d307bdcc	18-Oct-2019	Mark Johnston <markj@FreeBSD.org>	Further constrain the use of per-CPU caches for free pages. In low memory conditions a significant number of pages may end up stuck in the caches, and currently these caches cannot be reaped, leading to spurious memory allocation failures and OOM kills. So: - Take into account the fact that we may cache up to two full buckets of pages per CPU, not just one. - Increase the amount of RAM required per CPU to enable the caches. This is a temporary measure until the page cache management policy is improved. PR: 241048 Reported and tested by: Kevin Oberman <rkoberman@gmail.com> Reviewed by: alc, kib Discussed with: jeff MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22040
# 01cef4ca	16-Oct-2019	Mark Johnston <markj@FreeBSD.org>	Remove page locking from pmap_mincore(). After r352110 the page lock no longer protects a page's identity, so there is no purpose in locking the page in pmap_mincore(). Instead, if vm.mincore_mapped is set to the non-default value of 0, re-lookup the page after acquiring its object lock, which holds the page's identity stable. The change removes the last callers of vm_page_pa_tryrelock(), so remove it. Reviewed by: kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21823
# fff5403f	14-Oct-2019	Jeff Roberson <jeff@FreeBSD.org>	(5/6) Move the VPO_NOSYNC to PGA_NOSYNC to eliminate the dependency on the object lock in vm_page_set_validclean(). Reviewed by: kib, markj Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D21595
# 0012f373	14-Oct-2019	Jeff Roberson <jeff@FreeBSD.org>	(4/6) Protect page valid with the busy lock. Atomics are used for page busy and valid state when the shared busy is held. The details of the locking protocol and valid and dirty synchronization are in the updated vm_page.h comments. Reviewed by: kib, markj Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D21594
# 205be21d	14-Oct-2019	Jeff Roberson <jeff@FreeBSD.org>	(3/6) Add a shared object busy synchronization mechanism that blocks new page busy acquires while held. This allows code that would need to acquire and release a very large number of page busy locks to use the old mechanism where busy is only checked and not held. This comes at the cost of false positives but never false negatives which the single consumer, vm_fault_soft_fast(), handles. Reviewed by: kib Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D21592
# 8da1c098	14-Oct-2019	Jeff Roberson <jeff@FreeBSD.org>	(2/6) Don't release xbusy in vm_page_remove(), defer to vm_page_free_prep(). This persists busy state across operations like rename and replace. Reviewed by: kib, markj Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D21549
# 63e97555	14-Oct-2019	Jeff Roberson <jeff@FreeBSD.org>	(1/6) Replace busy checks with acquires where it is trival to do so. This is the first in a series of patches that promotes the page busy field to a first class lock that no longer requires the object lock for consistency. Reviewed by: kib, markj Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D21548
# 0ecc478b	14-Oct-2019	Leandro Lupori <luporl@FreeBSD.org>	[PPC64] Initial kernel minidump implementation Based on POWER9BSD implementation, with all POWER9 specific code removed and addition of new methods in PPC64 MMU interface, to isolate platform specific code. Currently, the new methods are implemented on pseries and PowerNV (D21643). Reviewed by: jhibbits Differential Revision: https://reviews.freebsd.org/D21551
# 4090e217	07-Oct-2019	Mark Johnston <markj@FreeBSD.org>	Assert that the PGA_{WRITEABLE,EXECUTABLE} flags do not leak. Reviewed by: alc, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D21783
# 7b1fbc42	07-Oct-2019	Mateusz Guzik <mjg@FreeBSD.org>	vm: stop trylocking page queues in vm_page_pqbatch_submit About 11 minutes of poudriere -s -j 104 and probing on return value of trylocks reveals that over 10% of attempts fail, which in turn means there are more atomics performed than necessary. Trylocking was there to try preventing migration, but it's not very likely to happen if the lock is uncontested. Reviewed by: markj Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21925
# 7cc833c5	27-Sep-2019	Mark Johnston <markj@FreeBSD.org>	Fix a race in vm_page_swapqueue(). vm_page_swapqueue() atomically transitions a page between queues. To do so, it must hold the page queue lock for the old queue. However, once the queue index has been updated, the queue lock no longer protects the page's queue state. Thus, we must speculatively remove the page from the old queue before committing the queue state update, and roll back if the update fails. Reported and tested by: pho Reviewed by: kib Sponsored by: Intel, Netflix Differential Revision: https://reviews.freebsd.org/D21791
# 2b93f779	25-Sep-2019	Mark Johnston <markj@FreeBSD.org>	Add some counters for per-VM page events. For now, just count batched page queue state operations. vm.stats.page.queue_ops counts the number of batch entries that successfully completed, while queue_nops counts entries that had no effect, which occurs when the queue operation had been completed before the batch entry was processed. Reviewed by: alc, kib MFC after: 1 week Sponsored by: Intel, Netflix Differential Revision: https://reviews.freebsd.org/D21782
# 923da43e7	16-Sep-2019	Mark Johnston <markj@FreeBSD.org>	Fix a race in vm_page_dequeue_deferred_free() after r352110. This function loaded the page's queue index before setting PGA_DEQUEUE. In this window the page daemon may have deactivated the page, updating its queue index. Make the operation atomic using vm_page_pqstate_cmpset(); the page daemon will not modify the page once it observes that PGA_DEQUEUE is set. Reported and tested by: pho Reviewed by: alc, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21639
# 47aef898	16-Sep-2019	Mark Johnston <markj@FreeBSD.org>	Fix a page leak in vm_page_reclaim_run(). After r352110 the attempt to remove mappings of the page being replaced may fail if the page is wired. In this case we must free the replacement page. Reviewed by: alc, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21639
# e8bcf696	16-Sep-2019	Mark Johnston <markj@FreeBSD.org>	Revert r352406, which contained changes I didn't intend to commit.
# 41fd4b94	16-Sep-2019	Mark Johnston <markj@FreeBSD.org>	Fix a couple of nits in r352110. - Remove a dead variable from the amd64 pmap_extract_and_hold(). - Fix grammar in the vm_page_wire man page. Reported by: alc Reviewed by: alc, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21639
# c7575748	10-Sep-2019	Jeff Roberson <jeff@FreeBSD.org>	Replace redundant code with a few new vm_page_grab facilities: - VM_ALLOC_NOCREAT will grab without creating a page. - vm_page_grab_valid() will grab and page in if necessary. - vm_page_busy_acquire() automates some busy acquire loops. Discussed with: alc, kib, markj Tested by: pho (part of larger branch) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21546
# 4cdea4a8	10-Sep-2019	Jeff Roberson <jeff@FreeBSD.org>	Use the sleepq lock rather than the page lock to protect against wakeup races with page busy state. The object lock is still used as an interlock to ensure that the identity stays valid. Most callers should use vm_page_sleep_if_busy() to handle the locking particulars. Reviewed by: alc, kib, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21255
# fee2a2fa	09-Sep-2019	Mark Johnston <markj@FreeBSD.org>	Change synchonization rules for vm_page reference counting. There are several mechanisms by which a vm_page reference is held, preventing the page from being freed back to the page allocator. In particular, holding the page's object lock is sufficient to prevent the page from being freed; holding the busy lock or a wiring is sufficent as well. These references are protected by the page lock, which must therefore be acquired for many per-page operations. This results in false sharing since the page locks are external to the vm_page structures themselves and each lock protects multiple structures. Transition to using an atomically updated per-page reference counter. The object's reference is counted using a flag bit in the counter. A second flag bit is used to atomically block new references via pmap_extract_and_hold() while removing managed mappings of a page. Thus, the reference count of a page is guaranteed not to increase if the page is unbusied, unmapped, and the object's write lock is held. As a consequence of this, the page lock no longer protects a page's identity; operations which move pages between objects are now synchronized solely by the objects' locks. The vm_page_wire() and vm_page_unwire() KPIs are changed. The former requires that either the object lock or the busy lock is held. The latter no longer has a return value and may free the page if it releases the last reference to that page. vm_page_unwire_noq() behaves the same as before; the caller is responsible for checking its return value and freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is introduced for use in pmap_extract_and_hold(). It fails if the page is concurrently being unmapped, typically triggering a fallback to the fault handler. vm_page_wire() no longer requires the page lock and vm_page_unwire() now internally acquires the page lock when releasing the last wiring of a page (since the page lock still protects a page's queue state). In particular, synchronization details are no longer leaked into the caller. The change excises the page lock from several frequently executed code paths. In particular, vm_object_terminate() no longer bounces between page locks as it releases an object's pages, and direct I/O and sendfile(SF_NOCACHE) completions no longer require the page lock. In these latter cases we now get linear scalability in the common scenario where different threads are operating on different files. __FreeBSD_version is bumped. The DRM ports have been updated to accomodate the KPI changes. Reviewed by: jeff (earlier version) Tested by: gallatin (earlier version), pho Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20486
# 7cdeaf33	03-Sep-2019	Mark Johnston <markj@FreeBSD.org>	Add preliminary support for atomic updates of per-page queue state. Queue operations on a page use the page lock when updating the page to reflect the desired queue state, and the page queue lock when physically enqueuing or dequeuing a page. Multiple pages share a given page lock, but queue state is per-page; this false sharing results in heavy lock contention. Take a small step towards the use of atomic_cmpset to synchronize updates to per-page queue state by introducing vm_page_pqstate_cmpset() and using it in the page daemon. In the longer term the plan is to stop using the page lock to protect page identity and rely only on the object and page busy locks. However, since the page daemon avoids acquiring the object lock except when necessary, some synchronization with a concurrent free of the page is required. vm_page_pqstate_cmpset() can be used to ensure that queue state updates are successful only if the page is not scheduled for a dequeue, which is sufficient for the page daemon. Add vm_page_swapqueue(), which moves a page from one queue to another using vm_page_pqstate_cmpset(). Use it in the active queue scan, which does not use the object lock. Modify vm_page_dequeue_deferred() to use vm_page_pqstate_cmpset() as well. Reviewed by: kib Discussed with: jeff Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21257
# 9d75f0dc	03-Sep-2019	Mark Johnston <markj@FreeBSD.org>	Map the vm_page array into KVA on amd64. r351198 allows the kernel to use domain-local memory to back the vm_page array (up to 2MB boundaries) and reserves a separate PML4 entry for that purpose. One consequence of that change is that the vm_page array is no longer present in minidumps, which only adds pages mapped above VM_MIN_KERNEL_ADDRESS. To avoid the friction caused by having kernel data structures mapped below VM_MIN_KERNEL_ADDRESS, map the vm_page array starting at VM_MIN_KERNEL_ADDRESS instead of using a dedicated PML4 entry. Reviewed by: kib Discussed with: jeff Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21491
# a70e17ee	26-Aug-2019	Mark Johnston <markj@FreeBSD.org>	Fix a few nits in vm_pqbatch_process_page(). - Don't bother masking off non-queue state flags when loading the page's atomic state, since it is only required for one of the function's assertions. Update the assertion instead. - Remove an incorrect comment regarding synchronization with the page daemon. The page daemon only ever checks for PGA_ENQUEUED with the page queue lock held. - When clearing requeue flags, only clear the flags that have been acted upon. Reviewed by: kib (previous version) Discussed with: alc Tested by: pho (part of a larger patch) MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21368
# f93670b7	23-Aug-2019	Mark Johnston <markj@FreeBSD.org>	Stop clearing page flags in vm_page_pqbatch_submit(). All existing callers guarantee that the page does not have a pre-existing dequeue pending. Thus, if the page is dequeued before pqbatch_submit() acquires the page queue lock, we do not need to do anything since vm_page_dequeue_complete() takes care of clearing all page queue state flags for us. With this change, vm_page_pqbatch_submit() has the nice property that it does not directly modify any fields in the page structure. Reviewed by: alc, kib Tested by: pho (part of a larger change) MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21372
# 386eba08	23-Aug-2019	Mark Johnston <markj@FreeBSD.org>	Make vm_pqbatch_submit_page() externally visible. It will become useful for the page daemon to be able to directly create a batch queue entry for a page, and without modifying the page structure. Rename vm_pqbatch_submit_page() to vm_page_pqbatch_submit() to keep the namespace consistent. No functional change intended. Reviewed by: alc, kib MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21369
# 8b90607f	21-Aug-2019	Mark Johnston <markj@FreeBSD.org>	Simplify vm_page_dequeue() and fix an assertion. - Add a vm_pagequeue_remove() function to physically remove a page from its queue and update the queue length. - Remove vm_page_pagequeue_lockptr() and let vm_page_pagequeue() return NULL for dequeued pages. - Avoid unnecessarily reloading the queue index if vm_page_dequeue() loses a race with a concurrent queue operation. - Correct an always-true assertion: vm_page_dequeue() may be called from the page allocator with the page unlocked. The assertion m->order == VM_NFREEORDER simply tests whether the page has been removed from the vm_phys free lists; instead, check whether the page belongs to an object. Reviewed by: kib MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21341
# 3e5e1b51	18-Aug-2019	Jeff Roberson <jeff@FreeBSD.org>	Allocate amd64's page array using pages and page directory pages from the NUMA domain that the pages describe. Patch original from gallatin. Reviewed by: kib Tested by: pho Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21252
# b7565d44	18-Aug-2019	Jeff Roberson <jeff@FreeBSD.org>	Encapsulate phys_avail manipulation in a set of simple routines. Add a NUMA aware boot time memory allocator that will be used to allocate early domain correct structures. Code partially submitted by gallatin. Reviewed by: gallatin, kib Tested by: pho Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21251
# 245139c6	16-Aug-2019	Konstantin Belousov <kib@FreeBSD.org>	Fix OOM handling of some corner cases. In addition to pagedaemon initiating OOM, also do it from the vm_fault() internals. Namely, if the thread waits for a free page to satisfy page fault some preconfigured amount of time, trigger OOM. These triggers are rate-limited, due to a usual case of several threads of the same multi-threaded process to enter fault handler simultaneously. The faults from pagedaemon threads participate in the calculation of OOM rate, but are not under the limit. Reviewed by: markj (previous version) Tested by: pho Discussed with: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D13671
# 98549e2d	29-Jul-2019	Mark Johnston <markj@FreeBSD.org>	Centralize the logic in vfs_vmio_unwire() and sendfile_free_page(). Both of these functions atomically unwire a page, optionally attempt to free the page, and enqueue or requeue the page. Add functions vm_page_release() and vm_page_release_locked() to perform the same task. The latter must be called with the page's object lock held. As a side effect of this refactoring, the buffer cache will no longer attempt to free mapped pages when completing direct I/O. This is consistent with the handling of pages by sendfile(SF_NOCACHE). Reviewed by: alc, kib MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20986
# b16e57a6	20-Jul-2019	Mark Johnston <markj@FreeBSD.org>	Rename vm_page_{import,release}() to vm_page_zone_{import,release}(). I would like to use the name vm_page_release() for a different purpose, and vm_page_{import,release}() are local to vm_page.c. Reviewed by: kib MFC after: 1 week
# eeacb3b0	08-Jul-2019	Mark Johnston <markj@FreeBSD.org>	Merge the vm_page hold and wire mechanisms. The hold_count and wire_count fields of struct vm_page are separate reference counters with similar semantics. The remaining essential differences are that holds are not counted as a reference with respect to LRU, and holds have an implicit free-on-last unhold semantic whereas vm_page_unwire() callers must explicitly determine whether to free the page once the last reference to the page is released. This change removes the KPIs which directly manipulate hold_count. Functions such as vm_fault_quick_hold_pages() now return wired pages instead. Since r328977 the overhead of maintaining LRU for wired pages is lower, and in many cases vm_fault_quick_hold_pages() callers would swap holds for wirings on the returned pages anyway, so with this change we remove a number of page lock acquisitions. No functional change is intended. __FreeBSD_version is bumped. Reviewed by: alc, kib Discussed with: jeff Discussed with: jhb, np (cxgbe) Tested by: pho (previous version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19247
# 46736e30	08-Jul-2019	Mark Johnston <markj@FreeBSD.org>	Elide the vm_reserv_free_page() call when PG_PCPU_CACHE is set. Pages with PG_PCPU_CACHE set cannot have been allocated from a reservation, so as an optimization, skip the call to vm_reserv_free_page() in this case. Otherwise, the access of the corresponding reservation structure often results in a cache miss. Reviewed by: alc, kib Discussed with: jeff MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20859
# d9a73522	08-Jul-2019	Mark Johnston <markj@FreeBSD.org>	Add a per-CPU page cache per VM free pool. Some workloads benefit from having a per-CPU cache for VM_FREEPOOL_DIRECT pages. Reviewed by: dougm, kib Discussed with: alc, jeff MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20858
# 9f74cdbf	02-Jul-2019	Mark Johnston <markj@FreeBSD.org>	Mark pages allocated from the per-CPU cache. Only free pages to the cache when they were allocated from that cache. This mitigates rapid fragmentation of physical memory seen during poudriere's dependency calculation phase. In particular, pages belonging to broken reservations are no longer freed to the per-CPU cache, so they get a chance to coalesce with freed pages during the break. Otherwise, the optimized CoW handler may create object chains in which multiple objects contain pages from the same reservation, and the order in which we do object termination means that the reservation is broken before all of those pages are freed, so some of them end up in the per-CPU cache and thus permanently fragment physical memory. The flag may also be useful for eliding calls to vm_reserv_free_page(), thus avoiding memory accesses for data that is likely not present in the CPU caches. Reviewed by: alc Discussed with: jeff MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20763
# 0fd977b3	26-Jun-2019	Mark Johnston <markj@FreeBSD.org>	Add a return value to vm_page_remove(). Use it to indicate whether the page may be safely freed following its removal from the object. Also change vm_page_remove() to assume that the page's object pointer is non-NULL, and have callers perform this check instead. This is a step towards an implementation of an atomic reference counter for each physical page structure. Reviewed by: alc, dougm, kib MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20758
# ee1f1685	19-Jun-2019	Mark Johnston <markj@FreeBSD.org>	Group vm_page_activate()'s definition with other related functions. No functional change intended. MFC after: 3 days
# 2d274871	04-Jun-2019	Mark Johnston <markj@FreeBSD.org>	Remove an outdated header comment for vm_page.c. The listed rules were incomplete and outdated. There is a much more comprehensive comment in vm_page.h. Reviewed by: alc, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D20503
# 2d5039db	02-Jun-2019	Alan Cox <alc@FreeBSD.org>	Retire vm_reserv_extend_{contig,page}(). These functions were introduced as part of a false start toward fine-grained reservation locking. In the end, they were not needed, so eliminate them. Order the parameters to vm_reserv_alloc_{contig,page}() consistently with the vm_page functions that call them. Update the comments about the locking requirements for vm_reserv_alloc_{contig,page}(). They no longer require a free page queues lock. Wrap several lines that became too long after the "req" and "domain" parameters were added to vm_reserv_alloc_{contig,page}(). Reviewed by: kib, markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D20492
# d842aa51	01-Jun-2019	Mark Johnston <markj@FreeBSD.org>	Add a vm_page_wired() predicate. Use it instead of accessing the wire_count field directly. No functional change intended. Reviewed by: alc, kib MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20485
# b8590dae	31-May-2019	Doug Moore <dougm@FreeBSD.org>	The function vm_phys_free_contig invokes vm_phys_free_pages for every power-of-two page block it frees, launching an unsuccessful search for a buddy to pair up with each time. The only possible buddy-up mergers are across the boundaries of the freed region, so change vm_phys_free_contig simply to enqueue the freed interior blocks, via a new function vm_phys_enqueue_contig, and then call vm_phys_free_pages on the bounding blocks to create as big a cross-boundary block as possible after buddy-merging. The only callers of vm_phys_free_contig at the moment call it in situations where merging blocks across the boundary is clearly impossible, so just call vm_phys_enqueue_contig in those places and avoid trying to buddy-up at all. One beneficiary of this change is in breaking reservations. For the case where memory is freed in breaking a reservation with only the first and last pages allocated, the number of cycles consumed by the operation drops about 11% with this change. Suggested by: alc Reviewed by: alc Approved by: kib, markj (mentors) Differential Revision: https://reviews.freebsd.org/D16901
# 42447bb5	31-May-2019	Mark Johnston <markj@FreeBSD.org>	Remove a redundant vm_page_remove() call. vm_page_free_prep() removes the page from its object. No functional change intended. Reviewed by: alc, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D20469
# 3b5b2029	05-Mar-2019	Mark Johnston <markj@FreeBSD.org>	Implement minidump support for RISC-V. Submitted by: Mitchell Horne <mhorne063@gmail.com> Differential Revision: https://reviews.freebsd.org/D18320
# 1e2b3e6f	03-Feb-2019	Mark Johnston <markj@FreeBSD.org>	Allow vm_page_free_prep() to dequeue pages without the page lock. This is a step towards being able to free pages without the page lock held. The approach is simply to add an implementation of vm_page_dequeue_deferred() which does not assert that the page lock is held. Formally, the page lock is required to set PGA_DEQUEUE, but in the case of vm_page_free_prep() we get the same mutual exclusion for free by virtue of the fact that no other references to the page may exist. No functional change intended. Reviewed by: kib (previous version) MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19065
# d0488e69	03-Feb-2019	Mark Johnston <markj@FreeBSD.org>	Fix a race in vm_page_dequeue_deferred(). To detect the case where the page is already marked for a deferred dequeue, we must read the "queue" and "aflags" fields in a precise order. Otherwise, a race with a concurrent vm_page_dequeue_complete() could leave the page with PGA_DEQUEUE set despite it already having been dequeued. Fix the problem by using vm_page_queue() to check the queue state, which correctly handles the race. Reviewed by: kib Tested by: pho MFC after: 3 days Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19039
# bb15d1c7	14-Jan-2019	Gleb Smirnoff <glebius@FreeBSD.org>	o Move zone limit from keg level up to zone level. This means that now two zones sharing a keg may have different limits. Now this is going to work: zone = uma_zcreate(); uma_zone_set_max(zone, limit); zone2 = uma_zsecond_create(zone); uma_zone_set_max(zone2, limit2); Kegs no longer have uk_maxpages field, but zones have uz_items. When set, it may be rounded up to minimum possible CPU bucket cache size. For small limits bucket cache can also be reconfigured to be smaller. Counter uz_items is updated whenever items transition from keg to a bucket cache or directly to a consumer. If zone has uz_maxitems set and it is reached, then we are going to sleep. o Since new limits don't play well with multi-keg zones, remove them. The idea of multi-keg zones was introduced exactly 10 years ago, and never have had a practical usage. In discussion with Jeff we came to a wild agreement that if we ever want to reintroduce the idea of a smart allocator that would be able to choose between two (or more) totally different backing stores, that choice should be made one level higher than UMA, e.g. in malloc(9) or in mget(), or whatever and choice should be controlled by the caller. o Sleeping code is improved to account number of sleepers and wake them one by one, to avoid thundering herd problem. o Flag UMA_ZONE_NOBUCKETCACHE removed, instead uma_zone_set_maxcache() KPI added. Having no bucket cache basically means setting maxcache to 0. o Now with many fields added and many removed (no multi-keg zones!) make sure that struct uma_zone is perfectly aligned. Reviewed by: markj, jeff Tested by: pho Differential Revision: https://reviews.freebsd.org/D17773
# 9cc36b3d	07-Jan-2019	Gleb Smirnoff <glebius@FreeBSD.org>	Fix regression in r331368, that broke dumping of UMA startup pages when WITNESS is present. Discussed with: markj
# 7af49852	30-Dec-2018	Konstantin Belousov <kib@FreeBSD.org>	Add 'v' modifier to the ddb 'show pginfo' command to display vm_page backing the provided kernel virtual address. Tested by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation
# e31fc3ab	29-Nov-2018	Mark Johnston <markj@FreeBSD.org>	Update the free page count when blacklisting pages. Otherwise the free page count will not accurately reflect the physical page allocator's state. On 11 this can trigger panics in vm_page_alloc() since the allocator state and free page count are updated atomically and we expect them to stay in sync. On 12 the bug would manifest as threads looping in vm_page_alloc(). PR: 231296 Reported by: mav, wollman, Rainer Duffner, Josh Gitlin Reviewed by: alc, kib, mav MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18374
# 920239ef	30-Oct-2018	Mark Johnston <markj@FreeBSD.org>	Fix some problems that manifest when NUMA domain 0 is empty. - In uma_prealloc(), we need to check for an empty domain before the first allocation attempt, not after. Fix this by switching uma_prealloc() to use a vm_domainset iterator, which addresses the secondary issue of using a signed domain identifier in round-robin iteration. - Don't automatically create a page daemon for domain 0. - In domainset_empty_vm(), recompute ds_cnt and ds_order after excluding empty domains; otherwise we may frequently specify an empty domain when calling in to the page allocator, wasting CPU time. Convert DOMAINSET_PREF() policies for empty domains to round-robin. - When freeing bootstrap pages, don't count them towards the per-domain total page counts for now: some vm_phys segments are created before the SRAT is parsed and are thus always identified as being in domain 0 even when they are not. Then, when bootstrap pages are freed, they are added to a domain that we had previously thought was empty. Until this is corrected, we simply exclude them from the per-domain page count. Reported and tested by: Rajesh Kumar <rajfbsd@gmail.com> Reviewed by: gallatin MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17704
# 4c29d2de	23-Oct-2018	Mark Johnston <markj@FreeBSD.org>	Refactor domainset iterators for use by malloc(9) and UMA. Before this change we had two flavours of vm_domainset iterators: "page" and "malloc". The latter was only used for kmem_() and hard-coded its behaviour based on kernel_object's policy. Moreover, its use contained a race similar to that fixed by r338755 since the kernel_object's iterator was being run without the object lock. In some cases it is useful to be able to explicitly specify a policy (domainset) or policy+iterator (domainset_ref) when performing memory allocations. To that end, refactor the vm_dominset_ KPI to permit this, and get rid of the "malloc" domainset_iter KPI in the process. Reviewed by: jeff (previous version) Tested by: pho (part of a larger patch) MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17417
# 463406ac	24-Sep-2018	Mark Johnston <markj@FreeBSD.org>	Add more NUMA-specific low memory predicates. Use these predicates instead of inline references to vm_min_domains. Also add a global all_domains set, akin to all_cpus. Reviewed by: alc, jeff, kib Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17278
# 7a364d45	10-Sep-2018	Mark Johnston <markj@FreeBSD.org>	Split some checks in vm_page_activate() to make it easier to read. No functional change intended. Reviewed by: alc, kib Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17028
# 5a7f9937	08-Sep-2018	Mark Johnston <markj@FreeBSD.org>	Relax an assertion in vm_pqbatch_process_page(). While executing vm_pqbatch_process_page(m), m->queue may change to PQ_NONE if the page daemon is concurrently freeing the page. In this case m's queue state flags must be clear, so vm_pqbatch_process_page() will be a no-op, but the race could cause spurious assertion failures. Correct the assertion which assumed that m->queue's value does not change while the page queue lock is held. Reviewed by: alc, kib Reported and tested by: pho Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17027
# 23984ce5	06-Sep-2018	Mark Johnston <markj@FreeBSD.org>	Avoid resource deadlocks when one domain has exhausted its memory. Attempt other allowed domains if the requested domain is below the minimum paging threshold. Block in fork only if all domains available to the forking thread are below the severe threshold rather than any. Submitted by: jeff Reported by: mjg Reviewed by: alc, kib, markj Approved by: re (rgrimes) Differential Revision: https://reviews.freebsd.org/D16191
# 21f01f45	06-Sep-2018	Mark Johnston <markj@FreeBSD.org>	Remove vm_page_remque(). Testing m->queue != PQ_NONE is not sufficient; see the commit log message for r338276. As of r332974 vm_page_dequeue() handles already-dequeued pages, so just replace vm_page_remque() calls with vm_page_dequeue() calls. Reviewed by: kib Tested by: pho Approved by: re (marius) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17025
# 899fe184	23-Aug-2018	Mark Johnston <markj@FreeBSD.org>	Add a per-pagequeue pdpages counter. Expose these counters under the vm.domain sysctl node. The existing vm.stats.vm.v_pdpages sysctl is preserved. Reviewed by: alc (previous version) Differential Revision: https://reviews.freebsd.org/D14666
# 99d92d73	23-Aug-2018	Mark Johnston <markj@FreeBSD.org>	Ensure that queue state is cleared when vm_page_dequeue() returns. Per-page queue state is updated non-atomically, with either the page lock or the page queue lock held. When vm_page_dequeue() is called without the page lock, in rare cases a different thread may be concurrently dequeuing the page with the pagequeue lock held. Because of the non-atomic update, vm_page_dequeue() might return before queue state is completely updated, which can lead to race conditions. Restrict the vm_page_dequeue() interface so that it must be called either with the page lock held or on a free page, and busy wait when a different thread is concurrently updating queue state, which must happen in a critical section. While here, do some related cleanup: inline vm_page_dequeue_locked() into its only caller and delete a prototype for the unimplemented vm_page_requeue_locked(). Replace the volatile qualifier for "queue" added in r333703 with explicit uses of atomic_load_8() where required. Reported and tested by: pho Reviewed by: alc Differential Revision: https://reviews.freebsd.org/D15980
# 2bf95012	05-Jul-2018	Andrew Turner <andrew@FreeBSD.org>	Create a new macro for static DPCPU data. On arm64 (and possible other architectures) we are unable to use static DPCPU data in kernel modules. This is because the compiler will generate PC-relative accesses, however the runtime-linker expects to be able to relocate these. In preparation to fix this create two macros depending on if the data is global or static. Reviewed by: bz, emaste, markj Sponsored by: ABT Systems Ltd Differential Revision: https://reviews.freebsd.org/D16140
# a66d7a8d	05-Jul-2018	Konstantin Belousov <kib@FreeBSD.org>	Copyout(9) on 4/4 i386 needs correct vm_page_array[]. On the 4/4 i386, copyout(9) may need to call pmap_extract_and_hold() on arbitrary userspace mapping. If the mapping is backed by the non-managed cdev pager or by the sg pager, on dense configs we might access arbitrary element of vm_page_array[], in particular, not corresponding to a page from the memory segment. Initialize such pages as fictitious with the corresponding physical address. Reported by: bde Reviewed by: alc, markj (previous version) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D16085
# 89ea39a7	26-Jun-2018	Alan Cox <alc@FreeBSD.org>	Update the physical page selection strategy used by vm_page_import() so that it does not cause rapid fragmentation of the free physical memory. Reviewed by: jeff, markj (an earlier version) Differential Revision: https://reviews.freebsd.org/D15976
# 16e05b32	07-Jun-2018	Jonathan T. Looney <jtl@FreeBSD.org>	Fix a typo in vm_domain_set(). When a domain crosses into the severe range, we need to set the domain bit from the vm_severe_domains bitset (instead of clearing it). Reviewed by: jeff, markj Sponsored by: Netflix, Inc.
# a99ee60b	22-May-2018	Mark Johnston <markj@FreeBSD.org>	Ensure that "m" is initialized in vm_page_alloc_freelist_domain(). While here, remove a superfluous comment. Coverity CID: 1383559 MFC after: 3 days
# ba2b3349	16-May-2018	Mark Johnston <markj@FreeBSD.org>	Fix a race in vm_page_pagequeue_lockptr(). The value of m->queue must be cached after comparing it with PQ_NONE, since it may be concurrently changing. Reported by: glebius Reviewed by: jeff Differential Revision: https://reviews.freebsd.org/D15462
# 1b5c869d	04-May-2018	Mark Johnston <markj@FreeBSD.org>	Fix some races introduced in r332974. With r332974, when performing a synchronized access of a page's "queue" field, one must first check whether the page is logically dequeued. If so, then the page lock does not prevent the page from being removed from its page queue. Intoduce vm_page_queue(), which returns the page's logical queue index. In some cases, direct access to the "queue" field is still required, but such accesses should be confined to sys/vm. Reported and tested by: pho Reviewed by: kib Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15280
# 5cd29d0f	24-Apr-2018	Mark Johnston <markj@FreeBSD.org>	Improve VM page queue scalability. Currently both the page lock and a page queue lock must be held in order to enqueue, dequeue or requeue a page in a given page queue. The queue locks are a scalability bottleneck in many workloads. This change reduces page queue lock contention by batching queue operations. To detangle the page and page queue locks, per-CPU batch queues are used to reference pages with pending queue operations. The requested operation is encoded in the page's aflags field with the page lock held, after which the page is enqueued for a deferred batch operation. Page queue scans are similarly optimized to minimize the amount of work performed with a page queue lock held. Reviewed by: kib, jeff (previous versions) Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14893
# 64b38930	19-Apr-2018	Mark Johnston <markj@FreeBSD.org>	Initialize marker pages in vm_page_domain_init(). They were previously initialized by the corresponding page daemon threads, but for vmd_inacthead this may be too late if vm_page_deactivate_noreuse() is called during boot. Reported and tested by: cperciva Reviewed by: alc, kib MFC after: 1 week
# 9de8fcfd	17-Apr-2018	Mark Johnston <markj@FreeBSD.org>	Ensure that m and skip_m belong to the same object. Pages allocated from a given reservation may belong to different objects. It is therefore possible for vm_page_ps_test() to be called with the base page's object unlocked. Check for this case before asserting that the object lock is held. Reported by: jhb Reviewed by: kib MFC after: 1 week
# e55d32b7	07-Apr-2018	Konstantin Belousov <kib@FreeBSD.org>	Handle Skylake-X errata SKZ63. SKZ63 Processor May Hang When Executing Code In an HLE Transaction Region Problem: Under certain conditions, if the processor acquires an HLE (Hardware Lock Elision) lock via the XACQUIRE instruction in the Host Physical Address range between 40000000H and 403FFFFFH, it may hang with an internal timeout error (MCACOD 0400H) logged into IA32_MCi_STATUS. Move the pages from the range into the blacklist. Add a tunable to not waste 4M if local DoS is not the issue. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D15001
# c33e3a64	31-Mar-2018	Jeff Roberson <jeff@FreeBSD.org>	Add a uma cache of free pages in the DEFAULT freepool. This gives us per-cpu alloc and free of pages. The cache is filled with as few trips to the phys allocator as possible by the use of a new vm_phys_alloc_npages() function which allocates as many as N pages. This code was originally by markj with the import function rewritten by me. Reviewed by: markj, kib Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14905
# e5818a53	28-Mar-2018	Jeff Roberson <jeff@FreeBSD.org>	Implement several enhancements to NUMA policies. Add a new "interleave" allocation policy which stripes pages across domains with a stride or width keeping contiguity within a multi-page region. Move the kernel to the dedicated numbered cpuset #2 making it possible to assign kernel threads and memory policy separately from user. This also eliminates the need for the complicated interrupt binding code. Add a sysctl API for viewing and manipulating domainsets. Refactor some of the cpuset_t manipulation code using the generic bitset type so that it can be used for both. This probably belongs in a dedicated subr file. Attempt to improve the include situation. Reviewed by: kib Discussed with: jhb (cpuset parts) Tested by: pho (before review feedback) Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14839
# 2d3f4181	23-Mar-2018	Jeff Roberson <jeff@FreeBSD.org>	Fix two compliation problems on non-amd64 architectures.
# 40468513	23-Mar-2018	Mark Johnston <markj@FreeBSD.org>	Correct a couple of assertion messages in vm_page_reclaim_run(). MFC after: 3 days
# 5c930c89	22-Mar-2018	Jeff Roberson <jeff@FreeBSD.org>	Lock reservations with a dedicated lock in each reservation. Protect the vmd_free_count with atomics. This allows us to allocate and free from reservations without the free lock except where a superpage is allocated from the physical layer, which is roughly 1/512 of the operations on amd64. Use the counter api to eliminate cache conention on counters. Reviewed by: markj Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14707
# 9a4b4cd3	22-Mar-2018	Jeff Roberson <jeff@FreeBSD.org>	Start witness much earlier in boot so that we can shrink the pend list and make it more immune to further change. Reviewed by: markj, imp (Part of D14707) Sponsored by: Netflix, Dell/EMC Isilon
# 0eb50f9c	18-Mar-2018	Mark Johnston <markj@FreeBSD.org>	Have vm_page_{deactivate,launder}() requeue already-queued pages. In many cases the page is not enqueued so the change will have no effect. However, the change is needed to support an optimization in the fault handler and in some cases (sendfile, the buffer cache) it was being emulated by the caller anyway. Reviewed by: alc Tested by: pho MFC after: 2 weeks X-Differential Revision: https://reviews.freebsd.org/D14625
# 434862ac	18-Mar-2018	Mark Johnston <markj@FreeBSD.org>	Have vm_page_replace() assert that the new page is not enqueued. The new page does not belong to a VM object, but the page daemon does not expect to encounter such pages. Reviewed by: alc, kib Tested by: pho MFC after: 1 week X-Differential Revision: https://reviews.freebsd.org/D14625
# 30fbfdda	15-Mar-2018	Jeff Roberson <jeff@FreeBSD.org>	Eliminate pageout wakeup races. Take another step towards lockless vmd_free_count manipulation. Reduce the scope of the free lock by using a pageout lock to synchronize sleep and wakeup. Only trigger the pageout daemon on transitions between states. Drive all wakeup operations directly as side-effects from freeing memory rather than requiring an additional function call. Reviewed by: markj, kib Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14612
# 741e1c91	13-Mar-2018	Konstantin Belousov <kib@FreeBSD.org>	Revert the chunk from r330410 in vm_page_reclaim_run(). There, the pages freed might be managed but the page's lock is not owned. For KPI correctness, the page lock is requried around the call to vm_page_free_prep(), which is asserted. Reclaim loop already did the work which could be done by vm_page_free_prep(), so the lock is not needed and the only consequence of not owning it is the assert trigger. Instead of adding the locking to satisfy the assert, revert to the code that calls vm_page_free_phys() directly. Reported by: pho Discussed with: jeff Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 2a8e8f78	04-Mar-2018	Konstantin Belousov <kib@FreeBSD.org>	Remove redundant test from r330410. If the input slist is non-empty, counter cannot be zero after freeing. Noted by: mjg MFC after: 2 weeks
# 8c8ee2ee	04-Mar-2018	Konstantin Belousov <kib@FreeBSD.org>	Unify bulk free operations in several pmaps. Submitted by: Yoshihiro Ota Reviewed by: markj MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D13485
# 9140bff7	23-Feb-2018	Mark Johnston <markj@FreeBSD.org>	Remove a bogus assertion from vm_page_launder(). After r328977, a wired page m may have m->queue != PQ_NONE. Reviewed by: kib X-MFC with: r328977 Differential Revision: https://reviews.freebsd.org/D14485
# 5f8cd1c0	23-Feb-2018	Jeff Roberson <jeff@FreeBSD.org>	Add a generic Proportional Integral Derivative (PID) controller algorithm and use it to regulate page daemon output. This provides much smoother and more responsive page daemon output, anticipating demand and avoiding pageout stalls by increasing the number of pages to match the workload. This is a reimplementation of work done by myself and mlaier at Isilon. Reviewed by: bsdimp Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14402
# 2c0f13aa	20-Feb-2018	Konstantin Belousov <kib@FreeBSD.org>	vm_wait() rework. Make vm_wait() take the vm_object argument which specifies the domain set to wait for the min condition pass. If there is no object associated with the wait, use curthread' policy domainset. The mechanics of the wait in vm_wait() and vm_wait_domain() is supplied by the new helper vm_wait_doms(), which directly takes the bitmask of the domains to wait for passing min condition. Eliminate pagedaemon_wait(). vm_domain_clear() handles the same operations. Eliminate VM_WAIT and VM_WAITPFAULT macros, the direct functions calls are enough. Eliminate several control state variables from vm_domain, unneeded after the vm_wait() conversion. Scetched and reviewed by: jeff Tested by: pho Sponsored by: The FreeBSD Foundation, Mellanox Technologies Differential revision: https://reviews.freebsd.org/D14384
# e958ad4c	12-Feb-2018	Jeff Roberson <jeff@FreeBSD.org>	Make v_wire_count a per-cpu counter(9) counter. This eliminates a significant source of cache line contention from vm_page_alloc(). Use accessors and vm_page_unwire_noq() so that the mechanism can be easily changed in the future. Reviewed by: markj Discussed with: kib, glebius Tested by: pho (earlier version) Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14273
# f7d35785	08-Feb-2018	Gleb Smirnoff <glebius@FreeBSD.org>	Fix boot_pages exhaustion on machines with many domains and cores, where size of UMA zone allocation is greater than page size. In this case zone of zones can not use UMA_MD_SMALL_ALLOC, and we need to postpone switch off of this zone from startup_alloc() until full launch of VM. o Always supply number of VM zones to uma_startup_count(). On machines with UMA_MD_SMALL_ALLOC ignore it completely, unless zsize goes over a page. In the latter case account VM zones for number of allocations from the zone of zones. o Rewrite startup_alloc() so that it will immediately switch off from itself any zone that is already capable of running real alloc. In worst case scenario we may leak a single page here. See comment in uma_startup_count(). o Hardcode call to uma_startup2() into vm_mem_init(). Otherwise some extra SYSINITs, e.g. vm_page_init() may sneak in before. o While here, remove uma_boot_pages_mtx. With recent changes to boot pages calculation, we are guaranteed to use all of the boot_pages in the early single threaded stage. Reported & tested by: mav
# 5073a083	07-Feb-2018	Gleb Smirnoff <glebius@FreeBSD.org>	Fix three miscalculations in amount of boot pages: o Most of startup zones have struct uma_slab embedded into the slab, so provide macro UMA_SLAB_SPACE and use it instead of UMA_SLAB_SIZE, when calculating how many pages would certain kind of allocations require. Some zones are offpage, so we might have a positive inaccuracy. o The keg for the zone of zones is allocated "dynamically", so we need +1 when calculating amount of pages for kegs. [1] o The zones of zones and zones of kegs have arbitrary alignment of 32, and this also needs to be accounted for. [2] While here, spread more comments and improve diagnostic messages. Reported by: pho [1], jtl [2]
# 1d3a1bcf	07-Feb-2018	Mark Johnston <markj@FreeBSD.org>	Dequeue wired pages lazily. Previously, wiring a page would cause it to be removed from its page queue. In the common case, unwiring causes it to be enqueued at the tail of that page queue. This change modifies vm_page_wire() to not dequeue the page, thus avoiding the highly contended page queue locks. Instead, vm_page_unwire() takes care of requeuing the page as a single operation, and the page daemon dequeues wired pages as they are encountered during a queue scan to avoid needlessly revisiting them later. For pages in PQ_ACTIVE we do even better, since a requeue is unnecessary. The change improves scalability for some common workloads. For instance, threads wiring pages into the buffer cache no longer need to modify global page queues, and unwiring is usually done by the bufspace thread, so concurrency is not as much of an issue. As another example, many sysctl handlers wire the output buffer to avoid faults on copyout, and since the buffer is likely to be in PQ_ACTIVE, we now entirely avoid modifying the page queue in this case. The change also adds a block comment describing some properties of struct vm_page's reference counters, and the busy lock. Reviewed by: jeff Discussed with: alc, kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D11943
# e2068d0b	06-Feb-2018	Jeff Roberson <jeff@FreeBSD.org>	Use per-domain locks for vm page queue free. Move paging control from global to per-domain state. Protect reservations with the free lock from the domain that they belong to. Refactor to make vm domains more of a first class object. Reviewed by: markj, kib, gallatin Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14000
# ae941b1b	06-Feb-2018	Gleb Smirnoff <glebius@FreeBSD.org>	Fix boot_pages calculation for machines that don't have UMA_MD_SMALL_ALLOC. o Call uma_startup1() after initializing kmem, vmem and domains. o Include 8 eight VM startup pages into uma_startup_count() calculation. o Account for vmem_startup() and vm_map_startup() preallocating pages. o Account for extra two allocations done by kmem_init() and vmem_create(). o Hardcode the place of execution of vm_radix_reserve_kva(). Using SYSINIT allowed several other SYSINITs to sneak in before it, thus bumping requirement for amount of boot pages.
# f4bef67c	05-Feb-2018	Gleb Smirnoff <glebius@FreeBSD.org>	Followup on r302393 by cperciva, improving calculation of boot pages required for UMA startup. o Introduce another stage of UMA startup, which is entered after vm_page_startup() finishes. After this stage we don't yet enable buckets, but we can ask VM for pages. Rename stages to meaningful names while here. New list of stages: BOOT_COLD, BOOT_STRAPPED, BOOT_PAGEALLOC, BOOT_BUCKETS, BOOT_RUNNING. Enabling page alloc earlier allows us to dramatically reduce number of boot pages required. What is more important number of zones becomes consistent across different machines, as no MD allocations are done before the BOOT_PAGEALLOC stage. Now only UMA internal zones actually need to use startup_alloc(), however that may change, so vm_page_startup() provides its need for early zones as argument. o Introduce uma_startup_count() function, to avoid code duplication. The functions calculates sizes of zones zone and kegs zone, and calculates how many pages UMA will need to bootstrap. It counts not only of zone structures, but also of kegs, slabs and hashes. o Hide uma_startup_foo() declarations from public file. o Provide several DIAGNOSTIC printfs on boot_pages usage. o Bugfix: when calculating zone of zones size use (mp_maxid + 1) instead of mp_ncpus. Use resulting number not only in the size argument to zone_ctor() but also as args.size. Reviewed by: imp, gallatin (earlier version) Differential Revision: https://reviews.freebsd.org/D14054
# 9a8196ce	19-Jan-2018	Nathan Whitehorn <nwhitehorn@FreeBSD.org>	Remove SFBUF_OPTIONAL_DIRECT_MAP and such hacks, replacing them across the kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv, and powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can be evaluated at runtime to determine if the architecture has a direct map; if it does not (or does) unconditionally and PMAP_HAS_DMAP is either 0 or 1, the compiler can remove the conditional logic. As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which had similar things but spelled differently. 32-bit MIPS has a partial direct-map that maps poorly to this concept and is unchanged. Reviewed by: kib Suggestions from: marius, alc, kib Runtime tested on: amd64, powerpc64, powerpc, mips64
# 3f289c3f	12-Jan-2018	Jeff Roberson <jeff@FreeBSD.org>	Implement 'domainset', a cpuset based NUMA policy mechanism. This allows userspace to control NUMA policy administratively and programmatically. Implement domainset based iterators in the page layer. Remove the now legacy numa_* syscalls. Cleanup some header polution created by having seq.h in proc.h. Reviewed by: markj, kib Discussed with: alc Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D13403
# 280d15cd	24-Dec-2017	Mark Johnston <markj@FreeBSD.org>	Fix two problems with the page daemon control loop. Both issues caused the page daemon to erroneously go to sleep when applications are consuming free pages at a high rate, leaving the application threads blocked in VM_WAIT. 1) After completing an inactive queue scan, concurrent allocations may have prevented the page daemon from meeting the v_free_min threshold. In this case, the page daemon was going to sleep even when the inactive queue contained plenty of clean pages. 2) pagedaemon_wakeup() may be called without the free queues lock held. This can lead to a lost wakeup if a call occurs after the page daemon clears vm_pageout_wanted but before going to sleep. Fix 1) by ensuring that we start a new inactive queue scan immediately if v_free_count < v_free_min after a prior scan. Fix 2) by adding a new subroutine, pagedaemon_wait(), called from vm_wait() and vm_waitpfault(). It wakes up the page daemon if either vm_pages_needed or vm_pageout_wanted is false, and atomically sleeps on v_free_count. Reported by: jeff Reviewed by: alc MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D13424
# 0db2102a	04-Dec-2017	Michael Zhilin <mizhka@FreeBSD.org>	[mips] [vm] restore translation of freelist to flind for page allocation Commit r326346 moved domain iterators from physical layer to vm_page one, but it also removed translation of freelist to flind for vm_page_alloc_freelist() call. Before it expects VM_FREELIST_ parameter, but after it expect freelist index. On small WiFi boxes with few megabytes of RAM, there is only one freelist VM_FREELIST_LOWMEM (1) and there is no VM_FREELIST_DEFAULT(0) (see file sys/mips/include/vmparam.h). It results in freelist 1 with flind 0. At first, this commit renames flind to freelist in vm_page_alloc_freelist to avoid misunderstanding about input parameters. Then on physical layer it restores translation for correct handling of freelist parameter. Reported by: landonf Reviewed by: jeff Differential Revision: https://reviews.freebsd.org/D13351
# 796df753	30-Nov-2017	Pedro F. Giffuni <pfg@FreeBSD.org>	SPDX: Consider code from Carnegie-Mellon University. Interesting cases, most likely from CMU Mach sources.
# 1084894f	29-Nov-2017	Mark Johnston <markj@FreeBSD.org>	Remove some comments that became incorrect with r325530.
# ef435ae7	28-Nov-2017	Jeff Roberson <jeff@FreeBSD.org>	Move domain iterators into the page layer where domain selection should take place. This makes the majority of the phys layer explicitly domain specific. Reviewed by: markj, kib (some objections) Discussed with: alc Tested by: pho Sponsored by: Netflix & Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13014
# d2b677ce	27-Nov-2017	Mark Johnston <markj@FreeBSD.org>	Avoid unnecessary lookups when initializing the vm_page array. This gives a marginal improvement in the vm_page_array initialization time. Also garbage-collect the now-unused vm_phys_paddr_to_segind(). Reviewed by: alc, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13270
# b20bf182	26-Nov-2017	Mark Johnston <markj@FreeBSD.org>	Move vm_phys_init_page() to vm_page.c. Suggested by: kib Reviewed by: alc, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13250
# 5070d56d	21-Nov-2017	Mark Johnston <markj@FreeBSD.org>	Allow for fictitious physical pages in vm_page_scan_contig(). Some drm2 drivers will set PG_FICTITIOUS in physical pages in order to satisfy the OBJT_MGTDEVICE object interface, so a scan may encounter fictitous pages. For now, allow for this possibility; such pages will be skipped later in the scan since they are wired. Reported by: avg Reviewed by: kib MFC after: 1 week
# 51369649	20-Nov-2017	Pedro F. Giffuni <pfg@FreeBSD.org>	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.
# 8d6fbbb8	07-Nov-2017	Jeff Roberson <jeff@FreeBSD.org>	Replace manyinstances of VM_WAIT with blocking page allocation flags similar to the kernel memory allocator. This simplifies NUMA allocation because the domain will be known at wait time and races between failure and sleeping are eliminated. This also reduces boilerplate code and simplifies callers. A wait primitive is supplied for uma zones for similar reasons. This eliminates some non-specific VM_WAIT calls in favor of more explicit sleeps that may be satisfied without new pages. Reviewed by: alc, kib, markj Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon
# 3a757e54	24-Oct-2017	Alan Cox <alc@FreeBSD.org>	Micro-optimize the handling of fictitious pages in vm_page_free_prep(). A fictitious page is always wired, so there is no point in trying to remove one from the page queues. Completely remove one inaccurate comment from vm_page_free_prep() and correct another. Reviewed by: kib, markj MFC after: 1 week
# 422fe502	21-Oct-2017	Konstantin Belousov <kib@FreeBSD.org>	Check that the page which is freed as zeroed, indeed has all-zero content. This catches some rare mysterious failures at the source. The check is only performed on architectures which implement direct map, and only enabled with option DIAGNOSTIC, similar to other costly consistency checks. Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 43139893	20-Oct-2017	Konstantin Belousov <kib@FreeBSD.org>	In vm_page_free_phys_pglist(), do not take vm_page_queue_free_mtx if there is nothing to do. Suggested by: mjg Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 1dbf52e7	13-Oct-2017	Mateusz Guzik <mjg@FreeBSD.org>	Reduce traffic on vm_cnt.v_free_count The variable is modified with the highly contended page free queue lock. It unnecessarily shares a cacheline with purely read-only fields and is re-read after the lock is dropped in the page allocation code making the hold time longer. Pad the variable just like the others and store the value as found with the lock held instead of re-reading. Provides a modest 1%-ish speed up in concurrent page faults. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D12665
# 43cc906f	24-Sep-2017	Alan Cox <alc@FreeBSD.org>	Change vm_page_try_to_free() to require a managed page. Essentially, vm_page_try_to_free() is testing conditions, like clean versus dirty, that only vary in managed pages. Suggested by: kib Reviewed by: markj X-MFC after: never
# 494c6e43	24-Sep-2017	Alan Cox <alc@FreeBSD.org>	Optimize vm_page_try_to_free(). Specifically, the call to pmap_remove_all() can be avoided when the page's containing object has a reference count of zero. (If the object has a reference count of zero, then none of its pages can possibly be mapped.) Address nearby style issues in vm_page_try_to_free(), and change its return type to "bool". Reviewed by: kib, markj MFC after: 1 week
# e82e50e6	13-Sep-2017	Konstantin Belousov <kib@FreeBSD.org>	Remove inline specifier from vm_page_free_wakeup(), do not micro-manage compiler. Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 540ac3b3	13-Sep-2017	Konstantin Belousov <kib@FreeBSD.org>	Split vm_page_free_toq() into two parts, preparation vm_page_free_prep() and insertion into the phys allocator free queues vm_page_free_phys(). Also provide a wrapper vm_page_free_phys_pglist() for batched free. Reviewed by: alc, markj Tested by: mjg (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 2934eb8a	13-Sep-2017	Mark Johnston <markj@FreeBSD.org>	Fix a logic error in the item size calculation for internal UMA zones. Kegs for internal zones always keep the slab header in the slab itself. Therefore, when determining the allocation size, we need to take the slab header size into account. Reported and tested by: ae, rakuco Reviewed by: avg MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D12342
# 93c5d3a4	09-Sep-2017	Konstantin Belousov <kib@FreeBSD.org>	Add a vm_page_change_lock() helper, the common code to not relock page lock if both old and new pages use the same underlying lock. Convert existing places to use the helper instead of inlining it. Use the optimization in vm_object_page_remove(). Suggested and reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week
# f93f7cf1	07-Sep-2017	Mark Johnston <markj@FreeBSD.org>	Speed up vm_page_array initialization. We currently initialize the vm_page array in three passes: one to zero the array, one to initialize the "order" field of each page (necessary when inserting them into the vm_phys buddy allocator one-by-one), and one to initialize the remaining non-zero fields and individually insert each page into the allocator. Merge the three passes into one following a suggestion from alc: initialize vm_page fields in a single pass, and use vm_phys_free_contig() to efficiently insert physical memory segments into the buddy allocator. This reduces the initialization time to a third or a quarter of what it was before on most systems that I tested. Reviewed by: alc, kib MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D12248
# fe933c1d	06-Sep-2017	Mateusz Guzik <mjg@FreeBSD.org>	Start annotating global _padalign locks with __exclusive_cache_line While these locks are guarnteed to not share their respective cache lines, their current placement leaves unnecessary holes in lines which preceeded them. For instance the annotation of vm_page_queue_free_mtx allows 2 neighbour cachelines (previously separate by the lock) to be collapsed into 1. The annotation is only effective on architectures which have it implemented in their linker script (currently only amd64). Thus locks are not converted to their not-padaligned variants as to not affect the rest. MFC after: 1 week
# 33fff5d5	15-Aug-2017	Mark Johnston <markj@FreeBSD.org>	Add vm_page_alloc_after(). This is a variant of vm_page_alloc() which accepts an additional parameter: the page in the object with largest index that is smaller than the requested index. vm_page_alloc() finds this page using a lookup in the object's radix tree, but in some cases its identity is already known, allowing the lookup to be elided. Modify kmem_back() and vm_page_grab_pages() to use vm_page_alloc_after(). vm_page_alloc() is converted into a trivial wrapper of vm_page_alloc_after(). Suggested by: alc Reviewed by: alc, kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D11984
# 9df950b3	11-Aug-2017	Mark Johnston <markj@FreeBSD.org>	Modify vm_page_grab_pages() to handle VM_ALLOC_NOWAIT. This will allow its use in sendfile_swapin(). Reviewed by: alc, kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D11942
# 2c642ec1	10-Aug-2017	Mark Johnston <markj@FreeBSD.org>	Make vm_page_sunbusy() assert that the page is unlocked. Reviewed by: kib MFC after: 1 week Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D11946
# 5471caf6	08-Aug-2017	Alan Cox <alc@FreeBSD.org>	Introduce vm_page_grab_pages(), which is intended to replace loops calling vm_page_grab() on consecutive page indices. Besides simplifying the code in the caller, vm_page_grab_pages() allows for batching optimizations. For example, the current implementation replaces calls to vm_page_lookup() on consecutive page indices by cheaper calls to vm_page_next(). Reviewed by: kib, markj Tested by: pho (an earlier version) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D11926
# 1d3b9818	22-Jul-2017	Alan Cox <alc@FreeBSD.org>	In vm_page_ps_test(), always check that the base pages within the specified superpage all belong to the same object. To date, that check has not been needed, but upcoming changes require it. (See the Differential Revision.) Reviewed by: kib, markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D11556
# 88302601	13-Jul-2017	Alan Cox <alc@FreeBSD.org>	Generalize vm_page_ps_is_valid() to support testing other predicates on the (super)page, renaming the function to vm_page_ps_test(). Reviewed by: kib, markj MFC after: 1 week
# 4bd7e351	08-Jun-2017	John Baldwin <jhb@FreeBSD.org>	Fix an off-by-one error in the VM page array on some systems. r31386 changed how the size of the VM page array was calculated to be less wasteful. For most systems, the amount of memory is divided by the overhead required by each page (a page of data plus a struct vm_page) to determine the maximum number of available pages. However, if the remainder for the first non-available page was at least a page of data (so that the only memory missing was a struct vm_page), this last page was left in phys_avail[] but was not allocated an entry in the VM page array. Handle this case by explicitly excluding the page from phys_avail[]. Reviewed by: alc Sponsored by: DARPA / AFRL Differential Revision: https://reviews.freebsd.org/D11000
# 83c9dea1	17-Apr-2017	Gleb Smirnoff <glebius@FreeBSD.org>	- Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter in place. To do per-cpu stats, convert all fields that previously were maintained in the vmmeters that sit in pcpus to counter(9). - Since some vmmeter stats may be touched at very early stages of boot, before we have set up UMA and we can do counter_u64_alloc(), provide an early counter mechanism: o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter. o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter, so that at early stages of boot, before counters are allocated we already point to a counter that can be safely written to. o For sparc64 that required a whole dummy pcpu[MAXCPU] array. Further related changes: - Don't include vmmeter.h into pcpu.h. - vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit, to match kernel representation. - struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion. This is based on benno@'s 4-year old patch: https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html Reviewed by: kib, gallatin, marius, lidl Differential Revision: https://reviews.freebsd.org/D10156
# fbbd9655	28-Feb-2017	Warner Losh <imp@FreeBSD.org>	Renumber copyright clause 4 Renumber cluase 4 to 3, per what everybody else did when BSD granted them permission to remove clause 3. My insistance on keeping the same numbering for legal reasons is too pedantic, so give up on that point. Submitted by: Jan Schaumann <jschauma@stevens.edu> Pull Request: https://github.com/freebsd/freebsd/pull/96
# 8a99f1cc	03-Feb-2017	Alan Cox <alc@FreeBSD.org>	Over the years, the code and comments in vm_page_startup() have diverged in one respect. When determining how many page structures to allocate, contrary to what the comments say, the code does not account for the overhead of a page structure per page of physical memory. This revision changes the code to match the comments. Reviewed by: kib, markj MFC after: 6 weeks Differential Revision: https://reviews.freebsd.org/D9081
# c2655a40	14-Jan-2017	Mark Johnston <markj@FreeBSD.org>	Avoid unnecessary page lookups in vm_object_madvise(). vm_object_madvise() is frequently used to apply advice to a contiguous set of pages in an object with no backing object. Optimize this case by skipping non-resident subranges in constant time, and by iterating over resident pages using the object memq, thus avoiding radix tree lookups on each page index in the specified range. While here, move MADV_WILLNEED handling to vm_page_advise(), and rename the "advise" parameter to vm_object_madvise() to "advice." Reviewed by: alc, kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D9098
# bfc8c24c	04-Jan-2017	Gleb Smirnoff <glebius@FreeBSD.org>	Move bogus_page declaration to vm_page.h and initialization to vm_page.c. Reviewed by: kib
# b1fd102e	02-Jan-2017	Mark Johnston <markj@FreeBSD.org>	Add a page queue for holding dirty anonymous unswappable pages. On systems without a configured swap device, an attempt to launder pages from a swap object will always fail and result in the page being reactivated. This means that the page daemon will continuously scan pages that can never be evicted. With this change, anonymous pages are instead moved to PQ_UNSWAPPABLE after a failed laundering attempt when no swap devices are configured. PQ_UNSWAPPABLE is not scanned unless a swap device is configured, so unreferenced unswappable pages are excluded from the page daemon's workload. Reviewed by: alc
# 0c8bd6a7	30-Dec-2016	Konstantin Belousov <kib@FreeBSD.org>	Assert that the pages found on the object queue by vm_page_next() and vm_page_prev() have correct ownership. In collaboration with: alc Sponsored by: The FreeBSD Foundation (kib) MFC after: 1 week
# 920da7e4	28-Dec-2016	Alan Cox <alc@FreeBSD.org>	Relax the object type restrictions on vm_page_alloc_contig(). Specifically, add support for object types that were previously prohibited because they could contain PG_CACHED pages. Roughly halve the number of radix trie operations performed by vm_page_alloc_contig() using the same approach that is employed by vm_page_alloc(). Also, eliminate the radix trie lookup performed with the free page queues lock held. Tidy up the handling of radix trie insert failures in vm_page_alloc() and vm_page_alloc_contig(). Reviewed by: kib, markj Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8878
# 3453bca8	12-Dec-2016	Alan Cox <alc@FreeBSD.org>	Eliminate every mention of PG_CACHED pages from the comments in the machine- independent layer of the virtual memory system. Update some of the nearby comments to eliminate redundancy and improve clarity. In vm/vm_reserv.c, do not use hyphens after adverbs ending in -ly per The Chicago Manual of Style. Update the comment in vm/vm_page.h defining the four types of page queues to reflect the elimination of PG_CACHED pages and the introduction of the laundry queue. Reviewed by: kib, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8752
# e94965d8	07-Dec-2016	Alan Cox <alc@FreeBSD.org>	Previously, vm_radix_remove() would panic if the radix trie didn't contain a vm_page_t at the specified index. However, with this change, vm_radix_remove() no longer panics. Instead, it returns NULL if there is no vm_page_t at the specified index. Otherwise, it returns the vm_page_t. The motivation for this change is that it simplifies the use of radix tries in the amd64, arm64, and i386 pmap implementations. Instead of performing a lookup before every remove, the pmap can simply perform the remove. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D8708
# ba673696	26-Nov-2016	Alan Cox <alc@FreeBSD.org>	Recursion on the free page queue mutex occurred when UMA needed to allocate a new page of radix trie nodes to complete a vm_radix_insert() operation that was requested by vm_page_cache(). Specifically, vm_page_cache() already held the free page queue lock when UMA tried to acquire it through a call to vm_page_alloc(). This code path no longer exists, so there is no longer any reason to allow recursion on the free page queue mutex. Improve nearby comments. Reviewed by: kib, markj Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8628
# bba39b9a	22-Nov-2016	Alan Cox <alc@FreeBSD.org>	Remove PG_CACHED-related fields from struct vmmeter, because they are no longer used. More precisely, they are always zero because the code that decremented and incremented them no longer exists. Bump __FreeBSD_version to mark this change. Reviewed by: kib, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8583
# 7667839a	15-Nov-2016	Alan Cox <alc@FreeBSD.org>	Remove most of the code for implementing PG_CACHED pages. (This change does not remove user-space visible fields from vm_cnt or all of the references to cached pages from comments. Those changes will come later.) Reviewed by: kib, markj Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8497
# ebcddc72	09-Nov-2016	Alan Cox <alc@FreeBSD.org>	Introduce a new page queue, PQ_LAUNDRY, for storing unreferenced, dirty pages, specificially, dirty pages that have passed once through the inactive queue. A new, dedicated thread is responsible for both deciding when to launder pages and actually laundering them. The new policy uses the relative sizes of the inactive and laundry queues to determine whether to launder pages at a given point in time. In general, this leads to more intelligent swapping behavior, since the laundry thread will avoid pageouts when the marginal benefit of doing so is low. Previously, without a dedicated queue for dirty pages, the page daemon didn't have the information to determine whether pageout provides any benefit to the system. Thus, the previous policy often resulted in small but steadily increasing amounts of swap usage when the system is under memory pressure, even when the inactive queue consisted mostly of clean pages. This change addresses that issue, and also paves the way for some future virtual memory system improvements by removing the last source of object-cached clean pages, i.e., PG_CACHE pages. The new laundry thread sleeps while waiting for a request from the page daemon thread(s). A request is raised by setting the variable vm_laundry_request and waking the laundry thread. We request launderings for two reasons: to try and balance the inactive and laundry queue sizes ("background laundering"), and to quickly make up for a shortage of free pages and clean inactive pages ("shortfall laundering"). When background laundering is requested, the laundry thread computes the number of page daemon wakeups that have taken place since the last laundering. If this number is large enough relative to the ratio of the laundry and (global) inactive queue sizes, we will launder vm_background_launder_target pages at vm_background_launder_rate KB/s. Otherwise, the laundry thread goes back to sleep without doing any work. When scanning the laundry queue during background laundering, reactivated pages are counted towards the laundry thread's target. In contrast, shortfall laundering is requested when an inactive queue scan fails to meet its target. In this case, the laundry thread attempts to launder enough pages to meet v_free_target within 0.5s, which is the inactive queue scan period. A laundry request can be latched while another is currently being serviced. In particular, a shortfall request will immediately preempt a background laundering. This change also redefines the meaning of vm_cnt.v_reactivated and removes the functions vm_page_cache() and vm_page_try_to_cache(). The new meaning of vm_cnt.v_reactivated now better reflects its name. It represents the number of inactive or laundry pages that are returned to the active queue on account of a reference. In collaboration with: markj Reviewed by: kib Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8302
# 1771e987	03-Nov-2016	Konstantin Belousov <kib@FreeBSD.org>	Do not sleep in vm_wait() if pagedaemon did not yet started. Panic instead. Requests which cannot be satisfied by allocators at boot time often have unrealizable parameters. Waiting for the pagedaemon' start would hang the boot if done in the thread0 context and just never succeed if executed from another thread. In fact, for very early stages, sleep attempt panics with obscure diagnostic about the scheduler state, and explicit panic in vm_wait() makes the investigation much shorter by cut off the examination of the thread and scheduler. Theoretically, some subsystem might grab a resource to exhaustion, and free it later in the boot process. If this unlikely scenario does appear for real, the way to diagnose the trouble can be revisited. Reported by: emaste Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D8421
# bd9546a2	17-Oct-2016	Konstantin Belousov <kib@FreeBSD.org>	Export vm_page_xunbusy_maybelocked(). Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week X-Differential revision: https://reviews.freebsd.org/D8197
# 5975e53d	13-Oct-2016	Konstantin Belousov <kib@FreeBSD.org>	Fix a race in vm_page_busy_sleep(9). Suppose that we have an exclusively busy page, and a thread which can accept shared-busy page. In this case, typical code waiting for the page xbusy state to pass is again: VM_OBJECT_WLOCK(object); ... if (vm_page_xbusied(m)) { vm_page_lock(m); VM_OBJECT_WUNLOCK(object); <---1 vm_page_busy_sleep(p, "vmopax"); goto again; } Suppose that the xbusy state owner locked the object, unbusied the page and unlocked the object after we are at the line [1], but before we executed the load of the busy_lock word in vm_page_busy_sleep(). If it happens that there is still no waiters recorded for the busy state, the xbusy owner did not acquired the page lock, so it proceeded. More, suppose that some other thread happen to share-busy the page after xbusy state was relinquished but before the m->busy_lock is read in vm_page_busy_sleep(). Again, that thread only needs vm_object lock to proceed. Then, vm_page_busy_sleep() reads busy_lock value equal to the VPB_SHARERS_WORD(1). In this case, all tests in vm_page_busy_sleep(9) pass and we are going to sleep, despite the page being share-busied. Update check for m->busy_lock == VPB_UNBUSIED in vm_page_busy_sleep(9) to also accept shared-busy state if we only wait for the xbusy state to pass. Merge sequential if()s with the same 'then' clause in vm_page_busy_sleep(). Note that the current code does not share-busy pages from parallel threads, the only way to have more that one sbusy owner is right now is to recurse. Reported and tested by: pho (previous version) Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D8196
# 267ed8e2	11-Oct-2016	Konstantin Belousov <kib@FreeBSD.org>	When downgrading exclusively busied page to shared-busy state, wakeup waiters. Otherwise, owners of the shared-busy state are left blocked and might get into a deadlock. Note that the vm_page_busy_downgrade() function is not used in the tree right now. Reported and tested by: pho (previous version) Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D8195
# 70cf3ced	05-Oct-2016	Alan Cox <alc@FreeBSD.org>	Make the page daemon's notion of what kind of pass is being performed by vm_pageout_scan() local to vm_pageout_worker(). There is no reason to store the pass in the NUMA domain structure. Reviewed by: kib MFC after: 3 weeks
# dbbaf04f	03-Sep-2016	Mark Johnston <markj@FreeBSD.org>	Remove support for idle page zeroing. Idle page zeroing has been disabled by default on all architectures since r170816 and has some bugs that make it seemingly unusable. Specifically, the idle-priority pagezero thread exacerbates contention for the free page lock, and yields the CPU without releasing it in non-preemptive kernels. The pagezero thread also does not behave correctly when superpage reservations are enabled: its target is a function of v_free_count, which includes reserved-but-free pages, but it is only able to zero pages belonging to the physical memory allocator. Reviewed by: alc, imp, kib Differential Revision: https://reviews.freebsd.org/D7714
# 915d1b71	29-Aug-2016	Mark Johnston <markj@FreeBSD.org>	Restore swap pager readahead after r292373. The removal of vm_fault_additional_pages() meant that a hard fault on a swap-backed page would result in only that page being read in. This change implements readahead and readbehind for the swap pager in swap_pager_getpages(). swap_pager_haspage() is modified to return the largest contiguous non-resident range of pages containing the requested range. Reviewed by: alc, kib Tested by: pho MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D7677
# 842ee21e	13-Aug-2016	Mark Johnston <markj@FreeBSD.org>	Strengthen assertions about the busy state of newly-allocated pages. Reviewed by: alc MFC after: 1 week
# 897d0c66	29-Jul-2016	Mark Johnston <markj@FreeBSD.org>	Use vm_page_undirty() instead of manually setting a page field. Reviewed by: alc MFC after: 3 days
# efe1ff4c	23-Jul-2016	Mark Johnston <markj@FreeBSD.org>	Update a comment in vm_page_advise() to match behaviour after r290529. Reviewed by: alc MFC after: 3 days
# 34caa842	07-Jul-2016	Colin Percival <cperciva@FreeBSD.org>	Autotune the number of pages set aside for UMA startup based on the number of CPUs present. On amd64 this unbreaks the boot for systems with 92 or more CPUs; the limit will vary on other systems depending on the size of their uma_zone and uma_cache structures. The major consumer of pages during UMA startup is the 19 zone structures which are set up before UMA has bootstrapped itself sufficiently to use the rest of the available memory: UMA Slabs, UMA Hash, 4 / 6 / 8 / 12 / 16 / 32 / 64 / 128 / 256 Bucket, vmem btag, VM OBJECT, RADIX NODE, MAP, KMAP ENTRY, MAP ENTRY, VMSPACE, and fakepg. If the zone structures occupy more than one page, they will not share pages and the number of pages currently needed for startup is 19 * pages_per_zone + N, where N is the number of pages used for allocating other structures; on amd64 N = 3 at present (2 pages are allocated for UMA Kegs, and one page for UMA Hash). This patch adds a new definition UMA_BOOT_PAGES_ZONES, currently set to 32, and if a zone structure does not fit into a single page sets boot_pages to UMA_BOOT_PAGES_ZONES * pages_per_zone instead of UMA_BOOT_PAGES (which remains at 64). Consequently this patch has no effect on systems where the zone structure fits into 2 or fewer pages (on amd64, 59 or fewer CPUs), but increases boot_pages sufficiently on systems where the large number of CPUs makes this structure larger. It seems safe to assume that systems with 60+ CPUs can afford to set aside an additional 128kB of memory per 32 CPUs. The vm.boot_pages tunable continues to override this computation, but is unlikely to be necessary in the future. Tested on: EC2 x1.32xlarge Relnotes: FreeBSD can now boot on 92+ CPU systems without requiring vm.boot_pages to be manually adjusted. Reviewed by: jeff, alc, adrian Approved by: re (kib)
# 35e8002c	23-Jun-2016	Konstantin Belousov <kib@FreeBSD.org>	In vm_page_xunbusy_maybelocked(), add fast path for unbusy when no waiters exist, same as for vm_page_xunbusy(). If previous value of busy_lock was VPB_SINGLE_EXCLUSIVER, no waiters existed and wakeup is not needed. Move common code from vm_page_xunbusy_maybelocked() and vm_page_xunbusy_hard() to vm_page_xunbusy_locked(). Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Approved by: re (gjb)
# 0a1dc6e2	02-Jun-2016	Mark Johnston <markj@FreeBSD.org>	Reset the page busy lock state after failing to insert into the object. Freeing a shared-busy page is not permitted. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D6670
# e7052969	02-Jun-2016	Mark Johnston <markj@FreeBSD.org>	Don't preserve the page's object linkage in vm_page_insert_after(). Per the KASSERT at the beginning of the function, we expect that the page does not belong to any object, so its object and pindex fields are meaningless. Reset them in the rare case that vm_radix_insert() fails. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D6669
# e5f0191f	01-Jun-2016	Konstantin Belousov <kib@FreeBSD.org>	If the fast path unbusy in vm_page_replace() fails, slow path needs to acquire the page lock, which recurses. Avoid the recursion by reusing the code from vm_page_remove() in a new helper vm_page_xunbusy_maybelocked(). Reviewed by: alc Sponsored by: The FreeBSD Foundation
# 56ce0690	27-May-2016	Alan Cox <alc@FreeBSD.org>	The flag "vm_pages_needed" has long served two distinct purposes: (1) to indicate that threads are waiting for free pages to become available and (2) to indicate whether a wakeup call has been sent to the page daemon. The trouble is that a single flag cannot really serve both purposes, because we have two distinct targets for when to wakeup threads waiting for free pages versus when the page daemon has completed its work. In particular, the flag will be cleared by vm_page_free() before the page daemon has met its target, and this can lead to the OOM killer being invoked prematurely. To address this problem, a new flag "vm_pageout_wanted" is introduced. Discussed with: jeff Reviewed by: kib, markj Tested by: markj Sponsored by: EMC / Isilon Storage Division
# 0e384220	24-May-2016	Konstantin Belousov <kib@FreeBSD.org>	In vm_page_cache(), only drop the vnode after radix insert failure for empty page cache when the object type if OBJT_VNODE. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 30a8a5f7	24-May-2016	Konstantin Belousov <kib@FreeBSD.org>	In vm_page_alloc_contig(), on vm_page_insert() failure, mark each freed page as VPO_UNMANAGED. Otherwise vm_pge_free_toq() insists on owning the page lock. Previously, VPO_UNMANAGED was only set up to the last processed page. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 5a2e650a	19-May-2016	Conrad Meyer <cem@FreeBSD.org>	vm/vm_page.h: Fix trivial '-Wpointer-sign' warning pq_vcnt, as a count of real things, has no business being negative. It is only ever initialized by a u_int counter. The warning came from the atomic_add_int() in vm_pagequeue_cnt_add(). Rectify the warning by changing the variable to u_int. No functional change. Suggested by: Clang 3.3 Sponsored by: EMC / Isilon Storage Division
# 0ef14902	29-Apr-2016	John Baldwin <jhb@FreeBSD.org>	Don't require write locks on the VM object for vm_page_prev/next. Reviewed by: kib Sponsored by: Chelsio Communications
# d9c9c81c	21-Apr-2016	Pedro F. Giffuni <pfg@FreeBSD.org>	sys: use our roundup2/rounddown2() macros when param.h is available. rounddown2 tends to produce longer lines than the original code and when the code has a high indentation level it was not really advantageous to do the replacement. This tries to strike a balance between readability using the macros and flexibility of having the expressions, so not everything is converted.
# b28cc462	09-Feb-2016	Gleb Smirnoff <glebius@FreeBSD.org>	Include sys/_task.h into uma_int.h, so that taskqueue.h isn't a requirement for uma_int.h. Suggested by: jhb
# e60b2fcb	03-Feb-2016	Gleb Smirnoff <glebius@FreeBSD.org>	Redo r292484. Embed task(9) into zone, so that uz_maxaction is called in a context that can sleep, allowing consumers of the KPI to run their drain routines without any extra measures. Discussed with: jtl
# c869e672	19-Dec-2015	Alan Cox <alc@FreeBSD.org>	Introduce a new mechanism for relocating virtual pages to a new physical address and use this mechanism when: 1. kmem_alloc_{attr,contig}() can't find suitable free pages in the physical memory allocator's free page lists. This replaces the long-standing approach of scanning the inactive and inactive queues, converting clean pages into PG_CACHED pages and laundering dirty pages. In contrast, the new mechanism does not use PG_CACHED pages nor does it trigger a large number of I/O operations. 2. on 32-bit MIPS processors, uma_small_alloc() and the pmap can't find free pages in the physical memory allocator's free page lists that are covered by the direct map. Tested by: adrian 3. ttm_bo_global_init() and ttm_vm_page_alloc_dma32() can't find suitable free pages in the physical memory allocator's free page lists. In the coming months, I expect that this new mechanism will be applied in other places. For example, balloon drivers should use relocation to minimize fragmentation of the guest physical address space. Make vm_phys_alloc_contig() a little smarter (and more efficient in some cases). Specifically, use vm_phys_segs[] earlier to avoid scanning free page lists that can't possibly contain suitable pages. Reviewed by: kib, markj Glanced at: jhb Discussed with: jeff Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D4444
# b0cd2017	16-Dec-2015	Gleb Smirnoff <glebius@FreeBSD.org>	A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES(). o With new KPI consumers can request contiguous ranges of pages, and unlike before, all pages will be kept busied on return, like it was done before with the 'reqpage' only. Now the reqpage goes away. With new interface it is easier to implement code protected from race conditions. Such arrayed requests for now should be preceeded by a call to vm_pager_haspage() to make sure that request is possible. This could be improved later, making vm_pager_haspage() obsolete. Strenghtening the promises on the business of the array of pages allows us to remove such hacks as swp_pager_free_nrpage() and vm_pager_free_nonreq(). o New KPI accepts two integer pointers that may optionally point at values for read ahead and read behind, that a pager may do, if it can. These pages are completely owned by pager, and not controlled by the caller. This shifts the UFS-specific readahead logic from vm_fault.c, which should be file system agnostic, into vnode_pager.c. It also removes one VOP_BMAP() request per hard fault. Discussed with: kib, alc, jeff, scottl Sponsored by: Nginx, Inc. Sponsored by: Netflix
# 5e09bdc8	10-Dec-2015	Conrad Meyer <cem@FreeBSD.org>	vm_page_replace: remove redundant radix lookup Remove redundant lookup of the old page from vm_page_replace. Verification that the old page exists is already done by vm_radix_replace. Submitted by: Ryan Libby <rlibby@gmail.com> Reviewed by: alc, kib Sponsored by: EMC / Isilon Storage Division Follow-up to: https://reviews.freebsd.org/D4326 Differential Revision: https://reviews.freebsd.org/D4471
# 7e78597f	07-Nov-2015	Mark Johnston <markj@FreeBSD.org>	Ensure that deactivated pages that are not expected to be reused are reclaimed in FIFO order by the pagedaemon. Previously we would enqueue such pages at the head of the inactive queue, yielding a LIFO reclaim order. Reviewed by: alc MFC after: 2 weeks Sponsored by: EMC / Isilon Storage Division
# bc727596	03-Oct-2015	Alan Cox <alc@FreeBSD.org>	Reduce the scope of a variable to the only file where it is used.
# 3138cd36	30-Sep-2015	Mark Johnston <markj@FreeBSD.org>	As a step towards the elimination of PG_CACHED pages, rework the handling of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to the head of the inactive queue instead of being cached. This affects the implementation of POSIX_FADV_NOREUSE as well, since it works by applying POSIX_FADV_DONTNEED to file ranges after they have been read or written. At that point the corresponding buffers may still be dirty, so the previous implementation would coalesce successive ranges and apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the dirty buffers would eventually be cached. To preserve this behaviour in an efficient manner, this change adds a new buf flag, B_NOREUSE, which causes the pages backing a VMIO buf to be placed at the head of the inactive queue when the buf is released. POSIX_FADV_NOREUSE then works by setting this flag in bufs that underlie the specified range. Reviewed by: alc, kib Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3726
# 15aaea78	22-Sep-2015	Alan Cox <alc@FreeBSD.org>	Change vm_page_unwire() such that it (1) accepts PQ_NONE as the specified queue and (2) returns a Boolean indicating whether the page's wire count transitioned to zero. Exploit this change in vfs_vmio_release() to avoid pointlessly enqueueing a page that is about to be freed. (An earlier version of this change was developed by attilio@ and kmacy@. Any errors in this version are my own.) Reviewed by: kib Sponsored by: EMC / Isilon Storage Division
# c9af644e	17-Sep-2015	Alan Cox <alc@FreeBSD.org>	Eliminate (many) unnecessary calls to pmap_remove_all(). Pages from objects with a reference count of zero can't possibly be mapped, so there is never a need for vm_page_set_invalid() to call pmap_remove_all() on them. Reviewed by: kib MFC after: 1 week Sponsored by: EMC / Isilon Storage Division
# d73ce4c6	10-Sep-2015	Mark Johnston <markj@FreeBSD.org>	Remove the v_cache_min and v_cache_max sysctls. They are unused and have no effect. Reviewed by: alc Sponsored by: EMC / Isilon Storage Division
# c25fabea	27-Aug-2015	Mark Johnston <markj@FreeBSD.org>	Remove weighted page handling from vm_page_advise(). This was added in r51337 as part of the implementation of madvise(MADV_DONTNEED). Its objective was to ensure that the page daemon would eventually reclaim other unreferenced pages (i.e., unreferenced pages not touched by madvise()) from the active queue. Now that the pagedaemon performs steady scanning of the active page queue, this weighted handling is unnecessary. Instead, always "cache" clean pages by moving them to the head of the inactive page queue. This simplifies the implementation of vm_page_advise() and eliminates the fragmentation that resulted from the distribution of pages among multiple queues. Suggested by: alc Reviewed by: alc Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3401
# 52afd687	19-Aug-2015	Andrew Turner <andrew@FreeBSD.org>	Add the kernel support for minidumps on arm64. Obtained from: ABT Systems Ltd Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D3318
# 966272ca3	07-Jun-2015	Alan Cox <alc@FreeBSD.org>	Retire VM_FREEPOOL_CACHE as the next step in eliminating PG_CACHE pages. Differential Revision: https://reviews.freebsd.org/D2712 Reviewed by: kib Sponsored by: EMC / Isilon Storage Division
# c4e49ba4	30-May-2015	Alan Cox <alc@FreeBSD.org>	Document vm_page_alloc_contig()'s support for the VM_ALLOC_NODUMP option. MFC after: 3 days
# 2c20bd8b	20-May-2015	Konstantin Belousov <kib@FreeBSD.org>	Do grammar fix in the comment to record the right commit message for r283162. Fix a cosmetic issue with vm_page_alloc() calling vm_page_free_toq() with the page not completely satisfying vm_page_free() assertions. The page is not owned by the object, since insertion failed. But besides m->object reset to NULL, we should also set VPO_UNMANAGED flag for consistency. Reported by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week
# da474990	20-May-2015	Konstantin Belousov <kib@FreeBSD.org>	Remove the write-only variable phent. We currently do not check the size of the program header's entries. Reported by: adrian (by using gcc 4.9) Sponsored by: The FreeBSD Foundation MFC after: 1 week
# ed95805e	30-Apr-2015	John Baldwin <jhb@FreeBSD.org>	Remove support for Xen PV domU kernels. Support for HVM domU kernels remains. Xen is planning to phase out support for PV upstream since it is harder to maintain and has more overhead. Modern x86 CPUs include virtualization extensions that support HVM guests instead of PV guests. In addition, the PV code was i386 only and not as well maintained recently as the HVM code. - Remove the i386-only NATIVE option that was used to disable certain components for PV kernels. These components are now standard as they are on amd64. - Remove !XENHVM bits from PV drivers. - Remove various shims required for XEN (e.g. PT_UPDATES_FLUSH, LOAD_CR3, etc.) - Remove duplicate copy of <xen/features.h>. - Remove unused, i386-only xenstored.h. Differential Revision: https://reviews.freebsd.org/D2362 Reviewed by: royger Tested by: royger (i386/amd64 HVM domU and amd64 PVH dom0) Relnotes: yes
# affc4a4b	29-Apr-2015	Scott Long <scottl@FreeBSD.org>	Improve support for blacklisting bad memory locations. The user can supply a text file with a list of physical memory addresses to exclude, and have it loaded at boot time via the provided example in loader.conf. The tunable 'vm.blacklist' remains, but using an external file means that there's no practical limit to the size of the list. This change also improves the scanning algorithm for processing the list, scanning the list only once instead of scanning it for every page in the system. Both the sysctl and the file can be unsorted and contain duplicates so long as each entry is numeric (decimal or hex) and is separated by a space, comma, or newline character. The sysctl 'vm.page_blacklist' is now provided to report what memory locations were successfully excluded. Reviewed by: imp, emax Obtained from: Netflix, Inc. MFC after: 3 days
# b575067a	25-Mar-2015	Rui Paulo <rpaulo@FreeBSD.org>	Add comments about CTLFLAG_RDTUN vs. TUNABLE_INT_FETCH. Requested by: julian
# 57e5a8b1	24-Mar-2015	Rui Paulo <rpaulo@FreeBSD.org>	Use TUNABLE_INT_FETCH for boot_pages. vm.boot_pages is marked as a CTLFLAG_RDTUN, but it's used by the VM before the sysctl subsystem is initialsed. We manually fetch the variable from the environment to work around this problem. Tested by: Keith White kwhite at uottawa.ca MFC after: 1 week
# b0bce0ae	24-Mar-2015	Rui Paulo <rpaulo@FreeBSD.org>	Remove whitespace.
# e3ed82bc	22-Dec-2014	Gleb Smirnoff <glebius@FreeBSD.org>	Add flag VM_ALLOC_NOWAIT for vm_page_grab() that prevents sleeping and allows the function to fail. Reviewed by: kib, alc Sponsored by: Nginx, Inc.
# 6ee80f25	22-Dec-2014	Gleb Smirnoff <glebius@FreeBSD.org>	Do not clear flag that vm_page_alloc() doesn't support. Submitted by: kib
# 271f0f12	15-Nov-2014	Alan Cox <alc@FreeBSD.org>	Enable the use of VM_PHYSSEG_SPARSE on amd64 and i386, making it the default on i386 PAE. Previously, VM_PHYSSEG_SPARSE could not be used on amd64 and i386 because vm_page_startup() would not create vm_page structures for the kernel page table pages allocated during pmap_bootstrap() but those vm_page structures are needed when the kernel attempts to promote the corresponding kernel virtual addresses to superpage mappings. To address this problem, a new public function, vm_phys_add_seg(), is introduced and vm_phys_init() is updated to reflect the creation of vm_phys_seg structures by calls to vm_phys_add_seg(). Discussed with: Svatopluk Kraus MFC after: 3 weeks Sponsored by: EMC / Isilon Storage Division
# 5e929009	04-Nov-2014	Alan Cox <alc@FreeBSD.org>	Eliminate a stale, i386-specific comment.
# 2be111bf	16-Oct-2014	Davide Italiano <davide@FreeBSD.org>	Follow up to r225617. In order to maximize the re-usability of kernel code in userland rename in-kernel getenv()/setenv() to kern_setenv()/kern_getenv(). This fixes a namespace collision with libc symbols. Submitted by: kmacy Tested by: make universe
# 1a83a822	29-Aug-2014	John Baldwin <jhb@FreeBSD.org>	Fix a typo.
# afb69e6b	08-Aug-2014	Konstantin Belousov <kib@FreeBSD.org>	Adapt vm_page_aflag_set(PGA_WRITEABLE) to the locking of pmap_enter(PMAP_ENTER_NOSLEEP). The PGA_WRITEABLE flag can be set when either the page is busied, or the owner object is locked. Update comments, move all assertions about page state when PGA_WRITEABLE flag is set, into new helper vm_page_assert_pga_writeable(). Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# af3b2549	27-Jun-2014	Hans Petter Selasky <hselasky@FreeBSD.org>	Pull in r267961 and r267973 again. Fix for issues reported will follow.
# 37a107a4	27-Jun-2014	Glen Barber <gjb@FreeBSD.org>	Revert r267961, r267973: These changes prevent sysctl(8) from returning proper output, such as: 1) no output from sysctl(8) 2) erroneously returning ENOMEM with tools like truss(1) or uname(1) truss: can not get etype: Cannot allocate memory
# 3da1cf1e	27-Jun-2014	Hans Petter Selasky <hselasky@FreeBSD.org>	Extend the meaning of the CTLFLAG_TUN flag to automatically check if there is an environment variable which shall initialize the SYSCTL during early boot. This works for all SYSCTL types both statically and dynamically created ones, except for the SYSCTL NODE type and SYSCTLs which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to be used in the case a tunable sysctl has a custom initialisation function allowing the sysctl to still be marked as a tunable. The kernel SYSCTL API is mostly the same, with a few exceptions for some special operations like iterating childrens of a static/extern SYSCTL node. This operation should probably be made into a factored out common macro, hence some device drivers use this. The reason for changing the SYSCTL API was the need for a SYSCTL parent OID pointer and not only the SYSCTL parent OID list pointer in order to quickly generate the sysctl path. The motivation behind this patch is to avoid parameter loading cludges inside the OFED driver subsystem. Instead of adding special code to the OFED driver subsystem to post-load tunables into dynamically created sysctls, we generalize this in the kernel. Other changes: - Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask" to "hw.pcic.intr_mask". - Removed redundant TUNABLE statements throughout the kernel. - Some minor code rewrites in connection to removing not needed TUNABLE statements. - Added a missing SYSCTL_DECL(). - Wrapped two very long lines. - Avoid malloc()/free() inside sysctl string handling, in case it is called to initialize a sysctl from a tunable, hence malloc()/free() is not ready when sysctls from the sysctl dataset are registered. - Bumped FreeBSD version to indicate SYSCTL API change. MFC after: 2 weeks Sponsored by: Mellanox Technologies
# 3ae10f74	16-Jun-2014	Attilio Rao <attilio@FreeBSD.org>	- Modify vm_page_unwire() and vm_page_enqueue() to directly accept the queue where to enqueue pages that are going to be unwired. - Add stronger checks to the enqueue/dequeue for the pagequeues when adding and removing pages to them. Of course, for unmanaged pages the queue parameter of vm_page_unwire() will be ignored, just as the active parameter today. This makes adding new pagequeues quicker. This change effectively modifies the KPI. __FreeBSD_version will be, however, bumped just when the full cache of free pages will be evicted. Sponsored by: EMC / Isilon storage division Reviewed by: alc Tested by: pho
# dd05fa19	07-Jun-2014	Alan Cox <alc@FreeBSD.org>	Add a page size field to struct vm_page. Increase the page size field when a partially populated reservation becomes fully populated, and decrease this field when a fully populated reservation becomes partially populated. Use this field to simplify the implementation of pmap_enter_object() on amd64, arm, and i386. On all architectures where we support superpages, the cost of creating a superpage mapping is roughly the same as creating a base page mapping. For example, both kinds of mappings entail the creation of a single PTE and PV entry. With this in mind, use the page size field to make the implementation of vm_map_pmap_enter(..., MAP_PREFAULT_PARTIAL) a little smarter. Previously, if MAP_PREFAULT_PARTIAL was specified to vm_map_pmap_enter(), that function would only map base pages. Now, it will create up to 96 base page or superpage mappings. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division
# 44f1c916	22-Mar-2014	Bryan Drewery <bdrewery@FreeBSD.org>	Rename global cnt to vm_cnt to avoid shadowing. To reduce the diff struct pcu.cnt field was not renamed, so PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in kvm(3) and vmstat(8). The goal was to not affect externally used KPI. Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the the global cnt variable. Exp-run revealed no ports using it directly. No objection from: arch@ Sponsored by: EMC / Isilon Storage Division
# 793d1407	24-Jan-2014	Alan Cox <alc@FreeBSD.org>	In an effort to diagnose possible corruption of struct vm_page on some sparc64 machines make the page queue assert in vm_page_dequeue() more precise. While I'm here switch the page lock assert to the newer style.
# 000fb817	31-Dec-2013	Alan Cox <alc@FreeBSD.org>	Since the introduction of the popmap to reservations in r259999, there is no longer any need for the page's PG_CACHED and PG_FREE flags to be set and cleared while the free page queues lock is held. Thus, vm_page_alloc(), vm_page_alloc_contig(), and vm_page_alloc_freelist() can wait until after the free page queues lock is released to clear the page's flags. Moreover, the PG_FREE flag can be retired. Now that the reservation system no longer uses it, its only uses are in a few assertions. Eliminating these assertions is no real loss. Other assertions catch the same types of misbehavior, like doubly freeing a page (see r260032) or dirtying a free page (free pages are invalid and only valid pages can be dirtied). Eliminate an unneeded variable from vm_page_alloc_contig(). Sponsored by: EMC / Isilon Storage Division
# 703b304f	08-Dec-2013	Alan Cox <alc@FreeBSD.org>	Eliminate a redundant parameter to vm_radix_replace(). Improve the wording of the comment describing vm_radix_replace(). Reviewed by: attilio MFC after: 6 weeks Sponsored by: EMC / Isilon Storage Division
# 9eab5484	17-Sep-2013	Konstantin Belousov <kib@FreeBSD.org>	PG_SLAB no longer serves a useful purpose, since m->object is no longer abused to store pointer to slab. Remove it. Reviewed by: alc Sponsored by: The FreeBSD Foundation Approved by: re (hrs)
# 3846a822	16-Sep-2013	Konstantin Belousov <kib@FreeBSD.org>	Remove zero-copy sockets code. It only worked for anonymous memory, and the equivalent functionality is now provided by sendfile(2) over posix shared memory filedescriptor. Remove the cow member of struct vm_page, and rearrange the remaining members. While there, make hold_count unsigned. Requested and reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation Approved by: re (delphij)
# 196beb53	14-Sep-2013	Konstantin Belousov <kib@FreeBSD.org>	If the last page of the file is partially full and whole valid portion is invalidated, invalidate the whole page. Otherwise, partially valid page appears on a page queue, which is wrong. This could only happen for the last page, because only then buffer which triggered invalidation could not cover the whole page. Reported and tested by: pho (previous version) Reviewed by: alc Sponsored by: The FreeBSD Foundation Approved by: re (delphij) MFC after: 2 weeks
# 7a4b2bc5	04-Sep-2013	Konstantin Belousov <kib@FreeBSD.org>	The vm_page_trysbusy() should not fail when shared busy counter or VPB_BIT_WAITERS flag were changed between reading of busy_lock and the cas. The vm_page_sbusy(), which is the only user of vm_page_trysbusy() in the tree, panics on the failure, which in these cases is transient and do not mean that the current page state prevents sbusying. Retry the operation inside vm_page_trysbusy() if cas failed, only return a failure when VPB_BIT_SHARED is cleared. Reported and tested by: pho Reviewed by: attilio Sponsored by: The FreeBSD Foundation
# 51321f7c	29-Aug-2013	Alan Cox <alc@FreeBSD.org>	Significantly reduce the cost, i.e., run time, of calls to madvise(..., MADV_DONTNEED) and madvise(..., MADV_FREE). Specifically, introduce a new pmap function, pmap_advise(), that operates on a range of virtual addresses within the specified pmap, allowing for a more efficient implementation of MADV_DONTNEED and MADV_FREE. Previously, the implementation of MADV_DONTNEED and MADV_FREE relied on per-page pmap operations, such as pmap_clear_reference(). Intuitively, the problem with this implementation is that the pmap-level locks are acquired and released and the page table traversed repeatedly, once for each resident page in the range that was specified to madvise(2). A more subtle flaw with the previous implementation is that pmap_clear_reference() would clear the reference bit on all mappings to the specified page, not just the mapping in the range specified to madvise(2). Since our malloc(3) makes heavy use of madvise(2), this change can have a measureable impact. For example, the system time for completing a parallel "buildworld" on a 6-core amd64 machine was reduced by about 1.5% to 2.0%. Note: This change only contains pmap_advise() implementations for a subset of our supported architectures. I will commit implementations for the remaining architectures after further testing. For now, a stub function is sufficient because of the advisory nature of pmap_advise(). Discussed with: jeff, jhb, kib Tested by: pho (i386), marcel (ia64) Sponsored by: EMC / Isilon Storage Division
# 133dae88	26-Aug-2013	Gleb Smirnoff <glebius@FreeBSD.org>	Remove comment that is no longer relevant since r254182.
# 776cad90	23-Aug-2013	Alan Cox <alc@FreeBSD.org>	Addendum to r254141: The call to vm_radix_insert() in vm_page_cache() can reclaim the last preexisting cached page in the object, resulting in a call to vdrop(). Detect this scenario so that the vnode's hold count is correctly maintained. Otherwise, we panic. Reported by: scottl Tested by: pho Discussed with: attilio, jeff, kib
# 5944de8e	22-Aug-2013	Konstantin Belousov <kib@FreeBSD.org>	Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9). The flag was mandatory since r209792, where vm_page_grab(9) was changed to only support the alloc retry semantic. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation
# 28a288cb	21-Aug-2013	Alan Cox <alc@FreeBSD.org>	Addendum to r254141: Allow recursion on the free pages queues lock in vm_page_alloc_freelist(). Reported and tested by: sbruno Sponsored by: EMC / Isilon Storage Division
# a834cbae	15-Aug-2013	Attilio Rao <attilio@FreeBSD.org>	On the recovery path for vm_page_alloc(), if a page had been requested wired, unwind back the wiring bits otherwise we can end up freeing a page that is considered wired. Sponsored by: EMC / Isilon storage division Reported by: alc
# d9e23210	13-Aug-2013	Jeff Roberson <jeff@FreeBSD.org>	Improve pageout flow control to wakeup more frequently and do less work while maintaining better LRU of active pages. - Change v_free_target to include the quantity previously represented by v_cache_min so we don't need to add them together everywhere we use them. - Add a pageout_wakeup_thresh that sets the free page count trigger for waking the page daemon. Set this 10% above v_free_min so we wakeup before any phase transitions in vm users. - Adjust down v_free_target now that we're willing to accept more pagedaemon wakeups. This means we process fewer pages in one iteration as well, leading to shorter lock hold times and less overall disruption. - Eliminate vm_pageout_page_stats(). This was a minor variation on the PQ_ACTIVE segment of the normal pageout daemon. Instead we now process 1 / vm_pageout_update_period pages every second. This causes us to visit the whole active list every 60 seconds. Previously we would only maintain the active LRU when we were short on pages which would mean it could be woefully out of date. Reviewed by: alc (slight variant of this) Discussed with: alc, kib, jhb Sponsored by: EMC / Isilon Storage Division
# 60068841	11-Aug-2013	Attilio Rao <attilio@FreeBSD.org>	Correct the recovery logic in vm_page_alloc_contig: what is really needed on this code snipped is that all the pages that are already fully inserted gets fully freed, while for the others the object removal itself might be skipped, hence the object might be set to NULL. Sponsored by: EMC / Isilon storage division Reported by: alc, kib Reviewed by: alc
# c325e866	10-Aug-2013	Konstantin Belousov <kib@FreeBSD.org>	Different consumers of the struct vm_page abuse pageq member to keep additional information, when the page is guaranteed to not belong to a paging queue. Usually, this results in a lot of type casts which make reasoning about the code correctness harder. Sometimes m->object is used instead of pageq, which could cause real and confusing bugs if non-NULL m->object is leaked. See r141955 and r253140 for examples. Change the pageq member into a union containing explicitly-typed members. Use them instead of type-punning or abusing m->object in x86 pmaps, uma and vm_page_alloc_contig(). Requested and reviewed by: alc Sponsored by: The FreeBSD Foundation
# cdc00bf7	09-Aug-2013	John Baldwin <jhb@FreeBSD.org>	Revert the addition of VPO_BUSY and instead update vm_page_replace() to properly unbusy the page. Submitted by: alc
# e946b949	09-Aug-2013	Attilio Rao <attilio@FreeBSD.org>	On all the architectures, avoid to preallocate the physical memory for nodes used in vm_radix. On architectures supporting direct mapping, also avoid to pre-allocate the KVA for such nodes. In order to do so make the operations derived from vm_radix_insert() to fail and handle all the deriving failure of those. vm_radix-wise introduce a new function called vm_radix_replace(), which can replace a leaf node, already present, with a new one, and take into account the possibility, during vm_radix_insert() allocation, that the operations on the radix trie can recurse. This means that if operations in vm_radix_insert() recursed vm_radix_insert() will start from scratch again. Sponsored by: EMC / Isilon storage division Reviewed by: alc (older version) Reviewed by: jeff Tested by: pho, scottl
# c7aebda8	09-Aug-2013	Attilio Rao <attilio@FreeBSD.org>	The soft and hard busy mechanism rely on the vm object lock to work. Unify the 2 concept into a real, minimal, sxlock where the shared acquisition represent the soft busy and the exclusive acquisition represent the hard busy. The old VPO_WANTED mechanism becames the hard-path for this new lock and it becomes per-page rather than per-object. The vm_object lock becames an interlock for this functionality: it can be held in both read or write mode. However, if the vm_object lock is held in read mode while acquiring or releasing the busy state, the thread owner cannot make any assumption on the busy state unless it is also busying it. Also: - Add a new flag to directly shared busy pages while vm_page_alloc and vm_page_grab are being executed. This will be very helpful once these functions happen under a read object lock. - Move the swapping sleep into its own per-object flag The KPI is heavilly changed this is why the version is bumped. It is very likely that some VM ports users will need to change their own code. Sponsored by: EMC / Isilon storage division Discussed with: alc Reviewed by: jeff, kib Tested by: gavin, bapt (older version) Tested by: pho, scottl
# 449c2e92	07-Aug-2013	Konstantin Belousov <kib@FreeBSD.org>	Split the pagequeues per NUMA domains, and split pageademon process into threads each processing queue in a single domain. The structure of the pagedaemons and queues is kept intact, most of the changes come from the need for code to find an owning page queue for given page, calculated from the segment containing the page. The tie between NUMA domain and pagedaemon thread/pagequeue split is rather arbitrary, the multithreaded daemon could be allowed for the single-domain machines, or one domain might be split into several page domains, to further increase concurrency. Right now, each pagedaemon thread tries to reach the global target, precalculated at the start of the pass. This is not optimal, since it could cause excessive page deactivation and freeing. The code should be changed to re-check the global page deficit state in the loop after some number of iterations. The pagedaemons reach the quorum before starting the OOM, since one thread inability to meet the target is normal for split queues. Only when all pagedaemons fail to produce enough reusable pages, OOM is started by single selected thread. Launder is modified to take into account the segments layout with regard to the region for which cleaning is performed. Based on the preliminary patch by jeff, sponsored by EMC / Isilon Storage Division. Reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation
# 3abeb811	10-Jul-2013	Konstantin Belousov <kib@FreeBSD.org>	In the vm_page_set_invalid() function, do not assert that the page is not busy, since its only caller brelse() can legitimately call it on busy page. This happens for VOP_PUTPAGES() on filesystems that use buffers and which VOP_WRITE() method marked the buffer containing page as non-cacheable. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 6b5fbc12	03-Jul-2013	Neel Natu <neel@FreeBSD.org>	vm_phys_fictitious_reg_range() was losing the 'memattr' because it would be reset by pmap_page_init() right after being initialized in vm_page_initfake(). The statement above is with reference to the amd64 implementation of pmap_page_init(). Fix this by calling 'pmap_page_init()' in 'vm_page_initfake()' before changing the 'memattr'. Reviewed by: kib MFC after: 2 weeks
# 4aa4cd8e	24-Jun-2013	Gleb Smirnoff <glebius@FreeBSD.org>	Typo in comment.
# 2051980f	09-Jun-2013	Alan Cox <alc@FreeBSD.org>	Revise the interface between vm_object_madvise() and vm_page_dontneed() so that pointless calls to pmap_is_modified() can be easily avoided when performing madvise(..., MADV_FREE). Sponsored by: EMC / Isilon Storage Division
# be6ec553	03-Jun-2013	Konstantin Belousov <kib@FreeBSD.org>	Remove irrelevant comments. Discussed with: alc MFC after: 3 days
# b4171812	02-Jun-2013	Alan Cox <alc@FreeBSD.org>	Require that the page lock is held, instead of the object lock, when clearing the page's PGA_REFERENCED flag. Since we are typically manipulating the page's act_count field when we are clearing its PGA_REFERENCED flag, the page lock is already held everywhere that we clear the PGA_REFERENCED flag. So, in fact, this revision only changes some comments and an assertion. Nonetheless, it will enable later changes to object locking in the pageout code. Introduce vm_page_assert_locked(), which completely hides the implementation details of the page lock from the caller, and use it in vm_page_aflag_clear(). (The existing vm_page_lock_assert() could not be used in vm_page_aflag_clear().) Over the coming weeks, I expect that we'll either eliminate or replace the various uses of vm_page_lock_assert() with vm_page_assert_locked(). Reviewed by: attilio Sponsored by: EMC / Isilon Storage Division
# b4e49807	01-Jun-2013	Alan Cox <alc@FreeBSD.org>	Now that access to the page's "act_count" field is synchronized by the page lock instead of the object lock, there is no reason for vm_page_activate() to assert that the object is locked for either read or write access. (The "VPO_UNMANAGED" flag never changes after page allocation.) Sponsored by: EMC / Isilon Storage Division
# 9af6d512	21-May-2013	Attilio Rao <attilio@FreeBSD.org>	o Relax locking assertions for vm_page_find_least() o Relax locking assertions for pmap_enter_object() and add them also to architectures that currently don't have any o Introduce VM_OBJECT_LOCK_DOWNGRADE() which is basically a downgrade operation on the per-object rwlock o Use all the mechanisms above to make vm_map_pmap_enter() to work mostl of the times only with readlocks. Sponsored by: EMC / Isilon storage division Reviewed by: alc
# 4fab678b	21-May-2013	Konstantin Belousov <kib@FreeBSD.org>	Add ddb command 'show pginfo' which provides useful information about a vm page, denoted either by an address of the struct vm_page, or, if the '/p' modifier is specified, by a physical address of the corresponding frame. Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 767a6420	17-May-2013	Alan Cox <alc@FreeBSD.org>	Relax the object locking assertion in vm_page_lookup(). Now that a radix tree is used to maintain the object's collection of resident pages, vm_page_lookup() no longer needs an exclusive lock. Reviewed by: attilio Sponsored by: EMC / Isilon Storage Division
# df839389	13-May-2013	Peter Wemm <peter@FreeBSD.org>	Bandaid for compiling with gcc, which happens to be the default compiler for a number of platforms still.
# 404eb1b3	12-May-2013	Alan Cox <alc@FreeBSD.org>	Refactor vm_page_alloc()'s interactions with vm_reserv_alloc_page() and vm_page_insert() so that (1) vm_radix_lookup_le() is never called while the free page queues lock is held and (2) vm_radix_lookup_le() is called at most once. This change reduces the average time that the free page queues lock is held by vm_page_alloc() as well as vm_page_alloc()'s average overall running time. Sponsored by: EMC / Isilon Storage Division
# 774d251d	17-Mar-2013	Attilio Rao <attilio@FreeBSD.org>	Sync back vmcontention branch into HEAD: Replace the per-object resident and cached pages splay tree with a path-compressed multi-digit radix trie. Along with this, switch also the x86-specific handling of idle page tables to using the radix trie. This change is supposed to do the following: - Allowing the acquisition of read locking for lookup operations of the resident/cached pages collections as the per-vm_page_t splay iterators are now removed. - Increase the scalability of the operations on the page collections. The radix trie does rely on the consumers locking to ensure atomicity of its operations. In order to avoid deadlocks the bisection nodes are pre-allocated in the UMA zone. This can be done safely because the algorithm needs at maximum one new node per insert which means the maximum number of the desired nodes is the number of available physical frames themselves. However, not all the times a new bisection node is really needed. The radix trie implements path-compression because UFS indirect blocks can lead to several objects with a very sparse trie, increasing the number of levels to usually scan. It also helps in the nodes pre-fetching by introducing the single node per-insert property. This code is not generalized (yet) because of the possible loss of performance by having much of the sizes in play configurable. However, efforts to make this code more general and then reusable in further different consumers might be really done. The only KPI change is the removal of the function vm_page_splay() which is now reaped. The only KBI change, instead, is the removal of the left/right iterators from struct vm_page, which are now reaped. Further technical notes broken into mealpieces can be retrieved from the svn branch: http://svn.freebsd.org/base/user/attilio/vmcontention/ Sponsored by: EMC / Isilon storage division In collaboration with: alc, jeff Tested by: flo, pho, jhb, davide Tested by: ian (arm) Tested by: andreast (powerpc)
# 4bc80a34	11-Mar-2013	Attilio Rao <attilio@FreeBSD.org>	Simplify vm_page_is_valid(). Sponsored by: EMC / Isilon storage division Reviewed by: alc
# 34496b53	09-Mar-2013	Alan Cox <alc@FreeBSD.org>	Update a comment: The object lock is no longer a mutex.
# 89f6b863	08-Mar-2013	Attilio Rao <attilio@FreeBSD.org>	Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
# c9341161	08-Mar-2013	Attilio Rao <attilio@FreeBSD.org>	Merge from vmc-playground: Introduce a new KPI that verifies if the page cache is empty for a specified vm_object. This KPI does not make assumptions about the locking in order to be used also for building assertions at init and destroy time. It is mostly used to hide implementation details of the page cache. Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: alc (vm_radix based version) Tested by: flo, pho, jhb, davide
# 0dde287b	26-Feb-2013	Attilio Rao <attilio@FreeBSD.org>	Wrap the sleeps synchronized by the vm_object lock into the specific macro VM_OBJECT_SLEEP(). This hides some implementation details like the usage of the msleep() primitive and the necessity to access to the lock address directly. For this reason VM_OBJECT_MTX() macro is now retired. Sponsored by: EMC / Isilon storage division Reviewed by: alc Tested by: pho
# 28634820	08-Dec-2012	Alan Cox <alc@FreeBSD.org>	In the past four years, we've added two new vm object types. Each time, similar changes had to be made in various places throughout the machine- independent virtual memory layer to support the new vm object type. However, in most of these places, it's actually not the type of the vm object that matters to us but instead certain attributes of its pages. For example, OBJT_DEVICE, OBJT_MGTDEVICE, and OBJT_SG objects contain fictitious pages. In other words, in most of these places, we were testing the vm object's type to determine if it contained fictitious (or unmanaged) pages. To both simplify the code in these places and make the addition of future vm object types easier, this change introduces two new vm object flags that describe attributes of the vm object's pages, specifically, whether they are fictitious or unmanaged. Reviewed and tested by: kib
# 0d69690e	20-Nov-2012	Alan Cox <alc@FreeBSD.org>	Correct an error in r230623. When both VM_ALLOC_NODUMP and VM_ALLOC_ZERO were specified to vm_page_alloc(), PG_NODUMP wasn't being set on the allocated page when it happened to be pre-zeroed. MFC after: 5 days
# 8d220203	12-Nov-2012	Alan Cox <alc@FreeBSD.org>	Replace the single, global page queues lock with per-queue locks on the active and inactive paging queues. Reviewed by: kib
# 9fc4739d	01-Nov-2012	Alan Cox <alc@FreeBSD.org>	In general, we call pmap_remove_all() before calling vm_page_cache(). So, the call to pmap_remove_all() within vm_page_cache() is usually redundant. This change eliminates that call to pmap_remove_all() and introduces a call to pmap_remove_all() before vm_page_cache() in the one place where it didn't already exist. When iterating over a paging queue, if the object containing the current page has a zero reference count, then the page can't have any managed mappings. So, a call to pmap_remove_all() is pointless. Change a panic() call in vm_page_cache() to a KASSERT(). MFC after: 6 weeks
# 4ceaf45d	31-Oct-2012	Attilio Rao <attilio@FreeBSD.org>	Rework the known mutexes to benefit about staying on their own cache line in order to avoid manual frobbing but using struct mtx_padalign. The sole exception being nvme and sxfge drivers, where the author redefined CACHE_LINE_SIZE manually, so they need to be analyzed and dealt with separately. Reviwed by: jimharris, alc
# 081a4881	29-Oct-2012	Alan Cox <alc@FreeBSD.org>	Replace the page hold queue, PQ_HOLD, by a new page flag, PG_UNHOLDFREE, because the queue itself serves no purpose. When a held page is freed, inserting the page into the hold queue has the side effect of setting the page's "queue" field to PQ_HOLD. Later, when the page is unheld, it will be freed because the "queue" field is PQ_HOLD. In other words, PQ_HOLD is used as a flag, not a queue. So, this change replaces it with a flag. To accomodate the new page flag, make the page's "flags" field wider and "oflags" field narrower. Reviewed by: kib
# 7ecfabc7	13-Oct-2012	Alan Cox <alc@FreeBSD.org>	Move vm_page_requeue() to the only file that uses it. MFC after: 3 weeks
# 9af47af6	13-Oct-2012	Alan Cox <alc@FreeBSD.org>	Eliminate the conditional for releasing the page queues lock in vm_page_sleep(). vm_page_sleep() is no longer called with this lock held. Eliminate assertions that the page queues lock is NOT held. These assertions won't translate well to having distinct locks on the active and inactive page queues, and they really aren't that useful. MFC after: 3 weeks
# 4db2c4b8	02-Oct-2012	Alan Cox <alc@FreeBSD.org>	Tidy up a bit: Update some of the comments. In particular, use "sleep" in preference to "block" where appropriate. Eliminate some unnecessary casts. Make a few whitespace changes for consistency. Reviewed by: kib MFC after: 3 days
# b6c00483	14-Aug-2012	Konstantin Belousov <kib@FreeBSD.org>	Do not leave invalid pages in the object after the short read for a network file systems (not only NFS proper). Short reads cause pages other then the requested one, which were not filled by read response, to stay invalid. Change the vm_page_readahead_finish() interface to not take the error code, but instead to make a decision to free or to (de)activate the page only by its validity. As result, not requested invalid pages are freed even if the read RPC indicated success. Noted and reviewed by: alc MFC after: 1 week
# 0055cbd3	04-Aug-2012	Konstantin Belousov <kib@FreeBSD.org>	Reduce code duplication and exposure of direct access to struct vm_page oflags by providing helper function vm_page_readahead_finish(), which handles completed reads for pages with indexes other then the requested one, for VOP_GETPAGES(). Reviewed by: alc MFC after: 1 week
# 369763e3	02-Aug-2012	Alan Cox <alc@FreeBSD.org>	Inline vm_page_aflags_clear() and vm_page_aflags_set(). Add comments stating that neither these functions nor the flags that they are used to manipulate are part of the KBI.
# da1ab8a4	16-Jul-2012	Alan Cox <alc@FreeBSD.org>	Correct vm_page_alloc_contig()'s implementation of VM_ALLOC_NODUMP.
# e30df26e	26-Jun-2012	Alan Cox <alc@FreeBSD.org>	Add new pmap layer locks to the predefined lock order. Change the names of a few existing VM locks to follow a consistent naming scheme.
# eddc9291	20-Jun-2012	Alan Cox <alc@FreeBSD.org>	Selectively inline vm_page_dirty().
# 6031c68d	16-Jun-2012	Alan Cox <alc@FreeBSD.org>	The page flag PGA_WRITEABLE is set and cleared exclusively by the pmap layer, but it is read directly by the MI VM layer. This change introduces pmap_page_is_write_mapped() in order to completely encapsulate all direct access to PGA_WRITEABLE in the pmap layer. Aesthetics aside, I am making this change because amd64 will likely begin using an alternative method to track write mappings, and having pmap_page_is_write_mapped() in place allows me to make such a change without further modification to the MI VM layer. As an added bonus, tidy up some nearby comments concerning page flags. Reviewed by: kib MFC after: 6 weeks
# c415e172	22-May-2012	Andrew Turner <andrew@FreeBSD.org>	Fix booting on ARM. In PHYS_TO_VM_PAGE() when VM_PHYSSEG_DENSE is set the check if we are past the end of vm_page_array was incorrect causing it to return NULL. This value is then used in vm_phys_add_page causing a data abort. Reviewed by: alc, kib, imp Tested by: stas
# ccc4a5c7	20-May-2012	Nathan Whitehorn <nwhitehorn@FreeBSD.org>	Replace the list of PVOs owned by each PMAP with an RB tree. This simplifies range operations like pmap_remove() and pmap_protect() as well as allowing simple operations like pmap_extract() not to involve any global state. This substantially reduces lock coverages for the global table lock and improves concurrency.
# b6de32bd	12-May-2012	Konstantin Belousov <kib@FreeBSD.org>	Add a facility to register a range of physical addresses to be used for allocation of fictitious pages, for which PHYS_TO_VM_PAGE() returns proper fictitious vm_page_t. The range should be de-registered after consumer stopped using it. De-inline the PHYS_TO_VM_PAGE() since it now carries code to iterate over registered ranges. A hash container might be developed instead of range registration interface, and fake pages could be put automatically into the hash, were PHYS_TO_VM_PAGE() could look them up later. This should be considered before the MFC of the commit is done. Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month
# e461aae7	12-May-2012	Konstantin Belousov <kib@FreeBSD.org>	Split the code from vm_page_getfake() to initialize the fake page struct vm_page into new interface vm_page_initfake(). Handle the case of fake page re-initialization with changed memattr. Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month
# 116c2135	12-May-2012	Konstantin Belousov <kib@FreeBSD.org>	Assert that the page passed to vm_page_putfake() is unmanaged. Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month
# 0c26bb71	12-May-2012	Konstantin Belousov <kib@FreeBSD.org>	Make the vm_page_array_size long. Remove redundand zero initialization for vm_page_array_size and nearby variablees. Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month
# 0b852c03	22-Apr-2012	Nathan Whitehorn <nwhitehorn@FreeBSD.org>	Avoid a lock order reversal in pmap_extract_and_hold() from relocking the page. This PMAP requires an additional lock besides the PMAP lock in pmap_extract_and_hold(), which vm_page_pa_tryrelock() did not release. Suggested by: kib MFC after: 4 days
# 2aa163dc	21-Apr-2012	Alan Cox <alc@FreeBSD.org>	As documented in vm_page.h, updates to the vm_page's flags no longer require the page queues lock. MFC after: 1 week
# a0f2c37b	09-Apr-2012	Attilio Rao <attilio@FreeBSD.org>	- Introduce a cache-miss optimization for consistency with other accesses of the cache member of vm_object objects. - Use novel vm_page_is_cached() for checks outside of the vm subsystem. Reviewed by: alc MFC after: 2 weeks X-MFC: r234039
# 1c8279e4	08-Apr-2012	Alan Cox <alc@FreeBSD.org>	Fix mincore(2) so that it reports PG_CACHED pages as resident. MFC after: 2 weeks
# d1aa86e1	06-Apr-2012	Attilio Rao <attilio@FreeBSD.org>	Staticize vm_page_cache_remove(). Reviewed by: alc
# 263811f7	27-Jan-2012	Kip Macy <kmacy@FreeBSD.org>	exclude kmem_alloc'ed ARC data buffers from kernel minidumps on amd64 excluding other allocations including UMA now entails the addition of a single flag to kmem_alloc or uma zone create Reviewed by: alc, avg MFC after: 2 weeks
# c68c3537	05-Dec-2011	Alan Cox <alc@FreeBSD.org>	Introduce vm_reserv_alloc_contig() and teach vm_page_alloc_contig() how to use superpage reservations. So, for the first time, kernel virtual memory that is allocated by contigmalloc(), kmem_alloc_attr(), and kmem_alloc_contig() can be promoted to superpages. In fact, even a series of small contigmalloc() allocations may collectively result in a promoted superpage. Eliminate some duplication of code in vm_reserv_alloc_page(). Change the type of vm_reserv_reclaim_contig()'s first parameter in order that it be consistent with other vm_*_contig() functions. Tested by: marius (sparc64)
# dc874f98	30-Nov-2011	Konstantin Belousov <kib@FreeBSD.org>	Rename vm_page_set_valid() to vm_page_set_valid_range(). The vm_page_set_valid() is the most reasonable name for the m->valid accessor. Reviewed by: attilio, alc
# cf1911a9	29-Nov-2011	Konstantin Belousov <kib@FreeBSD.org>	Hide the internals of vm_page_lock(9) from the loadable modules. Since the address of vm_page lock mutex depends on the kernel options, it is easy for module to get out of sync with the kernel. No vm_page_lockptr() accessor is provided for modules. It can be added later if needed, unless proper KPI is developed to serve the needs. Reviewed by: attilio, alc MFC after: 3 weeks
# 5ff276b7	16-Nov-2011	Alan Cox <alc@FreeBSD.org>	Eliminate end-of-line white space.
# fbd80bd0	16-Nov-2011	Alan Cox <alc@FreeBSD.org>	Refactor the code that performs physically contiguous memory allocation, yielding a new public interface, vm_page_alloc_contig(). This new function addresses some of the limitations of the current interfaces, contigmalloc() and kmem_alloc_contig(). For example, the physically contiguous memory that is allocated with those interfaces can only be allocated to the kernel vm object and must be mapped into the kernel virtual address space. It also provides functionality that vm_phys_alloc_contig() doesn't, such as wiring the returned pages. Moreover, unlike that function, it respects the low water marks on the paging queues and wakes up the page daemon when necessary. That said, at present, this new function can't be applied to all types of vm objects. However, that restriction will be eliminated in the coming weeks. From a design standpoint, this change also addresses an inconsistency between vm_phys_alloc_contig() and the other vm_phys_alloc*() functions. Specifically, vm_phys_alloc_contig() manipulated vm_page fields that other functions in vm/vm_phys.c didn't. Moreover, vm_phys_alloc_contig() knew about vnodes and reservations. Now, vm_page_alloc_contig() is responsible for these things. Reviewed by: kib Discussed with: jhb
# c835bd16	05-Nov-2011	Alan Cox <alc@FreeBSD.org>	Wake up the page daemon in vm_page_alloc_freelist() if it couldn't allocate the requested page because too few pages are cached or free. Document the VM_ALLOC_COUNT() option to vm_page_alloc() and vm_page_alloc_freelist(). Make style changes to vm_page_alloc() and vm_page_alloc_freelist(), such as using a variable name that more closely corresponds to the comments.
# 561cc9fc	05-Nov-2011	Konstantin Belousov <kib@FreeBSD.org>	Provide typedefs for the type of bit mask for the page bits. Use the defined types instead of int when manipulating masks. Supposedly, it could fix support for 32KB page size in the machine-independend VM layer. Reviewed by: alc MFC after: 2 weeks
# 83937680	01-Nov-2011	Alan Cox <alc@FreeBSD.org>	Add support for VM_ALLOC_WIRED and VM_ALLOC_ZERO to vm_page_alloc_freelist() and use these new options in the mips pmap. Wake up the page daemon in vm_page_alloc_freelist() if the number of free and cached pages becomes too low. Tidy up vm_page_alloc_init(). In particular, add a comment about an important restriction on its use. Tested by: jchandra@
# 125b695b	27-Oct-2011	Alan Cox <alc@FreeBSD.org>	Tidy up the comment at the head of vm_page_alloc, and mention that the returned page has the flag VPO_BUSY set.
# 9c60ca32	25-Oct-2011	Alan Cox <alc@FreeBSD.org>	Speed up vm_page_cache() and vm_page_remove() by checking for a few common cases that can be handled in constant time. The insight being that a page's parent in the vm object's tree is very often its predecessor or successor in the vm object's ordered memq. Tested by: jhb MFC after: 10 days
# 17514c1b	28-Sep-2011	Konstantin Belousov <kib@FreeBSD.org>	Style nit. Submitted by: jhb MFC after: 2 weeks
# 2042bb37	28-Sep-2011	Konstantin Belousov <kib@FreeBSD.org>	Fix grammar. Submitted by: bf MFC after: 2 weeks
# abb9b935	28-Sep-2011	Konstantin Belousov <kib@FreeBSD.org>	Use the trick of performing the atomic operation on the contained aligned word to handle the dirty mask updates in vm_page_clear_dirty_mask(). Remove the vm page queue lock around vm_page_dirty() call in vm_fault_hold() the sole purpose of which was to protect dirty on architectures which does not provide short or byte-wide atomics. Reviewed by: alc, attilio Tested by: flo (sparc64) MFC after: 2 weeks
# 3407fefe	06-Sep-2011	Konstantin Belousov <kib@FreeBSD.org>	Split the vm_page flags PG_WRITEABLE and PG_REFERENCED into atomic flags field. Updates to the atomic flags are performed using the atomic ops on the containing word, do not require any vm lock to be held, and are non-blocking. The vm_page_aflag_set(9) and vm_page_aflag_clear(9) functions are provided to modify afalgs. Document the changes to flags field to only require the page lock. Introduce vm_page_reference(9) function to provide a stable KPI and KBI for filesystems like tmpfs and zfs which need to mark a page as referenced. Reviewed by: alc, attilio Tested by: marius, flo (sparc64); andreast (powerpc, powerpc64) Approved by: re (bz)
# d98d0ce2	09-Aug-2011	Konstantin Belousov <kib@FreeBSD.org>	- Move the PG_UNMANAGED flag from m->flags to m->oflags, renaming the flag to VPO_UNMANAGED (and also making the flag protected by the vm object lock, instead of vm page queue lock). - Mark the fake pages with both PG_FICTITIOUS (as it is now) and VPO_UNMANAGED. As a consequence, pmap code now can use use just VPO_UNMANAGED to decide whether the page is unmanaged. Reviewed by: alc Tested by: pho (x86, previous version), marius (sparc64), marcel (arm, ia64, powerpc), ray (mips) Sponsored by: The FreeBSD Foundation Approved by: re (bz)
# 1bfec3df	22-Jun-2011	Alan Cox <alc@FreeBSD.org>	Revert to using the page queues lock in vm_page_clear_dirty_mask() on MIPS. (At present, although atomic_clear_char() is defined by atomic.h on MIPS, it is not actually implemented by support.S.)
# 3c76db4c	19-Jun-2011	Alan Cox <alc@FreeBSD.org>	Precisely document the synchronization rules for the page's dirty field. (Saying that the lock on the object that the page belongs to must be held only represents one aspect of the rules.) Eliminate the use of the page queues lock for atomically performing read- modify-write operations on the dirty field when the underlying architecture supports atomic operations on char and short types. Document the fact that 32KB pages aren't really supported. Reviewed by: attilio, kib
# 3b1025d2	11-Jun-2011	Konstantin Belousov <kib@FreeBSD.org>	Assert that page is VPO_BUSY or page owner object is locked in vm_page_undirty(). The assert is not precise due to VPO_BUSY owner to tracked, so assertion does not catch the case when VPO_BUSY is owned by other thread. Reviewed by: alc
# 10cf2560	11-Mar-2011	Alan Cox <alc@FreeBSD.org>	Eliminate duplication of the fake page code and zone by the device and sg pagers. Reviewed by: jhb
# e6ffa214	17-Feb-2011	Alan Cox <alc@FreeBSD.org>	Remove pmap fields that are either unused or not fully implemented. Discussed with: kib
# d7b20e4b	11-Feb-2011	Alan Cox <alc@FreeBSD.org>	Retire VFS_BIO_DEBUG. Convert those checks that were still valid into KASSERT()s and eliminate the rest. Replace excessive printf()s and a panic() in bufdone_finish() with a KASSERT() in vm_page_io_finish(). Reviewed by: kib
# 3d05198e	30-Jan-2011	Alan Cox <alc@FreeBSD.org>	Release the free page queues lock earlier in vm_page_alloc(). Discussed with: kib@
# 4053b05b	21-Jan-2011	Sergey Kandaurov <pluknet@FreeBSD.org>	Make MSGBUF_SIZE kernel option a loader tunable kern.msgbufsize. Submitted by: perryh pluto.rain.com (previous version) Reviewed by: jhb Approved by: kib (mentor) Tested by: universe
# 4c6a2e7a	16-Jan-2011	Alan Cox <alc@FreeBSD.org>	Shift responsibility for synchronizing access to the page's act_count field to the object's lock. Reviewed by: kib@
# 9648f344	16-Jan-2011	Alan Cox <alc@FreeBSD.org>	Clean up the start of vm_page_alloc(). In particular, eliminate an assertion that is no longer required. Long ago, calls to vm_page_alloc() from an interrupt handler had to specify VM_ALLOC_INTERRUPT so that vm_page_alloc() would not attempt to reclaim a PQ_CACHE page from another vm object. Today, with the synchronization on a vm object's collection of PQ_CACHE pages, this is no longer an issue. In fact, VM_ALLOC_INTERRUPT now reclaims PQ_CACHE pages just like VM_ALLOC_{NORMAL,SYSTEM}. MFC after: 3 weeks
# 27772ddf	08-Jan-2011	Alan Cox <alc@FreeBSD.org>	Eliminate a redundant alignment directive on the page locks array.
# ce8a13bd	08-Jan-2011	Alan Cox <alc@FreeBSD.org>	Eliminate the counting of vm_page_pa_tryrelock calls. We really don't need it anymore. Moreover, its implementation had a type mismatch, a long is not necessarily an uint64_t. (This mismatch was hidden by casting.) Move the remaining two counters up a level in the sysctl hierarchy. There is no reason for them to be under the vm.pmap node. Reviewed by: kib
# 17f6a17b	02-Jan-2011	Alan Cox <alc@FreeBSD.org>	Release the page lock early in vm_pageout_clean(). There is no reason to hold this lock until the end of the function. With the aforementioned change to vm_pageout_clean(), page locks don't need to support recursive (MTX_RECURSE) or duplicate (MTX_DUPOK) acquisitions. Reviewed by: kib
# 3280870d	28-Dec-2010	Konstantin Belousov <kib@FreeBSD.org>	Move the increment of vm object generation count into vm_object_set_writeable_dirty(). Fix an issue where restart of the scan in vm_object_page_clean() did not removed write permissions for newly added pages or, if the mapping for some already scanned page changed to writeable due to fault. Merge the two loops in vm_object_page_clean(), doing the remove of write permission and cleaning in the same loop. The restart of the loop then correctly downgrade writeable mappings. Fix an issue where a second caller to msync() might actually return before the first caller had actually completed flushing the pages. Clear the OBJ_MIGHTBEDIRTY flag after the cleaning loop, not before. Calls to pmap_is_modified() are not needed after pmap_remove_write() there. Proposed, reviewed and tested by: alc MFC after: 1 week
# 8c22654d	17-Dec-2010	Alan Cox <alc@FreeBSD.org>	Implement and use a single optimized function for unholding a set of pages. Reviewed by: kib@
# 48772ca4	09-Dec-2010	Jayachandran C. <jchandra@FreeBSD.org>	Revert the vm/vm_page.c change in r216317. This adds back changes in r216141, which was reverted by the above check in.
# aa93efed	08-Dec-2010	Jayachandran C. <jchandra@FreeBSD.org>	swi_vm() for mips.
# 6f1a8765	02-Dec-2010	Warner Losh <imp@FreeBSD.org>	To make minidumps work properly on mips for memory that's direct mapped and entered via vm_page_setup, keep track of it like we do for amd64. # A separate commit will be made to move this to a capability-based ifdef # rather than arch-based ifdef. Submitted by: alc@ MFC after: 1 week
# 05cb58f6	30-Nov-2010	Alan Cox <alc@FreeBSD.org>	Correct an error in the allocation of the vm_page_dump array in vm_page_startup(). Specifically, the dump_avail array should be used instead of the phys_avail array to calculate the size of vm_page_dump. For example, the pages for the message buffer are allocated prior to vm_page_startup() by subtracting them from the last entry in the phys_avail array, but the first thing that vm_page_startup() does after creating the vm_page_dump array is to set the bits corresponding to the message buffer pages in that array. However, these bits might not actually exist in the array, because the size of the array is determined by the current value in the last entry of the phys_avail array. In general, the only reason why this doesn't always result in an out-of-bounds array access is that the size of the vm_page_dump array is rounded up to the next page boundary. This change eliminates that dependence on rounding (and luck). MFC after: 6 weeks
# aa546366	27-Nov-2010	Jayachandran C. <jchandra@FreeBSD.org>	Fix issue noted by alc while reviewing r215938: The current implementation of vm_page_alloc_freelist() does not handle order > 0 correctly. Remove order parameter to the function and use it only for order 0 pages. Submitted by: alc
# 00f8bffc	19-Nov-2010	Alan Cox <alc@FreeBSD.org>	Reduce the amount of detail printed by vm_page_free_toq() when it panics. Reviewed by: kib
# 4166faae	18-Nov-2010	Konstantin Belousov <kib@FreeBSD.org>	Only increment object generation count when inserting the page into object page list. The only use of object generation count now is a restart of the scan in vm_object_page_clean(), which makes sense to do on the page addition. Page removals do not affect the dirtiness of the object, as well as manipulations with the shadow chain. Suggested and reviewed by: alc MFC after: 1 week
# 903ba3da	06-Nov-2010	Oleksandr Tymoshenko <gonzo@FreeBSD.org>	- Add minidump support for FreeBSD/mips
# a7d5f7eb	19-Oct-2010	Jamie Gritton <jamie@FreeBSD.org>	A new jail(8) with a configuration file, to replace the work currently done by /etc/rc.d/jail.
# a9b89cf1	03-Sep-2010	Andriy Gapon <avg@FreeBSD.org>	vm_page.c: include opt_msgbuf.h for MSGBUF_SIZE use in vm_page_startup vm_page_startup uses MSGBUF_SIZE value for adding msgbuf pages to minidump. If opt_msgbuf.h is not included and MSGBUF_SIZE is overriden in kernel config, then not all msgbuf pages will be dumped. And most importantly, struct msgbuf itself will not be included. Thus the dump would look corrupted/incomplete to tools like kgdb, dmesg, etc that try to access struct msgbuf as one of the first things they do when working on a crash dump. MFC after: 5 days
# 49ca10d4	21-Jul-2010	Jayachandran C. <jchandra@FreeBSD.org>	Redo the page table page allocation on MIPS, as suggested by alc@. The UMA zone based allocation is replaced by a scheme that creates a new free page list for the KSEG0 region, and a new function in sys/vm that allocates pages from a specific free page list. This also fixes a race condition introduced by the UMA based page table page allocation code. Dropping the page queue and pmap locks before the call to uma_zfree, and re-acquiring them afterwards will introduce a race condtion(noted by alc@). The changes are : - Revert the earlier changes in MIPS pmap.c that added UMA zone for page table pages. - Add a new freelist VM_FREELIST_HIGHMEM to MIPS vmparam.h for memory that is not directly mapped (in 32bit kernel). Normal page allocations will first try the HIGHMEM freelist and then the default(direct mapped) freelist. - Add a new function 'vm_page_t vm_page_alloc_freelist(int flind, int order, int req)' to vm/vm_page.c to allocate a page from a specified freelist. The MIPS page table pages will be allocated using this function from the freelist containing direct mapped pages. - Move the page initialization code from vm_phys_alloc_contig() to a new function vm_page_alloc_init(), and use this function to initialize pages in vm_page_alloc_freelist() too. - Split the function vm_phys_alloc_pages(int pool, int order) to create vm_phys_alloc_freelist_pages(int flind, int pool, int order), and use this function from both vm_page_alloc_freelist() and vm_phys_alloc_pages(). Reviewed by: alc
# b99348e5	09-Jul-2010	Alan Cox <alc@FreeBSD.org>	Add support for the VM_ALLOC_COUNT() hint to vm_page_alloc(). Consequently, the maintenance of vm_pageout_deficit can be localized to just two places: vm_page_alloc() and vm_pageout_scan(). This change also corrects an off-by-one error in the maintenance of vm_pageout_deficit. Historically, the buffer cache functions, allocbuf() and vm_hold_load_pages(), have not taken into account that vm_page_alloc() already increments vm_pageout_deficit by one. Reviewed by: kib
# 1d9e77f6	08-Jul-2010	Konstantin Belousov <kib@FreeBSD.org>	Make VM_ALLOC_RETRY flag mandatory for vm_page_grab(). Assert that the flag is always provided, and unconditionally retry after sleep for the busy page or failed allocation. The intent is to remove VM_ALLOC_RETRY eventually. Proposed and reviewed by: alc
# 5f195aa3	05-Jul-2010	Konstantin Belousov <kib@FreeBSD.org>	Add the ability for the allocflag argument of the vm_page_grab() to specify the increment of vm_pageout_deficit when sleeping due to page shortage. Then, in allocbuf(), the code to allocate pages when extending vmio buffer can be replaced by a call to vm_page_grab(). Suggested and reviewed by: alc MFC after: 2 weeks
# b382c10a	04-Jul-2010	Konstantin Belousov <kib@FreeBSD.org>	Introduce a helper function vm_page_find_least(). Use it in several places, which inline the function. Reviewed by: alc Tested by: pho MFC after: 1 week
# b64400a0	03-Jul-2010	Alan Cox <alc@FreeBSD.org>	Improve the comment and man page for vm_page_alloc(). Specifically, document one of the optional flags; clarify which of the flags are optional (and which are not), and remove mention of a restriction on the reclamation of cached pages that no longer holds since version 7. MFC after: 1 week
# 9cf51988	02-Jul-2010	Alan Cox <alc@FreeBSD.org>	With the demise of page coloring, the page queue macros no longer serve any useful purpose. Eliminate them. Reviewed by: kib
# 91b4f427	21-Jun-2010	Alan Cox <alc@FreeBSD.org>	Introduce vm_page_next() and vm_page_prev(), and use them in vm_pageout_clean(). When iterating over a range of pages, these functions can be cheaper than vm_page_lookup() because their implementation takes advantage of the vm_object's memq being ordered. Reviewed by: kib@ MFC after: 3 weeks
# 9ee2165f	14-Jun-2010	Alan Cox <alc@FreeBSD.org>	Eliminate checks for a page having a NULL object in vm_pageout_scan() and vm_pageout_page_stats(). These checks were recently introduced by the first page locking commit, r207410, but they are not needed. At the same time, eliminate some redundant accesses to the page's object field. (These accesses should have neen eliminated by r207410.) Make the assertion in vm_page_flag_set() stricter. Specifically, only managed pages should have PG_WRITEABLE set. Add a comment documenting an assertion to vm_page_flag_clear(). It has long been the case that fictitious pages have their wire count permanently set to one. Add comments to vm_page_wire() and vm_page_unwire() documenting this. Add assertions to these functions as well. Update the comment describing vm_page_unwire(). Much of the old comment had little to do with vm_page_unwire(), but a lot to do with _vm_page_deactivate(). Move relevant parts of the old comment to _vm_page_deactivate(). Only pages that belong to an object can be paged out. Therefore, it is pointless for vm_page_unwire() to acquire the page queues lock and enqueue such pages in one of the paging queues. Generally speaking, such pages are immediately freed after the call to vm_page_unwire(). Previously, it was the call to vm_page_free() that reacquired the page queues lock and removed these pages from the paging queues. Now, we will never acquire the page queues lock for this case. (It is also worth noting that since both vm_page_unwire() and vm_page_free() occurred with the page locked, the page daemon never saw the page with its object field set to NULL.) Change the panic with vm_page_unwire() to provide a more precise message. Reviewed by: kib@
# ce186587	10-Jun-2010	Alan Cox <alc@FreeBSD.org>	Reduce the scope of the page queues lock and the number of PG_REFERENCED changes in vm_pageout_object_deactivate_pages(). Simplify this function's inner loop using TAILQ_FOREACH(), and shorten some of its overly long lines. Update a stale comment. Assert that PG_REFERENCED may be cleared only if the object containing the page is locked. Add a comment documenting this. Assert that a caller to vm_page_requeue() holds the page queues lock, and assert that the page is on a page queue. Push down the page queues lock into pmap_ts_referenced() and pmap_page_exists_quick(). (As of now, there are no longer any pmap functions that expect to be called with the page queues lock held.) Neither pmap_ts_referenced() nor pmap_page_exists_quick() should ever be passed an unmanaged page. Assert this rather than returning "0" and "FALSE" respectively. ARM: Simplify pmap_page_exists_quick() by switching to TAILQ_FOREACH(). Push down the page queues lock inside of pmap_clearbit(), simplifying pmap_clear_modify(), pmap_clear_reference(), and pmap_remove_write(). Additionally, this allows for avoiding the acquisition of the page queues lock in some cases. PowerPC/AIM: moea_page_exits_quick() and moea_page_wired_mappings() will never be called before pmap initialization is complete. Therefore, the check for moea_initialized can be eliminated. Push down the page queues lock inside of moea_clear_bit(), simplifying moea_clear_modify() and moea_clear_reference(). The last parameter to moea_clear_bit() is never used. Eliminate it. PowerPC/BookE: Simplify mmu_booke_page_exists_quick()'s control flow. Reviewed by: kib@
# 2bbfbc3f	03-Jun-2010	Konstantin Belousov <kib@FreeBSD.org>	Add assertion and comment in vm_page_flag_set() describing the expectations when the PG_WRITEABLE flag is set. Reviewed by: alc
# f4e10cda	02-Jun-2010	Alan Cox <alc@FreeBSD.org>	Maintain the pretense that we support 32KB pages for the sake of the ia64 LINT build.
# c8fa8709	02-Jun-2010	Alan Cox <alc@FreeBSD.org>	Minimize the use of the page queues lock for synchronizing access to the page's dirty field. With the exception of one case, access to this field is now synchronized by the object lock.
# c46b90e9	26-May-2010	Alan Cox <alc@FreeBSD.org>	Push down page queues lock acquisition in pmap_enter_object() and pmap_is_referenced(). Eliminate the corresponding page queues lock acquisitions from vm_map_pmap_enter() and mincore(), respectively. In mincore(), this allows some additional cases to complete without ever acquiring the page queues lock. Assert that the page is managed in pmap_is_referenced(). On powerpc/aim, push down the page queues lock acquisition from moea_is_modified() and moea_is_referenced() into moea*_query_bit(). Again, this will allow some additional cases to complete without ever acquiring the page queues lock. Reorder a few statements in vm_page_dontneed() so that a race can't lead to an old reference persisting. This scenario is described in detail by a comment. Correct a spelling error in vm_page_dontneed(). Assert that the object is locked in vm_page_clear_dirty(), and restrict the page queues lock assertion to just those cases in which the page is currently writeable. Add object locking to vnode_pager_generic_putpages(). This was the one and only place where vm_page_clear_dirty() was being called without the object being locked. Eliminate an unnecessary vm_page_lock() around vnode_pager_setsize()'s call to vm_page_clear_dirty(). Change vnode_pager_generic_putpages() to the modern-style of function definition. Also, change the name of one of the parameters to follow virtual memory system naming conventions. Reviewed by: kib
# e98d019d	24-May-2010	Alan Cox <alc@FreeBSD.org>	Eliminate the acquisition and release of the page queues lock from vfs_busy_pages(). It is no longer needed. Submitted by: kib
# 567e51e1	24-May-2010	Alan Cox <alc@FreeBSD.org>	Roughly half of a typical pmap_mincore() implementation is machine- independent code. Move this code into mincore(), and eliminate the page queues lock from pmap_mincore(). Push down the page queues lock into pmap_clear_modify(), pmap_clear_reference(), and pmap_is_modified(). Assert that these functions are never passed an unmanaged page. Eliminate an inaccurate comment from powerpc/powerpc/mmu_if.m: Contrary to what the comment says, pmap_mincore() is not simply an optimization. Without a complete pmap_mincore() implementation, mincore() cannot return either MINCORE_MODIFIED or MINCORE_REFERENCED because only the pmap can provide this information. Eliminate the page queues lock from vfs_setdirty_locked_object(), vm_pageout_clean(), vm_object_page_collect_flush(), and vm_object_page_clean(). Generally speaking, these are all accesses to the page's dirty field, which are synchronized by the containing vm object's lock. Reduce the scope of the page queues lock in vm_object_madvise() and vm_page_dontneed(). Reviewed by: kib (an earlier version)
# aa12e8b7	18-May-2010	Alan Cox <alc@FreeBSD.org>	The page queues lock is no longer required by vm_page_set_invalid(), so eliminate it. Assert that the object containing the page is locked in vm_page_test_dirty(). Perform some style clean up while I'm here. Reviewed by: kib
# 9ab6032f	16-May-2010	Alan Cox <alc@FreeBSD.org>	On entry to pmap_enter(), assert that the page is busy. While I'm here, make the style of assertion used by pmap_enter() consistent across all architectures. On entry to pmap_remove_write(), assert that the page is neither unmanaged nor fictitious, since we cannot remove write access to either kind of page. With the push down of the page queues lock, pmap_remove_write() cannot condition its behavior on the state of the PG_WRITEABLE flag if the page is busy. Assert that the object containing the page is locked. This allows us to know that the page will neither become busy nor will PG_WRITEABLE be set on it while pmap_remove_write() is running. Correct a long-standing bug in vm_page_cowsetup(). We cannot possibly do copy-on-write-based zero-copy transmit on unmanaged or fictitious pages, so don't even try. Previously, the call to pmap_remove_write() would have failed silently.
# a4bc2c89	16-May-2010	Alan Cox <alc@FreeBSD.org>	Correct an error of omission in r202897: Now that amd64 uses the direct map to access the message buffer, we must explicitly request that the underlying physical pages are included in a crash dump. Reported by: Benjamin Kaduk
# eee9d992	09-May-2010	Alan Cox <alc@FreeBSD.org>	Push down the acquisition of the page queues lock into vm_pageq_remove(). (This eliminates a surprising number of page queues lock acquisitions by vm_fault() because the page's queue is PQ_NONE and thus the page queues lock is not needed to remove the page from a queue.)
# 34e7251f	08-May-2010	Alan Cox <alc@FreeBSD.org>	Minimize the scope of the page queues lock in vm_fault().
# 3c4a2440	08-May-2010	Alan Cox <alc@FreeBSD.org>	Push down the page queues into vm_page_cache(), vm_page_try_to_cache(), and vm_page_try_to_free(). Consequently, push down the page queues lock into pmap_enter_quick(), pmap_page_wired_mapped(), pmap_remove_all(), and pmap_remove_write(). Push down the page queues lock into Xen's pmap_page_is_mapped(). (I overlooked the Xen pmap in r207702.) Switch to a per-processor counter for the total number of pages cached.
# 03679e23	07-May-2010	Alan Cox <alc@FreeBSD.org>	Push down the page queues lock into vm_page_activate().
# 9402dff3	06-May-2010	Alan Cox <alc@FreeBSD.org>	Push down the page queues lock into vm_page_deactivate(). Eliminate an incorrect comment.
# 7024db1d	06-May-2010	Alan Cox <alc@FreeBSD.org>	Push down the page queues lock inside of vm_page_free_toq() and pmap_page_is_mapped() in preparation for removing page queues locking around calls to vm_page_free(). Setting aside the assertion that calls pmap_page_is_mapped(), vm_page_free_toq() now acquires and holds the page queues lock just long enough to actually add or remove the page from the paging queues. Update vm_page_unhold() to reflect the above change.
# 5ac59343	05-May-2010	Alan Cox <alc@FreeBSD.org>	Acquire the page lock around all remaining calls to vm_page_free() on managed pages that didn't already have that lock held. (Freeing an unmanaged page, such as the various pmaps use, doesn't require the page lock.) This allows a change in vm_page_remove()'s locking requirements. It now expects the page lock to be held instead of the page queues lock. Consequently, the page queues lock is no longer required at all by callers to vm_page_rename(). Discussed with: kib
# e3ef0d2f	04-May-2010	Alan Cox <alc@FreeBSD.org>	Push down the acquisition of the page queues lock into vm_page_unwire(). Update the comment describing which lock should be held on entry to vm_page_wire(). Reviewed by: kib
# a7283d32	04-May-2010	Alan Cox <alc@FreeBSD.org>	Add page locking to the vm_page_cow* functions. Push down the acquisition and release of the page queues lock into vm_page_wire(). Reviewed by: kib
# 0c41a69e	03-May-2010	Alan Cox <alc@FreeBSD.org>	Add lock assertions.
# 2d5d7f7f	03-May-2010	Alan Cox <alc@FreeBSD.org>	Acquire the page lock around vm_page_wire() in vm_page_grab(). Assert that the page lock is held in vm_page_wire().
# 9f2512ba	03-May-2010	Alan Cox <alc@FreeBSD.org>	Assert that the page queues lock is held in vm_page_remove() and vm_page_unwire() only if the page is managed, i.e., pageable.
# b8d36afc	02-May-2010	Alan Cox <alc@FreeBSD.org>	Add page lock assertions where we access the page's hold_count.
# b88b6c9d	02-May-2010	Alan Cox <alc@FreeBSD.org>	It makes no sense for vm_page_sleep_if_busy()'s helper, vm_page_sleep(), to unconditionally set PG_REFERENCED on a page before sleeping. In many cases, it's perfectly ok for the page to disappear, i.e., be reclaimed by the page daemon, before the caller to vm_page_sleep() is reawakened. Instead, we now explicitly set PG_REFERENCED in those cases where having the page persist until the caller is awakened is clearly desirable. Note, however, that setting PG_REFERENCED on the page is still only a hint, and not a guarantee that the page should persist.
# 6d74d042	29-Apr-2010	Kip Macy <kmacy@FreeBSD.org>	don't allow unsynchronized free in vm_page_unhold
# 2965a453	29-Apr-2010	Kip Macy <kmacy@FreeBSD.org>	On Alan's advice, rather than do a wholesale conversion on a single architecture from page queue lock to a hashed array of page locks (based on a patch by Jeff Roberson), I've implemented page lock support in the MI code and have only moved vm_page's hold_count out from under page queue mutex to page lock. This changes pmap_extract_and_hold on all pmaps. Supported by: Bitgravity Inc. Discussed with: alc, jeffr, and kib
# 55146b83	09-Apr-2010	Alan Cox <alc@FreeBSD.org>	MFC r206174 vm_reserv_alloc_page() should never be called on an OBJT_SG object, just as it is never called on an OBJT_DEVICE object. (This change should have been included in r195840.)
# f6d00b38	05-Apr-2010	Alan Cox <alc@FreeBSD.org>	vm_reserv_alloc_page() should never be called on an OBJT_SG object, just as it is never called on an OBJT_DEVICE object. (This change should have been included in r195840.) Reported by: dougb@, avg@ MFC after: 3 days
# 95e5a090	02-Mar-2010	Konstantin Belousov <kib@FreeBSD.org>	MFC r204415: Update comment for vm_page_alloc(9), listing all acceptable flags [1]. Note that the function does not sleep, it can block.
# ddb16cfc	27-Feb-2010	Konstantin Belousov <kib@FreeBSD.org>	Update comment for vm_page_alloc(9), listing all acceptable flags [1]. Note that the function does not sleep, it can block. Submitted by: Giovanni Trematerra <giovanni.trematerra gmail com> [1] MFC after: 3 days
# e67e0775	04-Oct-2009	Alan Cox <alc@FreeBSD.org>	Align and pad the page queue and free page queue locks so that the linker can't possibly place them together within the same cache line. MFC after: 3 weeks
# 01381811	24-Jul-2009	John Baldwin <jhb@FreeBSD.org>	Add a new type of VM object: OBJT_SG. An OBJT_SG object is very similar to a device pager (OBJT_DEVICE) object in that it uses fictitious pages to provide aliases to other memory addresses. The primary difference is that it uses an sglist(9) to determine the physical addresses for a given offset into the object instead of invoking the d_mmap() method in a device driver. Reviewed by: alc Approved by: re (kensmith) MFC after: 2 weeks
# 13de7221	17-Jul-2009	Alan Cox <alc@FreeBSD.org>	An addendum to r195649, "Add support to the virtual memory system for configuring machine-dependent memory attributes...": Don't set the memory attribute for a "real" page that is allocated to a device object in vm_page_alloc(). It is a pointless act, because the device pager replaces this "real" page with a "fake" page and sets the memory attribute on that "fake" page. Eliminate pointless code from pmap_cache_bits() on amd64. Employ the "Self Snoop" feature supported by some x86 processors to avoid cache flushes in the pmap. Approved by: re (kib)
# 3153e878	12-Jul-2009	Alan Cox <alc@FreeBSD.org>	Add support to the virtual memory system for configuring machine- dependent memory attributes: Rename vm_cache_mode_t to vm_memattr_t. The new name reflects the fact that there are machine-dependent memory attributes that have nothing to do with controlling the cache's behavior. Introduce vm_object_set_memattr() for setting the default memory attributes that will be given to an object's pages. Introduce and use pmap_page_{get,set}_memattr() for getting and setting a page's machine-dependent memory attributes. Add full support for these functions on amd64 and i386 and stubs for them on the other architectures. The function pmap_page_set_memattr() is also responsible for any other machine-dependent aspects of changing a page's memory attributes, such as flushing the cache or updating the direct map. The uses include kmem_alloc_contig(), vm_page_alloc(), and the device pager: kmem_alloc_contig() can now be used to allocate kernel memory with non-default memory attributes on amd64 and i386. vm_page_alloc() and the device pager will set the memory attributes for the real or fictitious page according to the object's default memory attributes. Update the various pmap functions on amd64 and i386 that map pages to incorporate each page's memory attributes in the mapping. Notes: (1) Inherent to this design are safety features that prevent the specification of inconsistent memory attributes by different mappings on amd64 and i386. In addition, the device pager provides a warning when a device driver creates a fictitious page with memory attributes that are inconsistent with the real page that the fictitious page is an alias for. (2) Storing the machine-dependent memory attributes for amd64 and i386 as a dedicated "int" in "struct md_page" represents a compromise between space efficiency and the ease of MFCing these changes to RELENG_7. In collaboration with: jhb Approved by: re (kib)
# 6f0489c6	20-Jun-2009	Alan Cox <alc@FreeBSD.org>	Strive for greater consistency among the places that implement real, fictious, and contiguous page allocation. Eliminate unnecessary reinitialization of a page's fields.
# edd16ab1	30-May-2009	Alan Cox <alc@FreeBSD.org>	Add assertions in two places where a page's valid or dirty bits are changed.
# 1c1b26f2	12-May-2009	Alan Cox <alc@FreeBSD.org>	Eliminate page queues locking from bufdone_finish() through the following changes: Rename vfs_page_set_valid() to vfs_page_set_validclean() to reflect what this function actually does. Suggested by: tegge Introduce a new version of vfs_page_set_valid() that does no more than what the function's name implies. Specifically, it does not update the page's dirty mask, and thus it does not require the page queues lock to be held. Update two of the three callers to the old vfs_page_set_valid() to call vfs_page_set_validclean() instead because they actually require the page's dirty mask to be cleared. Introduce vm_page_set_valid(). Reviewed by: tegge
# 641e2829	03-Jan-2009	Konstantin Belousov <kib@FreeBSD.org>	Extend the struct vm_page wire_count to u_int to avoid the overflow of the counter, that may happen when too many sendfile(2) calls are being executed with this vnode [1]. To keep the size of the struct vm_page and offsets of the fields accessed by out-of-tree modules, swap the types and locations of the wire_count and cow fields. Add safety checks to detect cow overflow and force fallback to the normal copy code for zero-copy sockets. [2] Reported by: Anton Yuzhaninov <citrin citrin ru> [1] Suggested by: alc [2] Reviewed by: alc MFC after: 2 weeks
# 8e321b79	06-Nov-2008	Rafal Jaworowski <raj@FreeBSD.org>	Support kernel crash mini dumps on ARM architecture. Obtained from: Juniper Networks, Semihalf
# d7f03759	19-Oct-2008	Ulf Lilleengen <lulf@FreeBSD.org>	- Import the HEAD csup code which is the basis for the cvsmode work.
# a8a478fc	26-Sep-2008	Ed Maste <emaste@FreeBSD.org>	Move CTASSERT from header file to source file, per implementation note now in the CTASSERT man page.
# 4b34502e	17-Aug-2008	Kip Macy <kmacy@FreeBSD.org>	Work around differences in page allocation for initial page tables on xen MFC after: 1 month
# 8bcd3b19	06-Jun-2008	Alan Cox <alc@FreeBSD.org>	Essentially, neither madvise(..., MADV_DONTNEED) nor madvise(..., MADV_FREE) work. (Moreover, I don't believe that they have ever worked as intended.) The explanation is fairly simple. Both MADV_DONTNEED and MADV_FREE perform vm_page_dontneed() on each page within the range given to madvise(). This function moves the page to the inactive queue. Specifically, if the page is clean, it is moved to the head of the inactive queue where it is first in line for processing by the page daemon. On the other hand, if it is dirty, it is placed at the tail. Let's further examine the case in which the page is clean. Recall that the page is at the head of the line for processing by the page daemon. The expectation of vm_page_dontneed()'s author was that the page would be transferred from the inactive queue to the cache queue by the page daemon. (Once the page is in the cache queue, it is, in effect, free, that is, it can be reallocated to a new vm object by vm_page_alloc() if it isn't reactivated quickly enough by a user of the old vm object.) The trouble is that nowhere in the execution of either MADV_DONTNEED or MADV_FREE is either the machine-independent reference flag (PG_REFERENCED) or the reference bit in any page table entry (PTE) mapping the page cleared. Consequently, the immediate reaction of the page daemon is to reactivate the page because it is referenced. In effect, the madvise() was for naught. The case in which the page was dirty is not too different. Instead of being laundered, the page is reactivated. Note: The essential difference between MADV_DONTNEED and MADV_FREE is that MADV_FREE clears a page's dirty field. So, MADV_FREE is always executing the clean case above. This revision changes vm_page_dontneed() to clear both the machine- independent reference flag (PG_REFERENCED) and the reference bit in all PTEs mapping the page. MFC after: 6 weeks
# f5788387	15-May-2008	Alan Cox <alc@FreeBSD.org>	Don't call vm_reserv_alloc_page() on device-backed objects. Otherwise, the system may panic because there is no reservation structure corresponding to the physical address of the device memory. Reported by: Giorgos Keramidas
# 44aab2c3	06-Apr-2008	Alan Cox <alc@FreeBSD.org>	Introduce vm_reserv_reclaim_contig(). This function is used by contigmalloc(9) as a last resort to steal pages from an inactive, partially-used superpage reservation. Rename vm_reserv_reclaim() to vm_reserv_reclaim_inactive() and refactor it so that a separate subroutine is responsible for breaking the selected reservation. This subroutine is also used by vm_reserv_reclaim_contig().
# e5b006ff	19-Mar-2008	Alan Cox <alc@FreeBSD.org>	Rename vm_pageq_requeue() to vm_page_requeue() on account of its recent migration to vm/vm_page.c.
# 1fa94a36	18-Mar-2008	Alan Cox <alc@FreeBSD.org>	Almost seven years ago, vm/vm_page.c was split into three parts: vm/vm_contig.c, vm/vm_page.c, and vm/vm_pageq.c. Today, vm/vm_pageq.c has withered to the point that it contains only four short functions, two of which are only used by vm/vm_page.c. Since I can't foresee any reason for vm/vm_pageq.c to grow, it is time to fold the remaining contents of vm/vm_pageq.c back into vm/vm_page.c. Add some comments. Rename one of the functions, vm_pageq_enqueue(), that is now static within vm/vm_page.c to vm_page_enqueue(). Eliminate PQ_MAXCOUNT as it no longer serves any purpose.
# 273bf93c	01-Jan-2008	Alan Cox <alc@FreeBSD.org>	Defer setting either PG_CACHED or PG_FREE until after the free page queues lock is acquired. Otherwise, the state of a reservation's pages' flags and its population count can be inconsistent. That could result in a page being freed twice. Reported by: kris
# f8a47341	29-Dec-2007	Alan Cox <alc@FreeBSD.org>	Add the superpage reservation system. This is "part 2 of 2" of the machine-independent support for superpages. (The earlier part was the rewrite of the physical memory allocator.) The remainder of the code required for superpages support is machine-dependent and will be added to the various pmap implementations at a later date. Initially, I am only supporting one large page size per architecture. Moreover, I am only enabling the reservation system on amd64. (In an emergency, it can be disabled by setting VM_NRESERVLEVELS to 0 in amd64/include/vmparam.h or your kernel configuration file.)
# e35395ce	20-Dec-2007	Alan Cox <alc@FreeBSD.org>	Modify vm_phys_unfree_page() so that it no longer requires the given page to be in the free lists. Instead, it now returns TRUE if it removed the page from the free lists and FALSE if the page was not in the free lists. This change is required to support superpage reservations. Specifically, once reservations are introduced, a cached page can either be in the free lists or a reservation.
# 03497757	18-Dec-2007	Alan Cox <alc@FreeBSD.org>	Eliminate redundant code from vm_page_startup().
# 21e10ad4	11-Dec-2007	Alan Cox <alc@FreeBSD.org>	Simplify vm_page_free_toq().
# b6408256	02-Dec-2007	Alan Cox <alc@FreeBSD.org>	Correct a comment.
# ddd6e7d2	21-Nov-2007	Alan Cox <alc@FreeBSD.org>	When reactivating a cached page, reset the page's pool to the default pool. (Not doing this before was a performance pessimization but not a cause for panic.)
# aefac177	05-Nov-2007	Konstantin Belousov <kib@FreeBSD.org>	The intent of the freeing the (zeroed) page in vm_page_cache() for default object rather than cache it was to have vm_pager_has_page(object, pindex, ...) == FALSE to imply that there is no cached page in object at pindex. This allows to avoid explicit checks for cached pages in vm_object_backing_scan(). For now, we need the same bandaid for the swap object, otherwise both the vm_page_lookup() and the pager can report that there is no page at offset, while page is stored in the cache. Also, this fixes another instance of the KASSERT("object type is incompatible") failure in the vm_page_cache_transfer(). Reported and tested by: Peter Holm Reviewed by: alc MFC after: 3 days
# 21f79586	26-Oct-2007	Alan Cox <alc@FreeBSD.org>	Change vm_page_cache_transfer() such that it does not transfer pages that would have an offset beyond the end of the target object. Such pages should remain in the source object. MFC after: 3 days Diagnosed and reviewed by: Kostik Belousov Reported and tested by: Peter Holm
# b8c50480	08-Oct-2007	Alan Cox <alc@FreeBSD.org>	In the rare case that vm_page_cache() actually frees the given page, it must first ensure that the page is no longer mapped. This is trivially accomplished by calling pmap_remove_all() a little earlier in vm_page_cache(). While I'm in the neighborbood, make a related panic message a little more useful. Approved by: re (kensmith) Reported by: Peter Holm and Konstantin Belousov Reviewed by: Konstantin Belousov
# dc9250f5	07-Oct-2007	Alan Cox <alc@FreeBSD.org>	Correct a lock assertion failure in sparc64's pmap_page_is_mapped() that is a consequence of sparc64/sparc64/vm_machdep.c revision 1.76. It occurs when uma_small_free() frees a page. The solution has two parts: (1) Mark pages allocated with VM_ALLOC_NOOBJ as PG_UNMANAGED. (2) Defer the lock assertion in pmap_page_is_mapped() until after PG_UNMANAGED is tested. This is safe because both PG_UNMANAGED and PG_FICTITIOUS are immutable flags, i.e., they do not change state between the time that a page is allocated and freed. Approved by: re (kensmith) PR: 116794
# c9444914	26-Sep-2007	Alan Cox <alc@FreeBSD.org>	Correct an error of omission in the reimplementation of the page cache: vm_object_page_remove() should convert any cached pages that fall with the specified range to free pages. Otherwise, there could be a problem if a file is first truncated and then regrown. Specifically, some old data from prior to the truncation might reappear. Generalize vm_page_cache_free() to support the conversion of either a subset or the entirety of an object's cached pages. Reported by: tegge Reviewed by: tegge Approved by: re (kensmith)
# 7bfda801	25-Sep-2007	Alan Cox <alc@FreeBSD.org>	Change the management of cached pages (PQ_CACHE) in two fundamental ways: (1) Cached pages are no longer kept in the object's resident page splay tree and memq. Instead, they are kept in a separate per-object splay tree of cached pages. However, access to this new per-object splay tree is synchronized by the _free_ page queues lock, not to be confused with the heavily contended page queues lock. Consequently, a cached page can be reclaimed by vm_page_alloc(9) without acquiring the object's lock or the page queues lock. This solves a problem independently reported by tegge@ and Isilon. Specifically, they observed the page daemon consuming a great deal of CPU time because of pages bouncing back and forth between the cache queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of this problem turned out to be a deadlock avoidance strategy employed when selecting a cached page to reclaim in vm_page_select_cache(). However, the root cause was really that reclaiming a cached page required the acquisition of an object lock while the page queues lock was already held. Thus, this change addresses the problem at its root, by eliminating the need to acquire the object's lock. Moreover, keeping cached pages in the object's primary splay tree and memq was, in effect, optimizing for the uncommon case. Cached pages are reclaimed far, far more often than they are reactivated. Instead, this change makes reclamation cheaper, especially in terms of synchronization overhead, and reactivation more expensive, because reactivated pages will have to be reentered into the object's primary splay tree and memq. (2) Cached pages are now stored alongside free pages in the physical memory allocator's buddy queues, increasing the likelihood that large allocations of contiguous physical memory (i.e., superpages) will succeed. Finally, as a result of this change long-standing restrictions on when and where a cached page can be reclaimed and returned by vm_page_alloc(9) are eliminated. Specifically, calls to vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and return a formerly cached page. Consequently, a call to malloc(9) specifying M_NOWAIT is less likely to fail. Discussed with: many over the course of the summer, including jeff@, Justin Husted @ Isilon, peter@, tegge@ Tested by: an earlier version by kris@ Approved by: re (kensmith)
# eaa29f1c	27-Jul-2007	Alan Cox <alc@FreeBSD.org>	Add a counter for the total number of pages cached and support for reporting the value of this counter in the program "vmstat". Approved by: re (rwatson)
# 8941dc44	14-Jul-2007	Alan Cox <alc@FreeBSD.org>	Eliminate two unused functions: vm_phys_alloc_pages() and vm_phys_free_pages(). Rename vm_phys_alloc_pages_locked() to vm_phys_alloc_pages() and vm_phys_free_pages_locked() to vm_phys_free_pages(). Add comments regarding the need for the free page queues lock to be held by callers to these functions. No functional changes. Approved by: re (hrs)
# 20dd22a2	10-Jul-2007	Alan Cox <alc@FreeBSD.org>	Correct a problem in the ZERO_COPY_SOCKETS option, specifically, in vm_page_cowfault(). Initially, if vm_page_cowfault() sleeps, the given page is wired, preventing it from being recycled. However, when transmission of the page completes, the page is unwired and returned to the page queues. At that point, the page is not in any special state that prevents it from being recycled. Consequently, vm_page_cowfault() should verify that the page is still held by the same vm object before retrying the replacement of the page. Note: The containing object is, however, safe from being recycled by virtue of having a non-zero paging-in-progress count. While I'm here, add some assertions and comments. Approved by: re (rwatson) MFC After: 3 weeks
# 0a49733c	16-Jun-2007	Matt Jacob <mjacob@FreeBSD.org>	Don't declare inline a function which isn't.
# bcc231ec	16-Jun-2007	Alan Cox <alc@FreeBSD.org>	If attempting to cache a "busy", panic instead of printing a diagnostic message and returning.
# 2446e4f0	15-Jun-2007	Alan Cox <alc@FreeBSD.org>	Enable the new physical memory allocator. This allocator uses a binary buddy system with a twist. First and foremost, this allocator is required to support the implementation of superpages. As a side effect, it enables a more robust implementation of contigmalloc(9). Moreover, this reimplementation of contigmalloc(9) eliminates the acquisition of Giant by contigmalloc(..., M_NOWAIT, ...). The twist is that this allocator tries to reduce the number of TLB misses incurred by accesses through a direct map to small, UMA-managed objects and page table pages. Roughly speaking, the physical pages that are allocated for such purposes are clustered together in the physical address space. The performance benefits vary. In the most extreme case, a uniprocessor kernel running on an Opteron, I measured an 18% reduction in system time during a buildworld. This allocator does not implement page coloring. The reason is that superpages have much the same effect. The contiguous physical memory allocation necessary for a superpage is inherently colored. Finally, the one caveat is that this allocator does not effectively support prezeroed pages. I hope this is temporary. On i386, this is a slight pessimization. However, on amd64, the beneficial effects of the direct-map optimization outweigh the ill effects. I speculate that this is true in general of machines with a direct map. Approved by: re
# 393a081d	10-Jun-2007	Attilio Rao <attilio@FreeBSD.org>	Optimize vmmeter locking. In particular: - Add an explicative table for locking of struct vmmeter members - Apply new rules for some of those members - Remove some unuseful comments Heavily reviewed by: alc, bde, jeff Approved by: jeff (mentor)
# b4b70819	04-Jun-2007	Attilio Rao <attilio@FreeBSD.org>	Do proper "locking" for missing vmmeters part. Now, we assume no more sched_lock protection for some of them and use the distribuited loads method for vmmeter (distribuited through CPUs). Reviewed by: alc, bde Approved by: jeff (mentor)
# 2feb50bf	31-May-2007	Attilio Rao <attilio@FreeBSD.org>	Revert VMCNT_* operations introduction. Probabilly, a general approach is not the better solution here, so we should solve the sched_lock protection problems separately. Requested by: alc Approved by: jeff (mentor)
# 80b200da	20-May-2007	Jeff Roberson <jeff@FreeBSD.org>	- rename VMCNT_DEC to VMCNT_SUB to reflect the count argument. Suggested by: julian@ Contributed by: attilio@
# 222d0195	18-May-2007	Jeff Roberson <jeff@FreeBSD.org>	- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating vmcnts. This can be used to abstract away pcpu details but also changes to use atomics for all counters now. This means sched lock is no longer responsible for protecting counts in the switch routines. Contributed by: Attilio Rao <attilio@FreeBSD.org>
# 04a18977	05-May-2007	Alan Cox <alc@FreeBSD.org>	Define every architecture as either VM_PHYSSEG_DENSE or VM_PHYSSEG_SPARSE depending on whether the physical address space is densely or sparsely populated with memory. The effect of this definition is to determine which of two implementations of vm_page_array and PHYS_TO_VM_PAGE() is used. The legacy implementation is obtained by defining VM_PHYSSEG_DENSE, and a new implementation that trades off time for space is obtained by defining VM_PHYSSEG_SPARSE. For now, all architectures except for ia64 and sparc64 define VM_PHYSSEG_DENSE. Defining VM_PHYSSEG_SPARSE on ia64 allows the entirety of my Itanium 2's memory to be used. Previously, only the first 1 GB could be used. Defining VM_PHYSSEG_SPARSE on sparc64 allows USIIIi-based systems to boot without crashing. This change is a combination of Nathan Whitehorn's patch and my own work in perforce. Discussed with: kmacy, marius, Nathan Whitehorn PR: 112194
# 9f5c801b	24-Feb-2007	Alan Cox <alc@FreeBSD.org>	Change the way that unmanaged pages are created. Specifically, immediately flag any page that is allocated to a OBJT_PHYS object as unmanaged in vm_page_alloc() rather than waiting for a later call to vm_page_unmanage(). This allows for the elimination of some uses of the page queues lock. Change the type of the kernel and kmem objects from OBJT_DEFAULT to OBJT_PHYS. This allows us to take advantage of the above change to simplify the allocation of unmanaged pages in kmem_alloc() and kmem_malloc(). Remove vm_page_unmanage(). It is no longer used.
# 711585d0	17-Feb-2007	Alan Cox <alc@FreeBSD.org>	Enable vm_page_free() and vm_page_free_zero() to be called on some pages without the page queues lock being held, specifically, pages that are not contained in a vm object and not a member of a page queue.
# ba000fb2	17-Feb-2007	Alan Cox <alc@FreeBSD.org>	Remove a stale comment. Add punctuation to a nearby comment.
# d3d029bd	14-Feb-2007	Alan Cox <alc@FreeBSD.org>	Relax the page queue lock assertions in vm_page_remove() and vm_page_free_toq() to account for recent changes that allow vm_page_free_toq() to be called on some pages without the page queues lock being held, specifically, pages that are not contained in a vm object and not a member of a page queue. (Examples of such pages include page table pages, pv entry pages, and uma small alloc pages.)
# 7d60988b	14-Feb-2007	Alan Cox <alc@FreeBSD.org>	Avoid the unnecessary acquisition of the free page queues lock when a page is actually being added to the hold queue, not the free queue. At the same time, avoid unnecessary tests to wake up threads waiting for free memory and the idle thread that zeroes free pages. (These tests will be performed later when the page finally moves from the hold queue to the free queue.)
# 5351a248	10-Feb-2007	Alan Cox <alc@FreeBSD.org>	Use the free page queue mutex instead of the page queue mutex to synchronize sleeping and waking of the zero idle thread.
# e9f995d8	06-Feb-2007	Alan Cox <alc@FreeBSD.org>	Change the pagedaemon, vm_wait(), and vm_waitpfault() to sleep on the vm page queue free mutex instead of the vm page queue mutex.
# 3ae3919d	04-Feb-2007	Alan Cox <alc@FreeBSD.org>	Change the free page queue lock from a spin mutex to a default (blocking) mutex. With the demise of Alpha support, there is no longer a reason for it to be a spin mutex.
# 35d10226	08-Dec-2006	Kip Macy <kmacy@FreeBSD.org>	Remove the requirement that phys_avail be sorted in ascending order by explicitly finding the lowest and highest addresses when calculating the size of the vm_pages array Reviewed by :alc
# 49c3b925	08-Nov-2006	Alan Cox <alc@FreeBSD.org>	I misplaced the assertion that was added to vm_page_startup() in the previous change. Correct its placement.
# 9ad3296a	08-Nov-2006	Alan Cox <alc@FreeBSD.org>	Simplify the construction of the free queues in vm_page_startup(). Add an assertion to test a hypothesis concerning other redundant computation in vm_page_startup().
# 2a53696f	22-Oct-2006	Alan Cox <alc@FreeBSD.org>	The page queues lock is no longer required by vm_page_busy() or vm_page_wakeup(). Reduce or eliminate its use accordingly.
# 9af80719	21-Oct-2006	Alan Cox <alc@FreeBSD.org>	Replace PG_BUSY with VPO_BUSY. In other words, changes to the page's busy flag, i.e., VPO_BUSY, are now synchronized by the per-vm object lock instead of the global page queues lock.
# a9a5d47c	28-Sep-2006	Ken Smith <kensmith@FreeBSD.org>	Fix two minor style(9) nits in v1.313 which were noticed during an MFC review. alc@ will be MFCing V1.313 plus style fix to RELENG_6.
# eb4bbba8	27-Aug-2006	Alan Cox <alc@FreeBSD.org>	Refactor vm_page_sleep_if_busy() so that the test for a busy page is inlined and a procedure call is made in the rare case, i.e., when it is necessary to sleep. In this case, inlining the test actually makes the kernel smaller.
# 4f9d17d8	20-Aug-2006	Alan Cox <alc@FreeBSD.org>	Page flags are reset on (re)allocation. There is no need to clear any flags except for PG_ZERO in vm_page_free_toq().
# b146f9e5	12-Aug-2006	Alan Cox <alc@FreeBSD.org>	Reimplement the page's NOSYNC flag as an object-synchronized instead of a page queues-synchronized flag. Reduce the scope of the page queues lock in vm_fault() accordingly. Move vm_fault()'s call to vm_object_set_writeable_dirty() outside of the scope of the page queues lock. Reviewed by: tegge Additionally, eliminate an unnecessary dereference in computing the argument that is passed to vm_object_set_writeable_dirty().
# 25017df4	11-Aug-2006	Alan Cox <alc@FreeBSD.org>	Ensure that the page's new field for object-synchronized flags is always initialized to zero. Call vm_page_sleep_if_busy() instead of duplicating its implementation in vm_page_grab().
# 75db2abb	09-Aug-2006	Alan Cox <alc@FreeBSD.org>	Change vm_page_cowfault() so that it doesn't allocate a pre-busied page.
# 5786be7c	09-Aug-2006	Alan Cox <alc@FreeBSD.org>	Introduce a field to struct vm_page for storing flags that are synchronized by the lock on the object containing the page. Transition PG_WANTED and PG_SWAPINPROG to use the new field, eliminating the need for holding the page queues lock when setting or clearing these flags. Rename PG_WANTED and PG_SWAPINPROG to VPO_WANTED and VPO_SWAPINPROG, respectively. Eliminate the assertion that the page queues lock is held in vm_page_io_finish(). Eliminate the acquisition and release of the page queues lock around calls to vm_page_io_finish() in kern_sendfile() and vfs_unbusy_pages().
# e74814b6	05-Aug-2006	Alan Cox <alc@FreeBSD.org>	Change vm_page_sleep_if_busy() so that it no longer requires the caller to hold the page queues lock.
# 91449ce9	03-Aug-2006	Alan Cox <alc@FreeBSD.org>	When sleeping on a busy page, use the lock from the containing object rather than the global page queues lock.
# 78985e42	01-Aug-2006	Alan Cox <alc@FreeBSD.org>	Complete the transition from pmap_page_protect() to pmap_remove_write(). Originally, I had adopted sparc64's name, pmap_clear_write(), for the function that is now pmap_remove_write(). However, this function is more like pmap_remove_all() than like pmap_clear_modify() or pmap_clear_reference(), hence, the name change. The higher-level rationale behind this change is described in src/sys/amd64/amd64/pmap.c revision 1.567. The short version is that I'm trying to clean up and fix our support for execute access. Reviewed by: marcel@ (ia64)
# af51d7bf	21-Jul-2006	Alan Cox <alc@FreeBSD.org>	Eliminate OBJ_WRITEABLE. It hasn't been used in a long time.
# 9bdaa433	23-Jun-2006	John Baldwin <jhb@FreeBSD.org>	Move the code to handle the vm.blacklist tunable up a layer into vm_page_startup(). As a result, we now only lookup the tunable once instead of looking it up once for every physical page of memory in the system. This cuts out about a 1 second or so delay in boot on x86 systems. The delay is much larger and more noticable on sun4v apparently. Reported by: kmacy MFC after: 1 week
# 4cbb1c1a	31-May-2006	Paul Saab <ps@FreeBSD.org>	Fix minidumps to include pages allocated via pmap_map on amd64. These pages are allocated from the direct map, and were not previous tracked. This included the vm_page_array and the early UMA bootstrap pages. Reviewed by: peter
# c0345a84	20-Apr-2006	Peter Wemm <peter@FreeBSD.org>	Introduce minidumps. Full physical memory crash dumps are still available via the debug.minidump sysctl and tunable. Traditional dumps store all physical memory. This was once a good thing when machines had a maximum of 64M of ram and 1GB of kvm. These days, machines often have many gigabytes of ram and a smaller amount of kvm. libkvm+kgdb don't have a way to access physical ram that is not mapped into kvm at the time of the crash dump, so the extra ram being dumped is mostly wasted. Minidumps invert the process. Instead of dumping physical memory in in order to guarantee that all of kvm's backing is dumped, minidumps instead dump only memory that is actively mapped into kvm. amd64 has a direct map region that things like UMA use. Obviously we cannot dump all of the direct map region because that is effectively an old style all-physical-memory dump. Instead, introduce a bitmap and two helper routines (dump_add_page(pa) and dump_drop_page(pa)) that allow certain critical direct map pages to be included in the dump. uma_machdep.c's allocator is the intended consumer. Dumps are a custom format. At the very beginning of the file is a header, then a copy of the message buffer, then the bitmap of pages present in the dump, then the final level of the kvm page table trees (2MB mappings are expanded into a 4K page mappings), then the sparse physical pages according to the bitmap. libkvm can now conveniently access the kvm page table entries. Booting my test 8GB machine, forcing it into ddb and forcing a dump leads to a 48MB minidump. While this is a best case, I expect minidumps to be in the 100MB-500MB range. Obviously, never larger than physical memory of course. minidumps are on by default. It would want be necessary to turn them off if it was necessary to debug corrupt kernel page table management as that would mess up minidumps as well. Both minidumps and regular dumps are supported on the same machine.
# 62a59e8f	07-Mar-2006	Warner Losh <imp@FreeBSD.org>	Remove leading __ from __(inline\|const\|signed\|volatile). They are obsolete. This should reduce diffs to NetBSD as well.
# 22440959	15-Feb-2006	Stephan Uphoff <ups@FreeBSD.org>	When the VM needs to allocated physical memory pages (for non interrupt use) and it has not plenty of free pages it tries to free pages in the cache queue. Unfortunately freeing a cached page requires the locking of the object that owns the page. However in the context of allocating pages we may not be able to lock the object and thus can only TRY to lock the object. If the locking try fails the cache page can not be freed and is activated to move it out of the way so that we may try to free other cache pages. If all pages in the cache belong to objects that are currently locked the cache queue can be emptied without freeing a single page. This scenario caused two problems: 1) vm_page_alloc always failed allocation when it tried freeing pages from the cache queue and failed to do so. However if there are more than cnt.v_interrupt_free_min pages on the free list it should return pages when requested with priority VM_ALLOC_SYSTEM. Failure to do so can cause resource exhaustion deadlocks. 2) Threads than need to allocate pages spend a lot of time cleaning up the page queue without really getting anything done while the pagedaemon needs to work overtime to refill the cache. This change fixes the first problem. (1) Reviewed by: tegge@
# 6c237adc	31-Jan-2006	Alan Cox <alc@FreeBSD.org>	Change #if defined(DIAGNOSTIC) to KASSERT.
# fc3c1bc4	24-Jan-2006	Alan Cox <alc@FreeBSD.org>	In vm_page_set_invalid() invalidate all of the page's mappings as soon as any part of the page's contents is invalidated. Submitted by: tegge
# ef39c05b	31-Dec-2005	Alexander Leidinger <netchild@FreeBSD.org>	MI changes: - provide an interface (macros) to the page coloring part of the VM system, this allows to try different coloring algorithms without the need to touch every file [1] - make the page queue tuning values readable: sysctl vm.stats.pagequeue - autotuning of the page coloring values based upon the cache size instead of options in the kernel config (disabling of the page coloring as a kernel option is still possible) MD changes: - detection of the cache size: only IA32 and AMD64 (untested) contains cache size detection code, every other arch just comes with a dummy function (this results in the use of default values like it was the case without the autotuning of the page coloring) - print some more info on Intel CPU's (like we do on AMD and Transmeta CPU's) Note to AMD owners (IA32 and AMD64): please run "sysctl vm.stats.pagequeue" and report if the cache* values are zero (= bug in the cache detection code) or not. Based upon work by: Chad David <davidc@acns.ab.ca> [1] Reviewed by: alc, arch (in 2004) Discussed with: alc, Chad David, arch (in 2004)
# 984922d7	13-Dec-2005	Alan Cox <alc@FreeBSD.org>	Assert that the page that is given to vm_page_free_toq() does not have any managed mappings.
# 7e9d9442	07-Nov-2005	Alan Cox <alc@FreeBSD.org>	If a physical page is mapped by two or more virtual addresses, transmitted by the zero-copy sockets method, and written to before the transmission completes, we need to destroy all of the existing mappings to the page, not just the one that we fault on. Otherwise, the mappings will no longer be to the same page and changes made through one of the mappings will not be visible through the others. Observed by: tegge
# 674b706e	31-Oct-2005	Alan Cox <alc@FreeBSD.org>	Consider the zero-copy transmission of a page that was wired by mlock(2). If a copy-on-write fault occurs on the page, the new copy should inherit a part of the original page's wire count. Submitted by: tegge MFC after: 1 week
# 3803b26b	08-Oct-2005	Dag-Erling Smørgrav <des@FreeBSD.org>	As alc pointed out to me, vm_page.c 1.305 was incomplete: uma_startup() still uses the constant UMA_BOOT_PAGES. Change it to accept boot_pages as an additional argument. MFC after: 2 weeks
# cfa22bcc	11-Aug-2005	Dag-Erling Smørgrav <des@FreeBSD.org>	Introduce the vm.boot_pages tunable and sysctl, which controls the number of pages reserved to bootstrap the kernel memory allocator. MFC after: 2 weeks
# 761dbeb6	15-Mar-2005	Jeff Roberson <jeff@FreeBSD.org>	- In vm_page_insert() hold the backing vnode when the first page is inserted. - In vm_page_remove() drop the backing vnode when the last page is removed. - Don't check the vnode to see if it must be reclaimed on every call to vm_page_free_toq() as we only check it now when it is actually required. This saves us two lock operations per call. Sponsored by: Isilon Systems, Inc.
# 46fbc582	06-Jan-2005	Alan Cox <alc@FreeBSD.org>	Transfer responsibility for freeing the page taken from the cache queue and (possibly) unlocking the containing object from vm_page_alloc() to vm_page_select_cache(). Recent optimizations to vm_map_pmap_enter() (see vm_map.c revisions 1.362 and 1.363) and pmap_enter_quick() have resulted in panic()s because vm_page_alloc() mistakenly unlocked objects that had not been locked by vm_page_select_cache(). Reported by: Peter Holm and Kris Kennaway
# 60727d8b	06-Jan-2005	Warner Losh <imp@FreeBSD.org>	/* -> /*- for license, minor formatting changes
# 0869d38b	31-Dec-2004	Alan Cox <alc@FreeBSD.org>	Assert that page allocations during an interrupt specify VM_ALLOC_INTERRUPT. Assert that pages removed from the cache queue are not busy.
# 7aa2190c	28-Dec-2004	Alan Cox <alc@FreeBSD.org>	Access to the page's busy field is (now) synchronized by the containing object's lock. Therefore, the assertion that the page queues lock is held can be removed from vm_page_io_start().
# 40198b3c	26-Dec-2004	Alan Cox <alc@FreeBSD.org>	Assert that the vm object is locked on entry to vm_page_sleep_if_busy(); remove some unneeded code.
# d19ef814	03-Nov-2004	Alan Cox <alc@FreeBSD.org>	The synchronization provided by vm object locking has eliminated the need for most calls to vm_page_busy(). Specifically, most calls to vm_page_busy() occur immediately prior to a call to vm_page_remove(). In such cases, the containing vm object is locked across both calls. Consequently, the setting of the vm page's PG_BUSY flag is not even visible to other threads that are following the synchronization protocol. This change (1) eliminates the calls to vm_page_busy() that immediately precede a call to vm_page_remove() or functions, such as vm_page_free() and vm_page_rename(), that call it and (2) relaxes the requirement in vm_page_remove() that the vm page's PG_BUSY flag is set. Now, the vm page's PG_BUSY flag is set only when the vm object lock is released while the vm page is still in transition. Typically, this is when it is undergoing I/O.
# f4d49654	27-Oct-2004	Alan Cox <alc@FreeBSD.org>	Assert that the containing vm object is locked in vm_page_cache() and vm_page_try_to_cache().
# 63bb7041	25-Oct-2004	Alan Cox <alc@FreeBSD.org>	Assert that the containing vm object is locked in vm_page_flash().
# 75d05338	24-Oct-2004	Alan Cox <alc@FreeBSD.org>	Assert that the containing vm object is locked in vm_page_busy() and vm_page_wakeup().
# 0f9f9bcb	24-Oct-2004	Alan Cox <alc@FreeBSD.org>	Introduce VM_ALLOC_NOBUSY, an option to vm_page_alloc() and vm_page_grab() that indicates that the caller does not want a page with its busy flag set. In many places, the global page queues lock is acquired and released just to clear the busy flag on a just allocated page. Both the allocation of the page and the clearing of the busy flag occur while the containing vm object is locked. So, the busy flag might as well never be set.
# 1e96d2a2	18-Oct-2004	Alan Cox <alc@FreeBSD.org>	Correct two errors in PG_BUSY management by vm_page_cowfault(). Both errors are in rarely executed paths. 1. Each time the retry_alloc path is taken, the PG_BUSY must be set again. Otherwise vm_page_remove() panics. 2. There is no need to set PG_BUSY on the newly allocated page before freeing it. The page already has PG_BUSY set by vm_page_alloc(). Setting it again could cause an assertion failure. MFC after: 2 weeks
# 36aeb90e	17-Oct-2004	Alan Cox <alc@FreeBSD.org>	Assert that the containing object is locked in vm_page_io_start() and vm_page_io_finish(). The motivation being to transition synchronization of the vm_page's busy field from the global page queues lock to the per-object lock.
# 7ce1979b	14-Sep-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Add new a function isa_dma_init() which returns an errno when it fails and which takes a M_WAITOK/M_NOWAIT flag argument. Add compatibility isa_dmainit() macro which whines loudly if isa_dma_init() fails. Problem uncovered by: tegge
# a0879143	29-Jul-2004	Alan Cox <alc@FreeBSD.org>	Advance the state of pmap locking on alpha, amd64, and i386. - Enable recursion on the page queues lock. This allows calls to vm_page_alloc(VM_ALLOC_NORMAL) and UMA's obj_alloc() with the page queues lock held. Such calls are made to allocate page table pages and pv entries. - The previous change enables a partial reversion of vm/vm_page.c revision 1.216, i.e., the call to vm_page_alloc() by vm_page_cowfault() now specifies VM_ALLOC_NORMAL rather than VM_ALLOC_INTERRUPT. - Add partial locking to pmap_copy(). (As a side-effect, pmap_copy() should now be faster on i386 SMP because it no longer generates IPIs for TLB shootdown on the other processors.) - Complete the locking of pmap_enter() and pmap_enter_quick(). (As of now, all changes to a user-level pmap on alpha, amd64, and i386 are performed with appropriate locking.)
# d951b752	21-Jul-2004	Brian Feldman <green@FreeBSD.org>	Fix a race in vm_page_sleep_if_busy(). Due to vm_object locking being incomplete, it currently has to know how to drop and pick back up the vm_object's mutex if it has to sleep and drop the page queue mutex. The problem with this is that if the page is busy, while we are sleeping, the page can be freed and object disappear. When trying to lock m->object, we'd get a stale or NULL pointer and crash. The object is now cached, but this makes the assumption that the object is referenced in some manner and will not itself disappear while it is unlocked. Since this only happens if the object is locked, I had to remove an assumption earlier in contigmalloc() that reversed the order of locking the object and doing vm_page_sleep_if_busy(), not the normal order.
# e832aafc	19-Jul-2004	Alan Cox <alc@FreeBSD.org>	- Eliminate the pte object from the pmap. Instead, page table pages are allocated as "no object" pages. Similar changes were made to the amd64 and i386 pmap last year. The primary reason being that maintaining a pte object leads to lock order violations. A secondary reason being that the pte object is redundant, i.e., the page table itself can be used to lookup page table pages. (Historical note: The pte object predates our ability to allocate "no object" pages. Thus, the pte object was a necessary evil.) - Unconditionally check the vm object lock's status in vm_page_remove(). Previously, this assertion could not be made on Alpha due to its use of a pte object.
# 790bdd0f	10-Jul-2004	Alan Cox <alc@FreeBSD.org>	Increase the scope of the page queues lock in vm_page_alloc() to cover a diagnostic check that accesses the cache queue count.
# 0a2df477	18-Jun-2004	Alan Cox <alc@FreeBSD.org>	Remove spl() calls. Update comments to reflect the removal of spl() calls. Remove '\n' from panic() format strings. Remove some blank lines.
# d45f21f3	17-Jun-2004	Alan Cox <alc@FreeBSD.org>	Do not preset PG_BUSY on VM_ALLOC_NOOBJ pages. Such pages are not accessible through an object. Thus, PG_BUSY serves no purpose.
# 4be14af9	21-May-2004	Alan Cox <alc@FreeBSD.org>	To date, unwiring a fictitious page has produced a panic. The reason being that PHYS_TO_VM_PAGE() returns the wrong vm_page for fictitious pages but unwiring uses PHYS_TO_VM_PAGE(). The resulting panic reported an unexpected wired count. Rather than attempting to fix PHYS_TO_VM_PAGE(), this fix takes advantage of the properties of fictitious pages. Specifically, fictitious pages will never be completely unwired. Therefore, we can keep a fictitious page's wired count forever set to one and thereby avoid the use of PHYS_TO_VM_PAGE() when we know that we're working with a fictitious page, just not which one. In collaboration with: green@, tegge@ PR: kern/29915
# 1bb816d3	11-May-2004	Alan Cox <alc@FreeBSD.org>	Restructure vm_page_select_cache() so that adding assertions is easy. Some of the conditions that caused vm_page_select_cache() to deactivate a page were wrong. For example, deactivating an unmanaged or wired page is a nop. Thus, if vm_page_select_cache() had ever encountered an unmanaged or wired page, it would have looped forever. Now, we assert that the page is neither unmanaged nor wired.
# 3f39cca9	08-May-2004	Alan Cox <alc@FreeBSD.org>	Cache queue pages are not mapped. Thus, the pmap_remove_all() by vm_page_alloc() is unnecessary.
# 2ec91846	24-Apr-2004	Alan Cox <alc@FreeBSD.org>	Update the comment describing vm_page_grab() to reflect the previous revision and correct some of its style errors.
# 7ef6ba5d	24-Apr-2004	Alan Cox <alc@FreeBSD.org>	Push down the responsibility for zeroing a physical page from the caller to vm_page_grab(). Although this gives VM_ALLOC_ZERO a different meaning for vm_page_grab() than for vm_page_alloc(), I feel such change is necessary to accomplish other goals. Specifically, I want to make the PG_ZERO flag immutable between the time it is allocated by vm_page_alloc() and freed by vm_page_free() or vm_page_free_zero() to avoid locking overheads. Once we gave up on the ability to automatically recognize a zeroed page upon entry to vm_page_free(), the ability to mutate the PG_ZERO flag became useless. Instead, I would like to say that "Once a page becomes valid, its PG_ZERO flag must be ignored."
# 05eb3785	06-Apr-2004	Warner Losh <imp@FreeBSD.org>	Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999. Approved by: core
# 889eb0fc	04-Apr-2004	Alan Cox <alc@FreeBSD.org>	Eliminate unused arguments from vm_page_startup().
# ca3b4477	02-Mar-2004	Alan Cox <alc@FreeBSD.org>	Modify contigmalloc1() so that the free page queues lock is not held when vm_page_free() is called. The problem with holding this lock is that it is a spin lock and vm_page_free() may attempt the acquisition of a different default-type lock.
# 0f75a977	19-Feb-2004	Alan Cox <alc@FreeBSD.org>	- Correct a long-standing race condition in vm_page_try_to_free() that could result in a dirty page being unintentionally freed. - Simplify the dirty page check in vm_page_dontneed(). Reviewed by: tegge MFC after: 7 days
# 84d98bf6	14-Feb-2004	Alan Cox <alc@FreeBSD.org>	- Correct a long-standing race condition in vm_page_try_to_cache() that could result in a panic "vm_page_cache: caching a dirty page, ...": Access to the page must be restricted or removed before calling vm_page_cache(). This race condition is identical in nature to that which was addressed by vm_pageout.c's revision 1.251. - Simplify the code surrounding the fix to this same race condition in vm_pageout.c's revision 1.251. There should be no behavioral change. Reviewed by: tegge MFC after: 7 days
# 65bae14d	08-Jan-2004	Alan Cox <alc@FreeBSD.org>	- Enable recursive acquisition of the mutex synchronizing access to the free pages queue. This is presently needed by contigmalloc1(). - Move a sanity check against attempted double allocation of two pages to the same vm object offset from vm_page_alloc() to vm_page_insert(). This provides better protection because double allocation could occur through a direct call to vm_page_insert(), such as that by vm_page_rename(). - Modify contigmalloc1() to hold the mutex synchronizing access to the free pages queue while it scans vm_page_array in search of free pages. - Correct a potential leak of pages by contigmalloc1() that I introduced in revision 1.20: We must convert all cache queue pages to free pages before we begin removing free pages from the free queue. Otherwise, if we have to restart the scan because we are unable to acquire the vm object lock that is necessary to convert a cache queue page to a free page, we leak those free pages already removed from the free queue.
# 4804edb4	31-Dec-2003	Alan Cox <alc@FreeBSD.org>	In vm_page_lookup() check the root of the vm object's splay tree for the desired page before calling vm_page_splay().
# bcdaad7f	30-Dec-2003	Alan Cox <alc@FreeBSD.org>	Simplify vm_page_grab(): Don't bother with the generation check. If the vm object hasn't changed, the desired page will be at or near the root of the vm object's splay tree, making vm_page_lookup() cheap. (The only lock required for vm_page_lookup() is already held.) If, however, the vm object has changed and retry was requested, eliminating the generation check also eliminates a pointless acquisition and release of the page queues lock.
# 9582cd94	21-Dec-2003	Alan Cox <alc@FreeBSD.org>	- Create an unmapped guard page to trap access to vm_page_array[-1]. This guard page would have trapped the problems with the MFC of the PAE support to RELENG_4 at an earlier point in the sequence of events. Submitted by: tegge
# de33bedd	31-Oct-2003	Alan Cox <alc@FreeBSD.org>	- Additional vm object locking in vm_object_split() - New vm object locking assertions in vm_page_insert() and vm_object_set_writeable_dirty()
# ab42316c	22-Oct-2003	Alan Cox <alc@FreeBSD.org>	- Retire vm_pageout_page_free(). Instead, use vm_page_select_cache() from vm_pageout_scan(). Rationale: I don't like leaving a busy page in the cache queue with neither the vm object nor the vm page queues lock held. - Assert that the page is active in vm_pageout_page_stats().
# 0d42c05f	21-Oct-2003	Alan Cox <alc@FreeBSD.org>	- Assert that the containing vm object is locked in vm_page_set_validclean(). (This function reads and modifies the vm page's valid field, which is synchronized by the lock on the containing vm object.)
# fee181a6	20-Oct-2003	Alan Cox <alc@FreeBSD.org>	- Remove some long unused code.
# 669890ea	07-Oct-2003	Alan Cox <alc@FreeBSD.org>	Retire vm_page_copy(). Its reason for being ended when peter@ modified pmap_copy_page() et al. to accept a vm_page_t rather than a physical address. Also, this change will facilitate locking access to the vm page's valid field.
# 5a3970fe	05-Oct-2003	Alan Cox <alc@FreeBSD.org>	Assert that the containing vm object's lock is held in vm_page_set_invalid().
# 874f526d	04-Oct-2003	Alan Cox <alc@FreeBSD.org>	Assert that the containing vm object's lock is held in vm_page_zero_invalid().
# bf0da100	04-Oct-2003	Alan Cox <alc@FreeBSD.org>	- Extend the scope the vm object lock to cover calls to vm_page_is_valid(). - Assert that the lock on the containing vm object is held in vm_page_is_valid().
# 50028aa7	27-Sep-2003	Alan Cox <alc@FreeBSD.org>	In vm_page_remove(), assert that the vm object is locked, unless an Alpha. (The Alpha still requires updates to its pmap.)
# 95aad59a	21-Sep-2003	Alan Cox <alc@FreeBSD.org>	Initialize the page's pindex field even for VM_ALLOC_NOOBJ allocations. (This field is useful for implementing sanity checks even if the page does not belong to an object.)
# 2370c6d4	28-Aug-2003	Alan Cox <alc@FreeBSD.org>	Recent pmap changes permit the use of a more precise locking assertion in vm_page_lookup().
# 529e15ed	23-Aug-2003	Alan Cox <alc@FreeBSD.org>	Held pages, just like wired pages, should not be added to the cache queues. Submitted by: tegge
# b7ad744d	23-Aug-2003	Alan Cox <alc@FreeBSD.org>	Hold the page queues lock when performing vm_page_clear_dirty() and vm_page_set_invalid().
# 0f132ba6	21-Aug-2003	Alan Cox <alc@FreeBSD.org>	Assert that the vm object's lock is held on entry to vm_page_grab(); remove code from this function that was needed when vm object locking was incomplete.
# 891c1d4b	20-Aug-2003	Alan Cox <alc@FreeBSD.org>	Assert that the vm object lock is held in vm_page_alloc().
# c53e8c56	01-Jul-2003	Alan Cox <alc@FreeBSD.org>	Modify vm_page_alloc() and vm_page_select_cache() to allow the page that is returned by vm_page_select_cache() to belong to the object that is already locked by the caller to vm_page_alloc().
# baaaadf1	28-Jun-2003	Alan Cox <alc@FreeBSD.org>	- Use an int rather than a vm_pindex_t to represent the desired page color in vm_page_alloc(). (This also has small performance benefits.) - Eliminate vm_page_select_free(); vm_page_alloc() might as well call vm_pageq_find() directly.
# 9f2b1758	26-Jun-2003	Alan Cox <alc@FreeBSD.org>	vm_page_select_cache() enforces a number of conditions on the returned page. Add the ability to lock the containing object to those conditions.
# f29ba63e	22-Jun-2003	Alan Cox <alc@FreeBSD.org>	Maintain a lock on the vm object of interest throughout vm_fault(), releasing the lock only if we are about to sleep (e.g., vm_pager_get_pages() or vm_pager_has_pages()). If we sleep, we have marked the vm object with the paging-in-progress flag.
# 37681d86	18-Jun-2003	Alan Cox <alc@FreeBSD.org>	Assert that the vm object is locked in vm_page_try_to_free().
# 874651b1	11-Jun-2003	David E. O'Brien <obrien@FreeBSD.org>	Use __FBSDID().
# 36d1fdf5	07-Jun-2003	Alan Cox <alc@FreeBSD.org>	Teach vm_page_grab() how to handle the vm object's lock.
# 5299887d	25-Apr-2003	Alan Cox <alc@FreeBSD.org>	- Relax the Giant required in vm_page_remove(). - Remove the Giant required from vm_page_free_toq(). (Any locking errors will be caught by vm_page_remove().) This remedies a panic that occurred when kmem_malloc(NOWAIT) performed without Giant failed to allocate the necessary pages. Reported by: phk
# 2e9d00a1	22-Apr-2003	Alan Cox <alc@FreeBSD.org>	Revision 1.246 should have also included - Weaken the assertion in vm_page_insert() to require Giant only if the vm_object isn't locked. Reported by: "Ilmar S. Habibulin" <ilmar@watson.org>
# 03d4c1e6	21-Apr-2003	Alan Cox <alc@FreeBSD.org>	Revision 1.52 of vm/uma_core.c has led to UMA's obj_alloc() being called without Giant; and obj_alloc() in turn calls vm_page_alloc() without Giant. This causes an assertion failure in vm_page_alloc(). Fortunately, obj_alloc() is now MPSAFE. So, we need only clean up some assertions. - Weaken the assertion in vm_page_lookup() to require Giant only if the vm_object isn't locked. - Remove an assertion from vm_page_alloc() that duplicates a check performed in vm_page_lookup(). In collaboration with: gallatin, jake, jeff
# d8fed0f0	10-Apr-2003	John Baldwin <jhb@FreeBSD.org>	- Kill the pv_flags member of the alpha mdpage since it stop being used in rev 1.61 of pmap.c. - Now that pmap_page_is_free() is empty and since it is just a hack for the Alpha pmap, remove it.
# 227f9a1c	24-Mar-2003	Jake Burkholder <jake@FreeBSD.org>	- Add vm_paddr_t, a physical address type. This is required for systems where physical addresses larger than virtual addresses, such as i386s with PAE. - Use this to represent physical addresses in the MI vm system and in the i386 pmap code. This also changes the paddr parameter to d_mmap_t. - Fix printf formats to handle physical addresses >4G in the i386 memory detection code, and due to kvtop returning vm_paddr_t instead of u_long. Note that this is a name change only; vm_paddr_t is still the same as vm_offset_t on all currently supported platforms. Sponsored by: DARPA, Network Associates Laboratories Discussed with: re, phk (cdevsw change)
# dab392a4	18-Mar-2003	Maxime Henrion <mux@FreeBSD.org>	Remove an empty comment.
# 9f77ba59	16-Mar-2003	Jake Burkholder <jake@FreeBSD.org>	Subtract the memory that backs the vm_page structures from phys_avail after mapping it. This makes it possible to determine if a physical page has a backing vm_page or not.
# 1a1e9f41	01-Mar-2003	Alan Cox <alc@FreeBSD.org>	Teach vm_page_sleep_if_busy() to release the vm_object lock before sleeping.
# 3fa24ec9	24-Feb-2003	Alan Cox <alc@FreeBSD.org>	In vm_page_dirty(), assert that the page is not in the free queue(s).
# e6f2748c	01-Feb-2003	Alan Cox <alc@FreeBSD.org>	- Convert the tsleep()s in vm_wait() and vm_waitpfault() to msleep()s with the page queue lock. - Assert that the page queue lock is held in vm_page_free_wakeup().
# 28ec30cd	20-Jan-2003	Alan Cox <alc@FreeBSD.org>	- Hold the page queues lock around vm_page_hold(). - Assert that the page queues lock rather than Giant is held in vm_page_hold().
# b0ef8c5f	13-Jan-2003	Alan Cox <alc@FreeBSD.org>	- Update vm_pageout_deficit using atomic operations. It's a simple counter outside the scope of existing locks. - Eliminate a redundant clearing of vm_pageout_deficit.
# a15700fe	12-Jan-2003	Alan Cox <alc@FreeBSD.org>	Make vm_page_alloc() return PG_ZERO only if VM_ALLOC_ZERO is specified. The objective being to eliminate some cases of page queues locking. (See, for example, vm/vm_fault.c revision 1.160.) Reviewed by: tegge (Also, pointed out by tegge that I changed vm_fault.c before changing vm_page.c. Oops.)
# b5dc8305	11-Jan-2003	Alan Cox <alc@FreeBSD.org>	In vm_page_alloc(), fuse two if statements that are conditioned on the same expression.
# 9a032278	08-Jan-2003	Alan Cox <alc@FreeBSD.org>	In vm_page_alloc(), honor VM_ALLOC_ZERO for system and interrupt class requests when the number of free pages is below the reserved threshold. Previously, VM_ALLOC_ZERO was only honored when the number of free pages was above the reserved threshold. Honoring it in all cases generally makes sense, does no harm, and simplifies the code.
# 6c4952c7	04-Jan-2003	Alan Cox <alc@FreeBSD.org>	Use atomic add and subtract to update the global wired page count, cnt.v_wire_count.
# 009f3e7a	04-Jan-2003	Alan Cox <alc@FreeBSD.org>	Refine the assertions in vm_page_alloc().
# d61e1287	01-Jan-2003	Alan Cox <alc@FreeBSD.org>	Update the assertions in vm_page_insert() and vm_page_lookup() to reflect locking of the kmem_object.
# a28cc55e	29-Dec-2002	Alan Cox <alc@FreeBSD.org>	Reduce the number of times that we acquire and release the page queues lock by making vm_page_rename()'s caller, rather than vm_page_rename(), responsible for acquiring it.
# 2ee5fea7	28-Dec-2002	Alan Cox <alc@FreeBSD.org>	Assert that the page queues lock rather than Giant is held in vm_page_flag_clear().
# 24c9ad6b	19-Dec-2002	Alan Cox <alc@FreeBSD.org>	- Remove vm_page_sleep_busy(). The transition to vm_page_sleep_if_busy(), which incorporates page queue and field locking, is complete. - Assert that the page queue lock rather than Giant is held in vm_page_flag_set().
# 495bedfb	14-Dec-2002	Alan Cox <alc@FreeBSD.org>	Assert that the page queues lock is held in vm_page_unhold(), vm_page_remove(), and vm_page_free_toq().
# 178949e0	23-Nov-2002	Alan Cox <alc@FreeBSD.org>	Hold the page queues/flags lock when calling vm_page_set_validclean(). Approved by: re
# a12cc0e4	17-Nov-2002	Alan Cox <alc@FreeBSD.org>	Remove vm_page_protect(). Instead, use pmap_page_protect() directly.
# 4fec79be	16-Nov-2002	Alan Cox <alc@FreeBSD.org>	Now that pmap_remove_all() is exported by our pmap implementations use it directly.
# d154fb4f	10-Nov-2002	Alan Cox <alc@FreeBSD.org>	When prot is VM_PROT_NONE, call pmap_page_protect() directly rather than indirectly through vm_page_protect(). The one remaining page flag that is updated by vm_page_protect() is already being updated by our various pmap implementations. Note: A later commit will similarly change the VM_PROT_READ case and eliminate vm_page_protect().
# 1f7c5f98	09-Nov-2002	Alan Cox <alc@FreeBSD.org>	In vm_page_remove(), avoid calling vm_page_splay() if the object's memq is empty.
# ada2a050	04-Nov-2002	Alan Cox <alc@FreeBSD.org>	Export the function vm_page_splay().
# c71f01af	03-Nov-2002	Alan Cox <alc@FreeBSD.org>	- Remove the memory allocation for the object/offset hash table because it's no longer used. (See revision 1.215.) - Fix a harmless bug: the number of vm_page structures allocated wasn't properly adjusted when uma_bootstrap() was introduced. Consequently, we were allocating 30 unused vm_page structures. - Wrap a long line.
# 02af9de6	02-Nov-2002	Alan Cox <alc@FreeBSD.org>	Remove the vm page buckets mutex. As of revision 1.215 of vm/vm_page.c, it is unused.
# 026aa839	31-Oct-2002	Jeff Roberson <jeff@FreeBSD.org>	- Add a new flag to vm_page_alloc, VM_ALLOC_NOOBJ. This tells vm_page_alloc not to insert this page into an object. The pindex is still used for colorization. - Rework vm_page_select_* to accept a color instead of an object and pindex to work with VM_PAGE_NOOBJ. - Document other VM_ALLOC_ flags. Reviewed by: peter, jake
# f3b676f0	20-Oct-2002	Alan Cox <alc@FreeBSD.org>	o Reinline vm_page_undirty(), reducing the kernel size. (This reverts a part of vm_page.h revision 1.87 and vm_page.c revision 1.167.)
# f4ecdf05	19-Oct-2002	Alan Cox <alc@FreeBSD.org>	Complete the page queues locking needed for the page-based copy- on-write (COW) mechanism. (This mechanism is used by the zero-copy TCP/IP implementation.) - Extend the scope of the page queues lock in vm_fault() to cover vm_page_cowfault(). - Modify vm_page_cowfault() to release the page queues lock if it sleeps.
# b86ec922	18-Oct-2002	Matthew Dillon <dillon@FreeBSD.org>	Replace the vm_page hash table with a per-vmobject splay tree. There should be no major change in performance from this change at this time but this will allow other work to progress: Giant lock removal around VM system in favor of per-object mutexes, ranged fsyncs, more optimal COMMIT rpc's for NFS, partial filesystem syncs by the syncer, more optimal object flushing, etc. Note that the buffer cache is already using a similar splay tree mechanism. Note that a good chunk of the old hash table code is still in the tree. Alan or I will remove it prior to the release if the new code does not introduce unsolvable bugs, else we can revert more easily. Submitted by: alc (this is Alan's code) Approved by: re
# 8a59b15c	01-Sep-2002	Alan Cox <alc@FreeBSD.org>	o Synchronize updates to struct vm_page::cow with the page queues lock.
# fff6062a	24-Aug-2002	Alan Cox <alc@FreeBSD.org>	o Retire vm_page_zero_fill() and vm_page_zero_fill_area(). Ever since pmap_zero_page() and pmap_zero_page_area() were modified to accept a struct vm_page * instead of a physical address, vm_page_zero_fill() and vm_page_zero_fill_area() have served no purpose.
# 60582cbe	10-Aug-2002	Alan Cox <alc@FreeBSD.org>	o Assert that the page queues lock is held in vm_page_activate().
# db44450b	10-Aug-2002	Alan Cox <alc@FreeBSD.org>	o Remove the setting and clearing of the PG_MAPPED flag. (This flag is obsolete.)
# 06ec58b7	08-Aug-2002	Alan Cox <alc@FreeBSD.org>	o Use pmap_page_is_mapped() in vm_page_protect() rather than the PG_MAPPED flag. (This is the only place in the entire kernel where the PG_MAPPED flag is tested. It will be removed soon.)
# 24c28f1a	04-Aug-2002	Alan Cox <alc@FreeBSD.org>	o Acquire the page queues lock before checking the page's busy status in vm_page_grab(). Also, replace the nearby tsleep() with an msleep() on the page queues lock.
# e6e370a7	04-Aug-2002	Jeff Roberson <jeff@FreeBSD.org>	- Replace v_flag with v_iflag and v_vflag - v_vflag is protected by the vnode lock and is used when synchronization with VOP calls is needed. - v_iflag is protected by interlock and is used for dealing with vnode management issues. These flags include X/O LOCK, FREE, DOOMED, etc. - All accesses to v_iflag and v_vflag have either been locked or marked with mp_fixme's. - Many ASSERT_VOP_LOCKED calls have been added where the locking was not clear. - Many functions in vfs_subr.c were restructured to provide for stronger locking. Idea stolen from: BSD/OS
# aa9b1d94	02-Aug-2002	Alan Cox <alc@FreeBSD.org>	o Remove the setting of PG_MAPPED from vm_page_wire() and vm_page_alloc(VM_ALLOC_WIRED).
# 1e7ce68f	01-Aug-2002	Alan Cox <alc@FreeBSD.org>	o Lock page queue accesses in nwfs and smbfs. o Assert that the page queues lock is held in vm_page_deactivate().
# 46086ddf	01-Aug-2002	Alan Cox <alc@FreeBSD.org>	o Acquire the page queues lock before calling vm_page_io_finish(). o Assert that the page queues lock is held in vm_page_io_finish().
# 67c1fae9	31-Jul-2002	Alan Cox <alc@FreeBSD.org>	o Lock page accesses by vm_page_io_start() with the page queues lock. o Assert that the page queues lock is held in vm_page_io_start().
# e5f8bd94	29-Jul-2002	Alan Cox <alc@FreeBSD.org>	o Introduce vm_page_sleep_if_busy() as an eventual replacement for vm_page_sleep_busy(). vm_page_sleep_if_busy() uses the page queues lock.
# 2c071f61	28-Jul-2002	Alan Cox <alc@FreeBSD.org>	o Modify vm_page_grab() to accept VM_ALLOC_WIRED.
# 2999e9fa	22-Jul-2002	Alan Cox <alc@FreeBSD.org>	o Lock page queue accesses by vm_page_dontneed(). o Assert that the page queue lock is held in vm_page_dontneed().
# 40eab1e9	20-Jul-2002	Alan Cox <alc@FreeBSD.org>	o Lock page queue accesses by vm_page_try_to_cache(). (The accesses in kern/vfs_bio.c are already locked.) o Assert that the page queues lock is held in vm_page_try_to_cache().
# d82efd29	20-Jul-2002	Alan Cox <alc@FreeBSD.org>	o Assert that the page queues lock is held in vm_page_try_to_free().
# 15a5d210	20-Jul-2002	Alan Cox <alc@FreeBSD.org>	o Lock page queue accesses by vm_page_cache() in vm_fault() and vm_pageout_scan(). (The others are already locked.) o Assert that the page queues lock is held in vm_page_cache().
# eeeaf0fd	18-Jul-2002	Alan Cox <alc@FreeBSD.org>	o Duplicate an odd side-effect of vm_page_wire() in vm_page_allocate() when VM_ALLOC_WIRED is specified: set the PG_MAPPED bit in flags. o In both vm_page_wire() and vm_page_allocate() add a comment saying that setting PG_MAPPED does not belong there.
# 827b2fa0	17-Jul-2002	Alan Cox <alc@FreeBSD.org>	o Introduce an argument, VM_ALLOC_WIRED, that requests vm_page_alloc() to return a wired page. o Use VM_ALLOC_WIRED within Alpha's pmap_growkernel(). Also, because Alpha's pmap_growkernel() calls vm_page_alloc() from within a critical section, specify VM_ALLOC_INTERRUPT instead of VM_ALLOC_SYSTEM. (Only VM_ALLOC_INTERRUPT is implemented entirely with a spin mutex.) o Assert that the page queues mutex is held in vm_page_wire() on Alpha, just like the other platforms.
# 8b8b8202	14-Jul-2002	Alan Cox <alc@FreeBSD.org>	o Lock page queue accesses by vm_page_wire() that aren't within a critical section. o Assert that the page queues lock is held in vm_page_wire() unless an Alpha.
# eed6f3fd	13-Jul-2002	Alan Cox <alc@FreeBSD.org>	o Lock page queue accesses by vm_page_unmanage(). o Assert that the page queues lock is held in vm_page_unmanage().
# 1f545269	13-Jul-2002	Alan Cox <alc@FreeBSD.org>	o Complete the locking of page queue accesses by vm_page_unwire(). o Assert that the page queues lock is held in vm_page_unwire(). o Make vm_page_lock_queues() and vm_page_unlock_queues() visible to kernel loadable modules.
# f784043a	05-Jul-2002	Andrew Gallatin <gallatin@FreeBSD.org>	Remove bogus vm_page_wakeup() in vm_page_cowfault() that will cause panics in the zero-copy send path if a process attempts to write to a page which is still in flight. reviewed by: ken
# 70c17636	04-Jul-2002	Alan Cox <alc@FreeBSD.org>	o Resurrect vm_page_lock_queues(), vm_page_unlock_queues(), and the free queue lock (revision 1.33 of vm/vm_page.c removed them). o Make the free queue lock a spin lock because it's sometimes acquired inside of a critical section.
# 98cb733c	25-Jun-2002	Kenneth D. Merry <ken@FreeBSD.org>	At long last, commit the zero copy sockets code. MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes. ti.4: Update the ti(4) man page to include information on the TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options, and also include information about the new character device interface and the associated ioctls. man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated links. jumbo.9: New man page describing the jumbo buffer allocator interface and operation. zero_copy.9: New man page describing the general characteristics of the zero copy send and receive code, and what an application author should do to take advantage of the zero copy functionality. NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS, TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT. conf/files: Add uipc_jumbo.c and uipc_cow.c. conf/options: Add the 5 options mentioned above. kern_subr.c: Receive side zero copy implementation. This takes "disposable" pages attached to an mbuf, gives them to a user process, and then recycles the user's page. This is only active when ZERO_COPY_SOCKETS is turned on and the kern.ipc.zero_copy.receive sysctl variable is set to 1. uipc_cow.c: Send side zero copy functions. Takes a page written by the user and maps it copy on write and assigns it kernel virtual address space. Removes copy on write mapping once the buffer has been freed by the network stack. uipc_jumbo.c: Jumbo disposable page allocator code. This allocates (optionally) disposable pages for network drivers that want to give the user the option of doing zero copy receive. uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are enabled if ZERO_COPY_SOCKETS is turned on. Add zero copy send support to sosend() -- pages get mapped into the kernel instead of getting copied if they meet size and alignment restrictions. uipc_syscalls.c:Un-staticize some of the sf* functions so that they can be used elsewhere. (uipc_cow.c) if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid calling malloc() with M_WAITOK. Return an error if the M_NOWAIT malloc fails. The ti(4) driver and the wi(4) driver, at least, call this with a mutex held. This causes witness warnings for 'ifconfig -a' with a wi(4) or ti(4) board in the system. (I've only verified for ti(4)). ip_output.c: Fragment large datagrams so that each segment contains a multiple of PAGE_SIZE amount of data plus headers. This allows the receiver to potentially do page flipping on receives. if_ti.c: Add zero copy receive support to the ti(4) driver. If TI_PRIVATE_JUMBOS is not defined, it now uses the jumbo(9) buffer allocator for jumbo receive buffers. Add a new character device interface for the ti(4) driver for the new debugging interface. This allows (a patched version of) gdb to talk to the Tigon board and debug the firmware. There are also a few additional debugging ioctls available through this interface. Add header splitting support to the ti(4) driver. Tweak some of the default interrupt coalescing parameters to more useful defaults. Add hooks for supporting transmit flow control, but leave it turned off with a comment describing why it is turned off. if_tireg.h: Change the firmware rev to 12.4.11, since we're really at 12.4.11 plus fixes from 12.4.13. Add defines needed for debugging. Remove the ti_stats structure, it is now defined in sys/tiio.h. ti_fw.h: 12.4.11 firmware. ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13, and my header splitting patches. Revision 12.4.13 doesn't handle 10/100 negotiation properly. (This firmware is the same as what was in the tree previously, with the addition of header splitting support.) sys/jumbo.h: Jumbo buffer allocator interface. sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to indicate that the payload buffer can be thrown away / flipped to a userland process. socketvar.h: Add prototype for socow_setup. tiio.h: ioctl interface to the character portion of the ti(4) driver, plus associated structure/type definitions. uio.h: Change prototype for uiomoveco() so that we'll know whether the source page is disposable. ufs_readwrite.c:Update for new prototype of uiomoveco(). vm_fault.c: In vm_fault(), check to see whether we need to do a page based copy on write fault. vm_object.c: Add a new function, vm_object_allocate_wait(). This does the same thing that vm_object allocate does, except that it gives the caller the opportunity to specify whether it should wait on the uma_zalloc() of the object structre. This allows vm objects to be allocated while holding a mutex. (Without generating WITNESS warnings.) vm_object_allocate() is implemented as a call to vm_object_allocate_wait() with the malloc flag set to M_WAITOK. vm_object.h: Add prototype for vm_object_allocate_wait(). vm_page.c: Add page-based copy on write setup, clear and fault routines. vm_page.h: Add page based COW function prototypes and variable in the vm_page structure. Many thanks to Drew Gallatin, who wrote the zero copy send and receive code, and to all the other folks who have tested and reviewed this code over the years.
# e78f35b3	25-Jun-2002	Jeff Roberson <jeff@FreeBSD.org>	Turn VM_ALLOC_ZERO into a flag. Submitted by: tegge Reviewed by: dillon
# ea0f50bc	30-Apr-2002	Alan Cox <alc@FreeBSD.org>	o Convert the vm_page buckets mutex to a spin lock. (This resolves an issue on the Alpha platform found by jeff@.) o Simplify vm_page_lookup(). Reviewed by: jhb
# 44e74ba6	27-Apr-2002	Peter Wemm <peter@FreeBSD.org>	We do not necessarily need to map/unmap pages to zero parts of them. On systems where physical memory is also direct mapped (alpha, sparc, ia64 etc) this is slightly harmful.
# cbd53e95	26-Apr-2002	Alan Cox <alc@FreeBSD.org>	o Control access to the vm_page_buckets with a mutex. o Fix some style(9) bugs.
# 1a87a0da	15-Apr-2002	Peter Wemm <peter@FreeBSD.org>	Pass vm_page_t instead of physical addresses to pmap_zero_page[_area]() and pmap_copy_page(). This gets rid of a couple more physical addresses in upper layers, with the eventual aim of supporting PAE and dealing with the physical addressing mostly within pmap. (We will need either 64 bit physical addresses or page indexes, possibly both depending on the circumstances. Leaving this to pmap itself gives more flexibilitly.) Reviewed by: jake Tested on: i386, ia64 and (I believe) sparc64. (my alpha was hosed)
# 6008862b	04-Apr-2002	John Baldwin <jhb@FreeBSD.org>	Change callers of mtx_init() to pass in an appropriate lock type name. In most cases NULL is passed, but in some cases such as network driver locks (which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used. Tested on: i386, alpha, sparc64
# 48f9a594	02-Apr-2002	Jake Burkholder <jake@FreeBSD.org>	Fix a long standing 32bit-ism. Don't assume that the size of a chunk of memory in phys_avail will fit in 'int', use vm_size_t. This fixes booting on sparc64 machines with more than 2 gigs of ram. Thanks to Jan Chrillesen for providing me with access to a 4 gig machine.
# 8355f576	19-Mar-2002	Jeff Roberson <jeff@FreeBSD.org>	This is the first part of the new kernel memory allocator. This replaces malloc(9) and vm_zone with a slab like allocator. Reviewed by: arch@
# a1287949	10-Mar-2002	Eivind Eklund <eivind@FreeBSD.org>	- Remove a number of extra newlines that do not belong here according to style(9) - Minor space adjustment in cases where we have "( ", " )", if(), return(), while(), for(), etc. - Add /* SYMBOL */ after a few #endifs. Reviewed by: alc
# 2be21c5e	04-Mar-2002	Alan Cox <alc@FreeBSD.org>	o Create vm_pageq_enqueue() to encapsulate code that is duplicated time and again in vm_page.c and vm_pageq.c. o Delete unusused prototypes. (Mainly a result of the earlier renaming of various functions from vm_page_() to vm_pageq_().)
# d2760948	19-Feb-2002	Tor Egge <tegge@FreeBSD.org>	Add a page queue, PQ_HOLD, that temporarily owns pages with nonzero hold count that would otherwise be on one of the free queues. This eliminates a panic when broken programs unmap memory that still has pending IO from raw devices. Reviewed by: dillon, alc
# 0c9e4723	19-Feb-2002	Mike Silbersack <silby@FreeBSD.org>	Add one more comment to the OOM changes so that future readers of the code may better understand the code. Suggested by: dillon MFC after: 1 week
# ef6020d1	19-Feb-2002	Mike Silbersack <silby@FreeBSD.org>	Changes to make the OOM killer much more effective: - Allow the OOM killer to target processes currently locked in memory. These very often are the ones doing the memory hogging. - Drop the wakeup priority of processes currently sleeping while waiting for their page fault to complete. In order for the OOM killer to work well, the killed process and other system processes waiting on memory must be allowed to wakeup first. Reviewed by: dillon MFC after: 1 week
# 3ebeaf59	13-Dec-2001	Matthew Dillon <dillon@FreeBSD.org>	This fixes a large number of bugs in our NFS client side code. A recent commit by Kirk also fixed a softupdates bug that could easily be triggered by server side NFS. * An edge case with shared R+W mmap()'s and truncate whereby the system would inappropriately clear the dirty bits on still-dirty data. (applicable to all filesystems) THIS FIX TEMPORARILY DISABLED PENDING FURTHER TESTING. see vm/vm_page.c line 1641 * The straddle case for VM pages and buffer cache buffers when truncating. (applicable to NFS client side) * Possible SMP database corruption due to vm_pager_unmap_page() not clearing the TLB for the other cpu's. (applicable to NFS client side but could effect all filesystems). Note: not considered serious since the corruption occurs beyond the file EOF. * When flusing a dirty buffer due to B_CACHE getting cleared, we were accidently setting B_CACHE again (that is, bwrite() sets B_CACHE), when we really want it to stay clear after the write is complete. This resulted in a corrupt buffer. (applicable to all filesystems but probably only triggered by NFS) * We have to call vtruncbuf() when ftruncate()ing to remove any buffer cache buffers. This is still tentitive, I may be able to remove it due to the second bug fix. (applicable to NFS client side) * vnode_pager_setsize() race against nfs_vinvalbuf()... we have to set n_size before calling nfs_vinvalbuf or the NFS code may recursively vnode_pager_setsize() to the original value before the truncate. This is what was causing the user mmap bus faults in the nfs tester program. (applicable to NFS client side) * Fix to softupdates (see ufs/ffs/ffs_inode.c 1.73, commit made by Kirk). Testing program written by: Avadis Tevanian, Jr. Testing program supplied by: jkh / Apple (see Dec2001 posting to freebsd-hackers with Subject 'NFS: How to make FreeBS fall on its face in one easy step') MFC after: 1 week
# 245df27c	25-Oct-2001	Matthew Dillon <dillon@FreeBSD.org>	Implement kern.maxvnodes. adjusting kern.maxvnodes now actually has a real effect. Optimize vfs_msync(). Avoid having to continually drop and re-obtain mutexes when scanning the vnode list. Improves looping case by 500%. Optimize ffs_sync(). Avoid having to continually drop and re-obtain mutexes when scanning the vnode list. This makes a couple of assumptions, which I believe are ok, in regards to vnode stability when the mount list mutex is held. Improves looping case by 500%. (more optimization work is needed on top of these fixes) MFC after: 1 week
# 3516c025	24-Aug-2001	Peter Wemm <peter@FreeBSD.org>	Implement idle zeroing of pages. I've been tinkering with this on and off since John Dyson left his work-in-progress. It is off by default for now. sysctl vm.zeroidle_enable=1 to turn it on. There are some hacks here to deal with the present lack of preemption - we yield after doing a small number of pages since we wont preempt otherwise. This is basically Matt's algorithm [with hysteresis] with an idle process to call it in a similar way it used to be called from the idle loop. I cleaned up the includes a fair bit here too.
# 0b76df71	21-Aug-2001	Matthew Dillon <dillon@FreeBSD.org>	KASSERT if vm_page_t->wire_count overflows.
# 8ec48c6d	10-Aug-2001	John Baldwin <jhb@FreeBSD.org>	- Remove asleep(), await(), and M_ASLEEP. - Callers of asleep() and await() have been converted to calling tsleep(). The only caller outside of M_ASLEEP was the ata driver, which called both asleep() and await() with spl-raised, so there was no need for the asleep() and await() pair. M_ASLEEP was unused. Reviewed by: jasone, peter
# 3a9b5daf	30-Jul-2001	Jake Burkholder <jake@FreeBSD.org>	Oops. Last commit to vm_object.c should have got these files too. Remove the use of atomic ops to manipulate vm_object and vm_page flags. Giant is required here, so they are superfluous. Discussed with: dillon
# d3e5863f	22-Jul-2001	Assar Westerlund <assar@FreeBSD.org>	make vm_page_select_cache static Requested by: bde
# 6d03d577	04-Jul-2001	Matthew Dillon <dillon@FreeBSD.org>	Reorg vm_page.c into vm_page.c, vm_pageq.c, and vm_contig.c (for contigmalloc). Also removed some spl's and added some VM mutexes, but they are not actually used yet, so this commit does not really make any operational changes to the system. vm_page.c relates to vm_page_t manipulation, including high level deactivation, activation, etc... vm_pageq.c relates to finding free pages and aquiring exclusive access to a page queue (exclusivity part not yet implemented). And the world still builds... :-)
# 1b40f8c0	04-Jul-2001	Matthew Dillon <dillon@FreeBSD.org>	Change inlines back into mainline code in preparation for mutexing. Also, most of these inlines had been bloated in -current far beyond their original intent. Normalize prototypes and function declarations to be ANSI only (half already were). And do some general cleanup. (kernel size also reduced by 50-100K, but that isn't the prime intent)
# 54d92145	04-Jul-2001	Matthew Dillon <dillon@FreeBSD.org>	whitespace / register cleanup
# 0cddd8f0	04-Jul-2001	Matthew Dillon <dillon@FreeBSD.org>	With Alfred's permission, remove vm_mtx in favor of a fine-grained approach (this commit is just the first stage). Also add various GIANT_ macros to formalize the removal of Giant, making it easy to test in a more piecemeal fashion. These macros will allow us to test fine-grained locks to a degree before removing Giant, and also after, and to remove Giant in a piecemeal fashion via sysctl's on those subsystems which the authors believe can operate without Giant.
# ac8f990b	24-May-2001	Matthew Dillon <dillon@FreeBSD.org>	This patch implements O_DIRECT about 80% of the way. It takes a patchset Tor created a while ago, removes the raw I/O piece (that has cache coherency problems), and adds a buffer cache / VM freeing piece. Essentially this patch causes O_DIRECT I/O to not be left in the cache, but does not prevent it from going through the cache, hence the 80%. For the last 20% we need a method by which the I/O can be issued directly to buffer supplied by the user process and bypass the buffer cache entirely, but still maintain cache coherency. I also have the code working under -stable but the changes made to sys/file.h may not be MFCable, so an MFC is not on the table yet. Submitted by: tegge, dillon
# 7d4ad42d	22-May-2001	John Baldwin <jhb@FreeBSD.org>	Sort includes.
# 23955314	18-May-2001	Alfred Perlstein <alfred@FreeBSD.org>	Introduce a global lock for the vm subsystem (vm_mtx). vm_mtx does not recurse and is required for most low level vm operations. faults can not be taken without holding Giant. Memory subsystems can now call the base page allocators safely. Almost all atomic ops were removed as they are covered under the vm mutex. Alpha and ia64 now need to catch up to i386's trap handlers. FFS and NFS have been tested, other filesystems will need minor changes (grabbing the vm lock when twiddling page properties). Reviewed (partially) by: jake, jhb
# fb919e4d	01-May-2001	Mark Murray <markm@FreeBSD.org>	Undo part of the tangle of having sys/lock.h and sys/mutex.h included in other "system" header files. Also help the deprecation of lockmgr.h by making it a sub-include of sys/lock.h and removing sys/lockmgr.h form kernel .c files. Sort sys/*.h includes where possible in affected files. OK'ed by: bde (with reservations)
# 136d8f42	06-Mar-2001	John Baldwin <jhb@FreeBSD.org>	Unrevert the pmap_map() changes. They weren't broken on x86. Sense beaten into me by: peter
# 4a01ebd4	06-Mar-2001	John Baldwin <jhb@FreeBSD.org>	Back out the pmap_map() change for now, it isn't completely stable on the i386.
# 968950e5	05-Mar-2001	John Baldwin <jhb@FreeBSD.org>	- Rework pmap_map() to take advantage of direct-mapped segments on supported architectures such as the alpha. This allows us to save on kernel virtual address space, TLB entries, and (on the ia64) VHPT entries. pmap_map() now modifies the passed in virtual address on architectures that do not support direct-mapped segments to point to the next available virtual address. It also returns the actual address that the request was mapped to. - On the IA64 don't use a special zone of PV entries needed for early calls to pmap_kenter() during pmap_init(). This gets us in trouble because we end up trying to use the zone allocator before it is initialized. Instead, with the pmap_map() change, the number of needed PV entries is small enough that we can get by with a static pool that is used until pmap_init() is complete. Submitted by: dfr Debugging help: peter Tested by: me
# c909b971	01-Mar-2001	Andrew Gallatin <gallatin@FreeBSD.org>	Allocate vm_page_array and vm_page_buckets from the end of the biggest chunk of memory, rather than from the start. This fixes problems allocating bouncebuffers on alphas where there is only 1 chunk of memory (unlike PCs where there is generally at least one small chunk and a large chunk). Having 1 chunk had been fatal, because these structures take over 13MB on a machine with 1GB of ram. This doesn't leave much room for other structures and bounce buffers if they're at the front. Reviewed by: dfr, anderson@cs.duke.edu, silence on -arch Tested by: Yoriaki FUJIMORI <fujimori@grafin.fujimori.cache.waseda.ac.jp>
# 2b6b0df7	26-Dec-2000	Matthew Dillon <dillon@FreeBSD.org>	This implements a better launder limiting solution. There was a solution in 4.2-REL which I ripped out in -stable and -current when implementing the low-memory handling solution. However, maxlaunder turns out to be the saving grace in certain very heavily loaded systems (e.g. newsreader box). The new algorithm limits the number of pages laundered in the first pageout daemon pass. If that is not sufficient then suceessive will be run without any limit. Write I/O is now pipelined using two sysctls, vfs.lorunningspace and vfs.hirunningspace. This prevents excessive buffered writes in the disk queues which cause long (multi-second) delays for reads. It leads to more stable (less jerky) and generally faster I/O streaming to disk by allowing required read ops (e.g. for indirect blocks and such) to occur without interrupting the write stream, amoung other things. NOTE: eventually, filesystem write I/O pipelining needs to be done on a per-device basis. At the moment it is globalized.
# 065b2580	18-Dec-2000	Poul-Henning Kamp <phk@FreeBSD.org>	Fix floppy drives on machines with lots of RAM. The fix works by reverting the ordering of free memory so that the chances of contig_malloc() succeeding increases. PR: 23291 Submitted by: Andrew Atrens <atrens@nortel.ca>
# 936524aa	18-Nov-2000	Matthew Dillon <dillon@FreeBSD.org>	Implement a low-memory deadlock solution. Removed most of the hacks that were trying to deal with low-memory situations prior to now. The new code is based on the concept that I/O must be able to function in a low memory situation. All major modules related to I/O (except networking) have been adjusted to allow allocation out of the system reserve memory pool. These modules now detect a low memory situation but rather then block they instead continue to operate, then return resources to the memory pool instead of cache them or leave them wired. Code has been added to stall in a low-memory situation prior to a vnode being locked. Thus situations where a process blocks in a low-memory condition while holding a locked vnode have been reduced to near nothing. Not only will I/O continue to operate, but many prior deadlock conditions simply no longer exist. Implement a number of VFS/BIO fixes (found by Ian): in biodone(), bogus-page replacement code, the loop was not properly incrementing loop variables prior to a continue statement. We do not believe this code can be hit anyway but we aren't taking any chances. We'll turn the whole section into a panic (as it already is in brelse()) after the release is rolled. In biodone(), the foff calculation was incorrectly clamped to the iosize, causing the wrong foff to be calculated for pages in the case of an I/O error or biodone() called without initiating I/O. The problem always caused a panic before. Now it doesn't. The problem is mainly an issue with NFS. Fixed casts for ~PAGE_MASK. This code worked properly before only because the calculations use signed arithmatic. Better to properly extend PAGE_MASK first before inverting it for the 64 bit masking op. In brelse(), the bogus_page fixup code was improperly throwing away the original contents of 'm' when it did the j-loop to fix the bogus pages. The result was that it would potentially invalidate parts of the WRONG page(!), leading to corruption. There may still be cases where a background bitmap write is being duplicated, causing potential corruption. We have identified a potentially serious bug related to this but the fix is still TBD. So instead this patch contains a KASSERT to detect the problem and panic the machine rather then continue to corrupt the filesystem. The problem does not occur very often.. it is very hard to reproduce, and it may or may not be the cause of the corruption people have reported. Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>) Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
# c904bbbd	03-Jul-2000	Kirk McKusick <mckusick@FreeBSD.org>	Simplify and rationalise the management of the vnode free list (preparing the code to add snapshots).
# 8b03c8ed	29-May-2000	Matthew Dillon <dillon@FreeBSD.org>	This is a cleanup patch to Peter's new OBJT_PHYS VM object type and sysv shared memory support for it. It implements a new PG_UNMANAGED flag that has slightly different characteristics from PG_FICTICIOUS. A new sysctl, kern.ipc.shm_use_phys has been added to enable the use of physically-backed sysv shared memory rather then swap-backed. Physically backed shm segments are not tracked with PV entries, allowing programs which use a large shm segment as a rendezvous point to operate without eating an insane amount of KVM in the PV entry management. Read: Oracle. Peter's OBJT_PHYS object will also allow us to eventually implement page-table sharing and/or 4MB physical page support for such segments. We're half way there.
# 0385347c	20-May-2000	Peter Wemm <peter@FreeBSD.org>	Implement an optimization of the VM<->pmap API. Pass vm_page_t's directly to various pmap_*() functions instead of looking up the physical address and passing that. In many cases, the first thing the pmap code was doing was going to a lot of trouble to get back the original vm_page_t, or it's shadow pv_table entry. Inspired by: John Dyson's 1998 patches. Also: Eliminate pv_table as a seperate thing and build it into a machine dependent part of vm_page_t. This eliminates having a seperate set of structions that shadow each other in a 1:1 fashion that we often went to a lot of trouble to translate from one to the other. (see above) This happens to save 4 bytes of physical memory for each page in the system. (8 bytes on the Alpha). Eliminate the use of the phys_avail[] array to determine if a page is managed (ie: it has pv_entries etc). Store this information in a flag. Things like device_pager set it because they create vm_page_t's on the fly that do not have pv_entries. This makes it easier to "unmanage" a page of physical memory (this will be taken advantage of in subsequent commits). Add a function to add a new page to the freelist. This could be used for reclaiming the previously wasted pages left over from preloaded loader(8) files. Reviewed by: dillon
# 5929bcfa	27-Mar-2000	Philippe Charnier <charnier@FreeBSD.org>	Revert spelling mistake I made in the previous commit Requested by: Alan and Bruce
# 956f3135	26-Mar-2000	Philippe Charnier <charnier@FreeBSD.org>	Spelling
# db5f635a	16-Mar-2000	Poul-Henning Kamp <phk@FreeBSD.org>	Eliminate the undocumented, experimental, non-delivering and highly dangerous MAX_PERF option.
# 4f79d873	11-Dec-1999	Matthew Dillon <dillon@FreeBSD.org>	Add MAP_NOSYNC feature to mmap(), and MADV_NOSYNC and MADV_AUTOSYNC to madvise(). This feature prevents the update daemon from gratuitously flushing dirty pages associated with a mapped file-backed region of memory. The system pager will still page the memory as necessary and the VM system will still be fully coherent with the filesystem. Modifications made by other means to the same area of memory, for example by write(), are unaffected. The feature works on a page-granularity basis. MAP_NOSYNC allows one to use mmap() to share memory between processes without incuring any significant filesystem overhead, putting it in the same performance category as SysV Shared memory and anonymous memory. Reviewed by: julian, alc, dg
# e6ce5295	09-Nov-1999	Alan Cox <alc@FreeBSD.org>	Two changes: (1) Use vm_page_unqueue_nowakeup in vm_page_alloc instead of duplicating the code. (2) If a wired page is passed to vm_page_free_toq, panic instead of printing a friendly warning. (If we don't panic here, we'll just panic later in vm_page_unwire obscuring the problem.)
# be72f788	30-Oct-1999	Alan Cox <alc@FreeBSD.org>	The core of this patch is to vm/vm_page.h. The effects are two-fold: (1) to eliminate an extra (useless) level of indirection in half of the page queue accesses and (2) to use a single name for each queue throughout, instead of, e.g., "vm_page_queue_active" in some places and "vm_page_queues[PQ_ACTIVE]" in others. Reviewed by: dillon
# 923502ff	29-Oct-1999	Poul-Henning Kamp <phk@FreeBSD.org>	useracc() the prequel: Merge the contents (less some trivial bordering the silly comments) of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts the #defines for the vm_inherit_t and vm_prot_t types next to their typedefs. This paves the road for the commit to follow shortly: change useracc() to use VM_PROT_{READ\|WRITE} rather than B_{READ\|WRITE} as argument.
# 90ecac61	16-Sep-1999	Matthew Dillon <dillon@FreeBSD.org>	Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com> Replace various VM related page count calculations strewn over the VM code with inlines to aid in readability and to reduce fragility in the code where modules depend on the same test being performed to properly sleep and wakeup. Split out a portion of the page deactivation code into an inline in vm_page.c to support vm_page_dontneed(). add vm_page_dontneed(), which handles the madvise MADV_DONTNEED feature in a related commit coming up for vm_map.c/vm_object.c. This code prevents degenerate cases where an essentially active page may be rotated through a subset of the paging lists, resulting in premature disposal.
# c3aac50f	27-Aug-1999	Peter Wemm <peter@FreeBSD.org>	$Id$ -> $FreeBSD$
# 14068cfe	20-Aug-1999	Alan Cox <alc@FreeBSD.org>	vm_page_alloc and contigmalloc1: Verify that free pages are not dirty. Submitted by: dillon
# 0e709935	17-Aug-1999	Alan Cox <alc@FreeBSD.org>	vm_page_free_toq: Update the comment to reflect the demise of PQ_ZERO and remove a (now) useless test.
# 3fc3fec6	16-Aug-1999	Alan Cox <alc@FreeBSD.org>	vm_page_free_toq: Clear the dirty bit mask (vm_page_undirty) before adding the page to the free page queue. Submitted by: dillon
# 6c91c1dc	10-Aug-1999	Alan Cox <alc@FreeBSD.org>	contigmalloc1: If a page is found in the wrong queue, panic instead of silently ignoring the problem.
# ed6d0b65	10-Aug-1999	Peter Wemm <peter@FreeBSD.org>	Add a contigfree() as a corollary to contigmalloc() as it's not clear which free routine to use and people are tempted to use free() (which doesn't work)
# 5d2aec89	31-Jul-1999	Alan Cox <alc@FreeBSD.org>	Change the type of vpgqueues::lcnt from "int *" to "int". The indirection served no purpose.
# 755292ac	30-Jul-1999	Alan Cox <alc@FreeBSD.org>	vm_page_queue_init: Remove the initialization of PQ_NONE's cnt and lcnt. They aren't used. vm_page_insert: Remove an unnecessary dereference. vm_page_wire: Remove the one and only (and thus pointless) reference to PQ_NONE's lcnt.
# 3efc015b	01-Jul-1999	Peter Wemm <peter@FreeBSD.org>	Fix some int/long printf problems for the Alpha
# 9c89c228	22-Jun-1999	Alan Cox <alc@FreeBSD.org>	Remove (1) "extern" declarations for variables that were previously made "static" and (2) initialized but unused variables.
# 60ff97b0	20-Jun-1999	Alan Cox <alc@FreeBSD.org>	Remove vm_object::cache_count and vm_object::wired_count. They are not used. (Nor is there any planned use by John who introduced them.) Reviewed by: "John S. Dyson" <toor@dyson.iquest.net>
# c2077034	19-Jun-1999	Alan Cox <alc@FreeBSD.org>	Set cnt.v_page_size to PAGE_SIZE rather than DEFAULT_PAGE_SIZE so that "vmstat -s" reports the correct value on the Alpha. Submitted by: Hidetoshi Shimokawa <simokawa@sat.t.u-tokyo.ac.jp>
# 4221e284	02-May-1999	Alan Cox <alc@FreeBSD.org>	The VFS/BIO subsystem contained a number of hacks in order to optimize piecemeal, middle-of-file writes for NFS. These hacks have caused no end of trouble, especially when combined with mmap(). I've removed them. Instead, NFS will issue a read-before-write to fully instantiate the struct buf containing the write. NFS does, however, optimize piecemeal appends to files. For most common file operations, you will not notice the difference. The sole remaining fragment in the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache coherency issues with read-merge-write style operations. NFS also optimizes the write-covers-entire-buffer case by avoiding the read-before-write. There is quite a bit of room for further optimization in these areas. The VM system marks pages fully-valid (AKA vm_page_t->valid = VM_PAGE_BITS_ALL) in several places, most noteably in vm_fault. This is not correct operation. The vm_pager_get_pages() code is now responsible for marking VM pages all-valid. A number of VM helper routines have been added to aid in zeroing-out the invalid portions of a VM page prior to the page being marked all-valid. This operation is necessary to properly support mmap(). The zeroing occurs most often when dealing with file-EOF situations. Several bugs have been fixed in the NFS subsystem, including bits handling file and directory EOF situations and buf->b_flags consistancy issues relating to clearing B_ERROR & B_INVAL, and handling B_DONE. getblk() and allocbuf() have been rewritten. B_CACHE operation is now formally defined in comments and more straightforward in implementation. B_CACHE for VMIO buffers is based on the validity of the backing store. B_CACHE for non-VMIO buffers is based simply on whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear, and vise-versa). biodone() is now responsible for setting B_CACHE when a successful read completes. B_CACHE is also set when a bdwrite() is initiated and when a bwrite() is initiated. VFS VOP_BWRITE routines (there are only two - nfs_bwrite() and bwrite()) are now expected to set B_CACHE. This means that bowrite() and bawrite() also set B_CACHE indirectly. There are a number of places in the code which were previously using buf->b_bufsize (which is DEV_BSIZE aligned) when they should have been using buf->b_bcount. These have been fixed. getblk() now clears B_DONE on return because the rest of the system is so bad about dealing with B_DONE. Major fixes to NFS/TCP have been made. A server-side bug could cause requests to be lost by the server due to nfs_realign() overwriting other rpc's in the same TCP mbuf chain. The server's kernel must be recompiled to get the benefit of the fixes. Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
# 8d17e694	05-Apr-1999	Julian Elischer <julian@FreeBSD.org>	Catch a case spotted by Tor where files mmapped could leave garbage in the unallocated parts of the last page when the file ended on a frag but not a page boundary. Delimitted by tags PRE_MATT_MMAP_EOF and POST_MATT_MMAP_EOF, in files alpha/alpha/pmap.c i386/i386/pmap.c nfs/nfs_bio.c vm/pmap.h vm/vm_page.c vm/vm_page.h vm/vnode_pager.c miscfs/specfs/spec_vnops.c ufs/ufs/ufs_readwrite.c kern/vfs_bio.c Submitted by: Matt Dillon <dillon@freebsd.org> Reviewed by: Alan Cox <alc@freebsd.org>
# 61fc5ee6	18-Mar-1999	Alan Cox <alc@FreeBSD.org>	Construct the free queue(s) in descending order (by physical address) so that the first 16MB of physical memory is allocated last rather than first. On large-memory machines, this avoids the exhaustion of low physical memory before isa_dmainit has run.
# d1bf5d56	24-Feb-1999	Matthew Dillon <dillon@FreeBSD.org>	Remove unnecessary page protects on map_split and collapse operations. Fix bug where an object's OBJ_WRITEABLE/OBJ_MIGHTBEDIRTY flags do not get set under certain circumstances ( page rename case ). Reviewed by: Alan Cox <alc@cs.rice.edu>, John Dyson
# efcae3d3	14-Feb-1999	Matthew Dillon <dillon@FreeBSD.org>	Minor reorganization of vm_page_alloc(). No functional changes have been made but the code has been reorganized and documented to make it more readable, reduce the size of the code, and optimize the branch path caching capabilities that most modern processors have.
# faa273d5	07-Feb-1999	Matthew Dillon <dillon@FreeBSD.org>	Rip out PQ_ZERO queue. PQ_ZERO functionality is now combined in with PQ_FREE. There is little operational difference other then the kernel being a few kilobytes smaller and the code being more readable. * vm_page_select_free() has been greatly simplified. * The PQ_ZERO page queue and supporting structures have been removed * vm_page_zero_idle() revamped (see below) PG_ZERO setting and clearing has been migrated from vm_page_alloc() to vm_page_free[_zero]() and will eventually be guarenteed to remain tracked throughout a page's life ( if it isn't already ). When a page is freed, PG_ZERO pages are appended to the appropriate tailq in the PQ_FREE queue while non-PG_ZERO pages are prepended. When locating a new free page, PG_ZERO selection operates from within vm_page_list_find() ( get page from end of queue instead of beginning of queue ) and then only occurs in the nominal critical path case. If the nominal case misses, both normal and zero-page allocation devolves into the same _vm_page_list_find() select code without any specific zero-page optimizations. Additionally, vm_page_zero_idle() has been revamped. Hysteresis has been added and zero-page tracking adjusted to conform with the other changes. Currently hysteresis is set at 1/3 (lo) and 1/2 (hi) the number of free pages. We may wish to increase both parameters as time permits. The hysteresis is designed to avoid silly zeroing in borderline allocation/free situations.
# a0e7b3e5	07-Feb-1999	Matthew Dillon <dillon@FreeBSD.org>	Remove L1 cache coloring optimization ( leave L2 cache coloring opt ). Rewrite vm_page_list_find() and vm_page_select_free() - make inline out of nominal case.
# 8aef1712	27-Jan-1999	Matthew Dillon <dillon@FreeBSD.org>	Fix warnings in preparation for adding -Wall -Wcast-qual to the kernel compile
# 2f586e1b	24-Jan-1999	Matthew Dillon <dillon@FreeBSD.org>	Undo last commit - not a bug, just duplicate code. PG_MAPPED and PG_WRITEABLE are already cleared by vm_page_protect().
# 68af6d16	23-Jan-1999	Matthew Dillon <dillon@FreeBSD.org>	vm_map_split() used to dirty the page manually after calling vm_page_rename(), but never pulled the page off PQ_CACHE if it was on PQ_CACHE. Dirty pages in PQ_CACHE are not allowed and a KASSERT was added in -4.x to test for this... and got hit. In -4.x, vm_page_rename() automatically dirties the page. This commit also has it deal with the PQ_CACHE case, deactivating the page in that case.
# e1a4feaf	23-Jan-1999	Matthew Dillon <dillon@FreeBSD.org>	Clear PG_MAPPED as well as PG_WRITEABLE when a page is moved to the cache.
# c9fa34cf	23-Jan-1999	Matthew Dillon <dillon@FreeBSD.org>	Clear PG_WRITEABLE in vm_page_cache(). This may or may not be a bug, but the bit should definitely be cleared.
# 060282de	21-Jan-1999	Matthew Dillon <dillon@FreeBSD.org>	The hash table used to be a table of doubly-link list headers ( two pointers per entry ). The table has been changed to a singly linked list of vm_page_t pointers. The table has been doubled in size, but the entries only take half the space so a net-zero change in memory use. The hash function has been changed, hopefully for the better. The combination of the larger hash table size of changed function should keep the chain length down to a reasonable number (0-3, average 1). vm_object->page_hint has been removed. This 'optimization' was not only never needed, but costs as much as a hash chain link to implement. While having page_hint in vm_object might result in better locality of reference, the cost is not worth the space in vm_object or the extra instructions in my view. vm_page_alloc*() functions have been inlined and call a generalized non-inlined vm_page_alloc_toq() which combines the standard alloc and zero-page alloc functions together, reducing code size and the L1 cache footprint. Some reordering has been done... not much. The delinking code should be faster ( because unlinking a doubly-linked list requires four memory ops and unlinking a singly linked list only requires two ), and we get a hash consistancy check for free. vm_page_rename() now automatically sets the page's dirty bits. vm_page_alloc() does not try to manually inline freeing a cache page. Instead, it now properly calls vm_page_free(m) ... vm_page_free() is really too complex to manually inline. vm_await(), supporting asleep(), has been added.
# 1c7c3c6a	21-Jan-1999	Matthew Dillon <dillon@FreeBSD.org>	This is a rather large commit that encompasses the new swapper, changes to the VM system to support the new swapper, VM bug fixes, several VM optimizations, and some additional revamping of the VM code. The specific bug fixes will be documented with additional forced commits. This commit is somewhat rough in regards to code cleanup issues. Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>
# 219cbf59	09-Jan-1999	Eivind Eklund <eivind@FreeBSD.org>	KNFize, by bde.
# 5526d2d9	08-Jan-1999	Eivind Eklund <eivind@FreeBSD.org>	Split DIAGNOSTIC -> DIAGNOSTIC, INVARIANTS, and INVARIANT_SUPPORT as discussed on -hackers. Introduce 'KASSERT(assertion, ("panic message", args))' for simple check + panic. Reviewed by: msmith
# 9858fcda	22-Dec-1998	Matthew Dillon <dillon@FreeBSD.org>	Update comments to routines in vm_page.c, most especially whether a routine can block or not as part of a general effort to carefully document blocking/non-blocking calls in the kernel.
# 4f6e1f8b	11-Nov-1998	David Greenman <dg@FreeBSD.org>	Closed a small race condition between wiring/unwiring pages that involved the page's wire_count.
# c8d14c76	28-Oct-1998	David Greenman <dg@FreeBSD.org>	Fixed wrong comments in and about vm_page_deactivate().
# 73007561	28-Oct-1998	David Greenman <dg@FreeBSD.org>	Added a second argument, "activate" to the vm_page_unwire() call so that the caller can select either inactive or active queue to put the page on.
# f5ef029e	25-Oct-1998	Poul-Henning Kamp <phk@FreeBSD.org>	Nitpicking and dusting performed on a train. Removes trivial warnings about unused variables, labels and other lint.
# 300ee824	21-Oct-1998	David Greenman <dg@FreeBSD.org>	Nuked PG_TABLED flag. Replaced with m->object != NULL.
# 12d534d2	21-Oct-1998	David Greenman <dg@FreeBSD.org>	Add a diagnostic printf for freeing a wired page. This will eventually be turned into a panic, but I want to make sure that all cases of freeing pages with wire_count==1 (which is/was allowed) have first been fixed.
# e69763a3	04-Sep-1998	Doug Rabson <dfr@FreeBSD.org>	Cosmetic changes to the PAGE_XXX macros to make them consistent with the other objects in vm.
# 069e9bc1	24-Aug-1998	Doug Rabson <dfr@FreeBSD.org>	Change various syscalls to use size_t arguments instead of u_int. Add some overflow checks to read/write (from bde). Change all modifications to vm_page::flags, vm_page::busy, vm_object::flags and vm_object::paging_in_progress to use operations which are not interruptable. Reviewed by: Bruce Evans <bde@zeta.org.au>
# 56e7ede1	26-Jul-1998	Doug Rabson <dfr@FreeBSD.org>	Notify pmap when a page is freed on the alpha to allow it to clean up its emulated modified/referenced bits.
# 15c73825	14-Jul-1998	Bruce Evans <bde@FreeBSD.org>	Cast pointers to [u]intptr_t instead of to [unsigned] long.
# ac1e407b	11-Jul-1998	Bruce Evans <bde@FreeBSD.org>	Fixed printf format errors.
# be160d60	21-Jun-1998	Bruce Evans <bde@FreeBSD.org>	Removed unused includes.
# ecbb00a2	07-Jun-1998	Doug Rabson <dfr@FreeBSD.org>	This commit fixes various 64bit portability problems required for FreeBSD/alpha. The most significant item is to change the command argument to ioctl functions from int to u_long. This change brings us inline with various other BSD versions. Driver writers may like to use (__FreeBSD_version == 300003) to detect this change. The prototype FreeBSD/alpha machdep will follow in a couple of days time.
# 976f208b	01-Jun-1998	John Dyson <dyson@FreeBSD.org>	Cleanup and remove some dead code from the initialization.
# 3c336467	01-May-1998	Peter Wemm <peter@FreeBSD.org>	Seatbelts for vm_page_bits() in case a file offset is passed in rather than the page offset. If a large file offset was passed in, a large negative array index could be generated which could cause page faults etc at worst and file corruption at the least. (Pages are allocated within file space on page alignment boundaries, so a file offset being passed in here is harmless to DTRT. The case where this was happening has already been fixed though, this is in case it happens again). Reviewed by: dyson
# c1087c13	15-Apr-1998	Bruce Evans <bde@FreeBSD.org>	Support compiling with `gcc -ansi'.
# bef608bd	15-Mar-1998	John Dyson <dyson@FreeBSD.org>	Some VM improvements, including elimination of alot of Sig-11 problems. Tor Egge and others have helped with various VM bugs lately, but don't blame him -- blame me!!! pmap.c: 1) Create an object for kernel page table allocations. This fixes a bogus allocation method previously used for such, by grabbing pages from the kernel object, using bogus pindexes. (This was a code cleanup, and perhaps a minor system stability issue.) pmap.c: 2) Pre-set the modify and accessed bits when prudent. This will decrease bus traffic under certain circumstances. vfs_bio.c, vfs_cluster.c: 3) Rather than calculating the beginning virtual byte offset multiple times, stick the offset into the buffer header, so that the calculated offset can be reused. (Long long multiplies are often expensive, and this is a probably unmeasurable performance improvement, and code cleanup.) vfs_bio.c: 4) Handle write recursion more intelligently (but not perfectly) so that it is less likely to cause a system panic, and is also much more robust. vfs_bio.c: 5) getblk incorrectly wrote out blocks that are incorrectly sized. The problem is fixed, and writes blocks out ONLY when B_DELWRI is true. vfs_bio.c: 6) Check that already constituted buffers have fully valid pages. If not, then make sure that the B_CACHE bit is not set. (This was a major source of Sig-11 type problems.) vfs_bio.c: 7) Fix a potential system deadlock due to an incorrectly specified sleep priority while waiting for a buffer write operation. The change that I made opens the system up to serious problems, and we need to examine the issue of process sleep priorities. vfs_cluster.c, vfs_bio.c: 8) Make clustered reads work more correctly (and more completely) when buffers are already constituted, but not fully valid. (This was another system reliability issue.) vfs_subr.c, ffs_inode.c: 9) Create a vtruncbuf function, which is used by filesystems that can truncate files. The vinvalbuf forced a file sync type operation, while vtruncbuf only invalidates the buffers past the new end of file, and also invalidates the appropriate pages. (This was a system reliabiliy and performance issue.) 10) Modify FFS to use vtruncbuf. vm_object.c: 11) Make the object rundown mechanism for OBJT_VNODE type objects work more correctly. Included in that fix, create pager entries for the OBJT_DEAD pager type, so that paging requests that might slip in during race conditions are properly handled. (This was a system reliability issue.) vm_page.c: 12) Make some of the page validation routines be a little less picky about arguments passed to them. Also, support page invalidation change the object generation count so that we handle generation counts a little more robustly. vm_pageout.c: 13) Further reduce pageout daemon activity when the system doesn't need help from it. There should be no additional performance decrease even when the pageout daemon is running. (This was a significant performance issue.) vnode_pager.c: 14) Teach the vnode pager to handle race conditions during vnode deallocations.
# e163e201	07-Mar-1998	John Dyson <dyson@FreeBSD.org>	Some cruft left over from my megacommit. A page rotation optimization was a good idea, but can cause instability. That optimization is now removed.
# 8f9110f6	07-Mar-1998	John Dyson <dyson@FreeBSD.org>	This mega-commit is meant to fix numerous interrelated problems. There has been some bitrot and incorrect assumptions in the vfs_bio code. These problems have manifest themselves worse on NFS type filesystems, but can still affect local filesystems under certain circumstances. Most of the problems have involved mmap consistancy, and as a side-effect broke the vfs.ioopt code. This code might have been committed seperately, but almost everything is interrelated. 1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that are fully valid. 2) Rather than deactivating erroneously read initial (header) pages in kern_exec, we now free them. 3) Fix the rundown of non-VMIO buffers that are in an inconsistent (missing vp) state. 4) Fix the disassociation of pages from buffers in brelse. The previous code had rotted and was faulty in a couple of important circumstances. 5) Remove a gratuitious buffer wakeup in vfs_vmio_release. 6) Remove a crufty and currently unused cluster mechanism for VBLK files in vfs_bio_awrite. When the code is functional, I'll add back a cleaner version. 7) The page busy count wakeups assocated with the buffer cache usage were incorrectly cleaned up in a previous commit by me. Revert to the original, correct version, but with a cleaner implementation. 8) The cluster read code now tries to keep data associated with buffers more aggressively (without breaking the heuristics) when it is presumed that the read data (buffers) will be soon needed. 9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The delay loop waiting is not useful for filesystem locks, due to the length of the time intervals. 10) Correct and clean-up spec_getpages. 11) Implement a fully functional nfs_getpages, nfs_putpages. 12) Fix nfs_write so that modifications are coherent with the NFS data on the server disk (at least as well as NFS seems to allow.) 13) Properly support MS_INVALIDATE on NFS. 14) Properly pass down MS_INVALIDATE to lower levels of the VM code from vm_map_clean. 15) Better support the notion of pages being busy but valid, so that fewer in-transit waits occur. (use p->busy more for pageouts instead of PG_BUSY.) Since the page is fully valid, it is still usable for reads. 16) It is possible (in error) for cached pages to be busy. Make the page allocation code handle that case correctly. (It should probably be a printf or panic, but I want the system to handle coding errors robustly. I'll probably add a printf.) 17) Correct the design and usage of vm_page_sleep. It didn't handle consistancy problems very well, so make the design a little less lofty. After vm_page_sleep, if it ever blocked, it is still important to relookup the page (if the object generation count changed), and verify it's status (always.) 18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up. 19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush. 20) Fix vm_pager_put_pages and it's descendents to support an int flag instead of a boolean, so that we can pass down the invalidate bit.
# ffc82b0a	28-Feb-1998	John Dyson <dyson@FreeBSD.org>	1) Use a more consistent page wait methodology. 2) Do not unnecessarily force page blocking when paging pages out. 3) Further improve swap pager performance and correctness, including fixing the paging in progress deadlock (except in severe I/O error conditions.) 4) Enable vfs_ioopt=1 as a default. 5) Fix and enable the page prezeroing in SMP mode. All in all, SMP systems especially should show a significant improvement in "snappyness."
# 303b270b	08-Feb-1998	Eivind Eklund <eivind@FreeBSD.org>	Staticize.
# 0b08f5f7	05-Feb-1998	Eivind Eklund <eivind@FreeBSD.org>	Back out DIAGNOSTIC changes.
# 95461b45	04-Feb-1998	John Dyson <dyson@FreeBSD.org>	1) Start using a cleaner and more consistant page allocator instead of the various ad-hoc schemes. 2) When bringing in UPAGES, the pmap code needs to do another vm_page_lookup. 3) When appropriate, set the PG_A or PG_M bits a-priori to both avoid some processor errata, and to minimize redundant processor updating of page tables. 4) Modify pmap_protect so that it can only remove permissions (as it originally supported.) The additional capability is not needed. 5) Streamline read-only to read-write page mappings. 6) For pmap_copy_page, don't enable write mapping for source page. 7) Correct and clean-up pmap_incore. 8) Cluster initial kern_exec pagin. 9) Removal of some minor lint from kern_malloc. 10) Correct some ioopt code. 11) Remove some dead code from the MI swapout routine. 12) Correct vm_object_deallocate (to remove backing_object ref.) 13) Fix dead object handling, that had problems under heavy memory load. 14) Add minor vm_page_lookup improvements. 15) Some pages are not in objects, and make sure that the vm_page.c can properly support such pages. 16) Add some more page deficit handling. 17) Some minor code readability improvements.
# 47cfdb16	04-Feb-1998	Eivind Eklund <eivind@FreeBSD.org>	Turn DIAGNOSTIC into a new-style option.
# c15541e7	31-Jan-1998	John Dyson <dyson@FreeBSD.org>	contigalloc doesn't place the allocated page(s) into an object, and now this breaks vm_page_wire (due to wired page accounting per object.) This should fix a problem as described by Donald Maddox.
# eaf13dd7	31-Jan-1998	John Dyson <dyson@FreeBSD.org>	Change the busy page mgmt, so that when pages are freed, they MUST be PG_BUSY. It is bogus to free a page that isn't busy, because it is in a state of being "unavailable" when being freed. The additional advantage is that the page_remove code has a better cross-check that the page should be busy and unavailable for other use. There were some minor problems with the collapse code, and this plugs those subtile "holes." Also, the vfs_bio code wasn't checking correctly for PG_BUSY pages. I am going to develop a more consistant scheme for grabbing pages, busy or otherwise. For now, we are stuck with the current morass.
# 2d8acc0f	22-Jan-1998	John Dyson <dyson@FreeBSD.org>	VM level code cleanups. 1) Start using TSM. Struct procs continue to point to upages structure, after being freed. Struct vmspace continues to point to pte object and kva space for kstack. u_map is now superfluous. 2) vm_map's don't need to be reference counted. They always exist either in the kernel or in a vmspace. The vmspaces are managed by reference counts. 3) Remove the "wired" vm_map nonsense. 4) No need to keep a cache of kernel stack kva's. 5) Get rid of strange looking ++var, and change to var++. 6) Change more data structures to use our "zone" allocator. Added struct proc, struct vmspace and struct vnode. This saves a significant amount of kva space and physical memory. Additionally, this enables TSM for the zone managed memory. 7) Keep ioopt disabled for now. 8) Remove the now bogus "single use" map concept. 9) Use generation counts or id's for data structures residing in TSM, where it allows us to avoid unneeded restart overhead during traversals, where blocking might occur. 10) Account better for memory deficits, so the pageout daemon will be able to make enough memory available (experimental.) 11) Fix some vnode locking problems. (From Tor, I think.) 12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp. (experimental.) 13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c code. Use generation counts, get rid of unneded collpase operations, and clean up the cluster code. 14) Make vm_zone more suitable for TSM. This commit is partially as a result of discussions and contributions from other people, including DG, Tor Egge, PHK, and probably others that I have forgotten to attribute (so let me know, if I forgot.) This is not the infamous, final cleanup of the vnode stuff, but a necessary step. Vnode mgmt should be correct, but things might still change, and there is still some missing stuff (like ioopt, and physical backing of non-merged cache files, debugging of layering concepts.)
# 47221757	17-Jan-1998	John Dyson <dyson@FreeBSD.org>	Tie up some loose ends in vnode/object management. Remove an unneeded config option in pmap. Fix a problem with faulting in pages. Clean-up some loose ends in swap pager memory management. The system should be much more stable, but all subtile bugs aren't fixed yet.
# 925a3a41	11-Jan-1998	John Dyson <dyson@FreeBSD.org>	Fix some vnode management problems, and better mgmt of vnode free list. Fix the UIO optimization code. Fix an assumption in vm_map_insert regarding allocation of swap pagers. Fix an spl problem in the collapse handling in vm_object_deallocate. When pages are freed from vnode objects, and the criteria for putting the associated vnode onto the free list is reached, either put the vnode onto the list, or put it onto an interrupt safe version of the list, for further transfer onto the actual free list. Some minor syntax changes changing pre-decs, pre-incs to post versions. Remove a bogus timeout (that I added for debugging) from vn_lock. PHK will likely still have problems with the vnode list management, and so do I, but it is better than it was.
# 2be70f79	28-Dec-1997	John Dyson <dyson@FreeBSD.org>	Lots of improvements, including restructring the caching and management of vnodes and objects. There are some metadata performance improvements that come along with this. There are also a few prototypes added when the need is noticed. Changes include: 1) Cleaning up vref, vget. 2) Removal of the object cache. 3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore. 4) Correct some missing LK_RETRY's in vn_lock. 5) Correct the page range in the code for msync. Be gentle, and please give me feedback asap.
# 0aa89185	06-Nov-1997	John Dyson <dyson@FreeBSD.org>	Fix the "missing page" problem. Also, improve the performance of page allocation in common cases.
# f0d45e6a	10-Oct-1997	Poul-Henning Kamp <phk@FreeBSD.org>	Fix contigmalloc() and contigmalloc1() arguments.
# f8ddc1e2	13-Sep-1997	Peter Wemm <peter@FreeBSD.org>	Print correct function name in panics
# 79624e21	31-Aug-1997	Bruce Evans <bde@FreeBSD.org>	Removed unused #includes.
# 3075778b	04-Aug-1997	John Dyson <dyson@FreeBSD.org>	Get rid of the ad-hoc memory allocator for vm_map_entries, in lieu of a simple, clean zone type allocator. This new allocator will also be used for machine dependent pmap PV entries.
# 61600997	01-May-1997	John Dyson <dyson@FreeBSD.org>	Check the correct queue for waking up the pageout daemon. Specifically, the pageout daemon wasn't always being waken up appropriately when the (cache + free) queues were depleted. Submitted by: David S. Miller <davem@jenolan.rutgers.edu>
# c5d593ae	22-Mar-1997	John Dyson <dyson@FreeBSD.org>	Fix a significant error in the accounting for pre-zeroed pages. This is a candidate for RELENG_2_2...
# 6875d254	22-Feb-1997	Peter Wemm <peter@FreeBSD.org>	Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.
# 10825343	13-Feb-1997	Garrett Wollman <wollman@FreeBSD.org>	Provide an alternative interface to contigmalloc() which allows a specific map to be used when allocating the kernel va (e.g., mb_map). The VM gurus may want to look this over.
# 996c772f	09-Feb-1997	John Dyson <dyson@FreeBSD.org>	This is the kernel Lite/2 commit. There are some requisite userland changes, so don't expect to be able to run the kernel as-is (very well) without the appropriate Lite/2 userland changes. The system boots and can mount UFS filesystems. Untested: ext2fs, msdosfs, NFS Known problems: Incorrect Berkeley ID strings in some files. Mount_std mounts will not work until the getfsent library routine is changed. Reviewed by: various people Submitted by: Jeffery Hsu <hsu@freebsd.org>
# 1130b656	14-Jan-1997	Jordan K. Hubbard <jkh@FreeBSD.org>	Make the long-awaited change from $Id$ to $FreeBSD$ This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long. Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
# e0c5a895	28-Nov-1996	John Dyson <dyson@FreeBSD.org>	Make the kernel smaller with at worst a neutral effect on perf by de-inlining some VM calls. (Actually, I measured a small improvement.)
# 5b0a7408	16-Nov-1996	John Dyson <dyson@FreeBSD.org>	Improve the locality of reference for variables in vm_page and vm_kern by moving them from .bss to .data. With this change, there is a measurable perf improvement in fork/exec.
# db2c0faa	04-Nov-1996	John Dyson <dyson@FreeBSD.org>	Vastly improved contigmalloc routine. It does not solve the problem of allocating contiguous buffer memory in general, but make it much more likely to work at boot-up time. The best chance for an LKM-type load of a sound driver is immediately after the mount of the root filesystem. This appears to work for a 64K allocation on an 8MB system.
# 675878e7	14-Oct-1996	John Dyson <dyson@FreeBSD.org>	Move much of the machine dependent code from vm_glue.c into pmap.c. Along with the improved organization, small proc fork performance is now about 5%-10% faster.
# 8ba0c490	12-Oct-1996	Bruce Evans <bde@FreeBSD.org>	Removed __pure's and __pure2's. __pure is a no-op for recent versions of gcc by definition, and __pure2 is a no-op in effect (presumably the compiler can see when an inline function has no side effects).
# f7d6dab2	06-Oct-1996	John Dyson <dyson@FreeBSD.org>	Fix a problem with the page coloring code that the system will not always be able to use all of the free pages. This can manifest as a panic using DIAGNOSTIC, or as a panic on an indirect memory reference.
# 322dfc2b	28-Sep-1996	Bruce Evans <bde@FreeBSD.org>	Fixed undeclared variables for the !(PQ_L2_SIZE > 1) case. Removed redundant #include.
# a2f4a846	27-Sep-1996	John Dyson <dyson@FreeBSD.org>	Reviewed by: Submitted by: Obtained from:
# c7c34a24	14-Sep-1996	Bruce Evans <bde@FreeBSD.org>	Attached vm ddb commands `show map', `show vmochk', `show object', `show vmopag', `show page' and `show pageq'. Moved all vm ddb stuff to the ends of the vm source files. Changed printf() to db_printf(), `indent' to db_indent, and iprintf() to db_iprintf() in ddb commands. Moved db_indent and db_iprintf() from vm to ddb. vm_page.c: Don't use __pure. Staticized. db_output.c: Reduced page width from 80 to 79 to inhibit double spacing for long lines (there are still some problems if words are printed across column 79).
# 5070c7f8	08-Sep-1996	John Dyson <dyson@FreeBSD.org>	Addition of page coloring support. Various levels of coloring are afforded. The default level works with minimal overhead, but one can also enable full, efficient use of a 512K cache. (Parameters can be generated to support arbitrary cache sizes also.)
# 67bf6868	29-Jul-1996	John Dyson <dyson@FreeBSD.org>	Backed out the recent changes/enhancements to the VM code. The problem with the 'shell scripts' was found, but there was a 'strange' problem found with a 486 laptop that we could not find. This commit backs the code back to 25-jul, and will be re-entered after the snapshot in smaller (more easily tested) chunks.
# 4f4d35ed	26-Jul-1996	John Dyson <dyson@FreeBSD.org>	This commit is meant to solve a couple of VM system problems or performance issues. 1) The pmap module has had too many inlines, and so the object file is simply bigger than it needs to be. Some common code is also merged into subroutines. 2) Removal of some evil PHYS_TO_VM_PAGE macro calls. Unfortunately, a few have needed to be added also. The removal caused the need for more vm_page_lookups. I added lookup hints to minimize the need for the page table lookup operations. 3) Removal of some bogus performance improvements, that mostly made the code more complex (tracking individual page table page updates unnecessarily). Those improvements actually hurt 386 processors perf (not that people who worry about perf use 386 processors anymore :-)). 4) Changed pv queue manipulations/structures to be TAILQ's. 5) The pv queue code has had some performance problems since day one. Some significant scalability issues are resolved by threading the pv entries from the pmap AND the physical address instead of just the physical address. This makes certain pmap operations run much faster. This does not affect most micro-benchmarks, but should help loaded system performance significantly. DG helped and came up with most of the solution for this one. 6) Most if not all pmap bit operations follow the pattern: pmap_test_bit(); pmap_clear_bit(); That made for twice the necessary pv list traversal. The pmap interface now supports only pmap_tc_bit type operations: pmap_[test/clear]_modified, pmap_[test/clear]_referenced. Additionally, the modified routine now takes a vm_page_t arg instead of a phys address. This eliminates a PHYS_TO_VM_PAGE operation. 7) Several rewrites of routines that contain redundant code to use common routines, so that there is a greater likelihood of keeping the cache footprint smaller.
# 38efa82b	25-Jun-1996	John Dyson <dyson@FreeBSD.org>	This commit does a couple of things: Re-enables the RSS limiting, and the routine is now tail-recursive, making it much more safe (eliminates the possiblity of kernel stack overflow.) Also, the RSS limiting is a little more intelligent about finding the likely objects that are pushing the process over the limit. Added some sysctls that help with VM system tuning. New sysctl features: 1) Enable/disable lru pageout algorithm. vm.pageout_algorithm = 0, default algorithm that works well, especially using X windows and heavy memory loading. Can have adverse effects, sometimes slowing down program loading. vm.pageout_algorithm = 1, close to true LRU. Works much better than clock, etc. Does not work as well as the default algorithm in general. Certain memory "malloc" type benchmarks work a little better with this setting. Please give me feedback on the performance results associated with these. 2) Enable/disable swapping. vm.swapping_enabled = 1, default. vm.swapping_enabled = 0, useful for cases where swapping degrades performance. The config option "NO_SWAPPING" is still operative, and takes precedence over the sysctl. If "NO_SWAPPING" is specified, the sysctl still exists, but "vm.swapping_enabled" is hard-wired to "0". Each of these can be changed "on the fly."
# 2a4eb04b	20-Jun-1996	John Dyson <dyson@FreeBSD.org>	Improve algorithm for page hash queue. It was previously about as bad as it could be. This algorithm appears to improve fork performance (barely) measurably.
# ef743ce6	16-Jun-1996	John Dyson <dyson@FreeBSD.org>	Several bugfixes/improvements: 1) Make it much less likely to miss a wakeup in vm_page_free_wakeup 2) Create a new entry point into pmap: pmap_ts_referenced, eliminates the need to scan the pv lists twice in many cases. Perhaps there is alot more to do here to work on minimizing pv list manipulation 3) Minor improvements to vm_pageout including the use of pmap_ts_ref. 4) Major changes and code improvement to pmap. This code has had several serious bugs in page table page manipulation. In order to simplify the problem, and hopefully solve it for once and all, page table pages are no longer "managed" with the pv list stuff. Page table pages are only (mapped and held/wired) or (free and unused) now. Page table pages are never inactive, active or cached. These changes have probably fixed the hold count problems, but if they haven't, then the code is simpler anyway for future bugfixing. 5) The pmap code has been sorely in need of re-organization, and I have taken a first (of probably many) steps. Please tell me if you have any ideas.
# b5b40fa6	16-Jun-1996	John Dyson <dyson@FreeBSD.org>	Various bugfixes/cleanups from me and others: 1) Remove potential race conditions on waking up in vm_page_free_wakeup by making sure that it is at splvm(). 2) Fix another bug in vm_map_simplify_entry. 3) Be more complete about converting from default to swap pager when an object grows to be large enough that there can be a problem with data structure allocation under low memory conditions. 4) Make some madvise code more efficient. 5) Added some comments.
# 419702a4	12-Jun-1996	John Dyson <dyson@FreeBSD.org>	Fix a very significant cnt.v_wire_count leak in vm_page.c, and some minor leaks in pmap.c. Bruce Evans made me aware of this problem.
# 886d3e11	08-Jun-1996	John Dyson <dyson@FreeBSD.org>	Adjust the threshold for blocking on movement of pages from the cache queue in vm_fault. Move the PG_BUSY in vm_fault to the correct place. Remove redundant/unnecessary code in pmap.c. Properly block on rundown of page table pages, if they are busy. I think that the VM system is in pretty good shape now, and the following individuals (among others, in no particular order) have helped with this recent bunch of bugs, thanks! If I left anyone out, I apologize! Stephen McKay, Stephen Hocking, Eric J. Chet, Dan O'Brien, James Raynard, Marc Fournier.
# 6b6f0008	04-Jun-1996	John Dyson <dyson@FreeBSD.org>	Keep page-table pages from ever being sensed as dirty. This should fix some problems with the page-table page management code, since it can't deal with the notion of page-table pages being paged out or in transit. Also, clean up some stylistic issues per some suggestions from Stephen McKay.
# f35329ac	30-May-1996	John Dyson <dyson@FreeBSD.org>	This commit is dual-purpose, to fix more of the pageout daemon queue corruption problems, and to apply Gary Palmer's code cleanups. David Greenman helped with these problems also. There is still a hang problem using X in small memory machines.
# f777ab7b	23-May-1996	John Dyson <dyson@FreeBSD.org>	Add an assert to vm_page_cache. We should never cache a dirty page.
# b18bfc3d	17-May-1996	John Dyson <dyson@FreeBSD.org>	This set of commits to the VM system does the following, and contain contributions or ideas from Stephen McKay <syssgm@devetir.qld.gov.au>, Alan Cox <alc@cs.rice.edu>, David Greenman <davidg@freebsd.org> and me: More usage of the TAILQ macros. Additional minor fix to queue.h. Performance enhancements to the pageout daemon. Addition of a wait in the case that the pageout daemon has to run immediately. Slightly modify the pageout algorithm. Significant revamp of the pmap/fork code: 1) PTE's and UPAGES's are NO LONGER in the process's map. 2) PTE's and UPAGES's reside in their own objects. 3) TOTAL elimination of recursive page table pagefaults. 4) The page directory now resides in the PTE object. 5) Implemented pmap_copy, thereby speeding up fork time. 6) Changed the pv entries so that the head is a pointer and not an entire entry. 7) Significant cleanup of pmap_protect, and pmap_remove. 8) Removed significant amounts of machine dependent fork code from vm_glue. Pushed much of that code into the machine dependent pmap module. 9) Support more completely the reuse of already zeroed pages (Page table pages and page directories) as being already zeroed. Performance and code cleanups in vm_map: 1) Improved and simplified allocation of map entries. 2) Improved vm_map_copy code. 3) Corrected some minor problems in the simplify code. Implemented splvm (combo of splbio and splimp.) The VM code now seldom uses splhigh. Improved the speed of and simplified kmem_malloc. Minor mod to vm_fault to avoid using pre-zeroed pages in the case of objects with backing objects along with the already existant condition of having a vnode. (If there is a backing object, there will likely be a COW... With a COW, it isn't necessary to start with a pre-zeroed page.) Minor reorg of source to perhaps improve locality of ref.
# 30dcfc09	27-Mar-1996	John Dyson <dyson@FreeBSD.org>	VM performance improvements, and reorder some operations in VM fault in anticipation of a fix in pmap that will allow the mlock system call to work without panicing the system.
# 8169788f	11-Mar-1996	Peter Wemm <peter@FreeBSD.org>	Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all files are off the vendor branch, so this should not change anything. A "U" marker generally means that the file was not changed in between the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally means that there was a change.
# 9ee58740	08-Mar-1996	John Dyson <dyson@FreeBSD.org>	Modify a threshold for waking up the pageout daemon. Also, add a consistancy check for making sure that held pages aren't freed (DG).
# de5f6a77	01-Mar-1996	John Dyson <dyson@FreeBSD.org>	1) Eliminate unnecessary bzero of UPAGES. 2) Eliminate unnecessary copying of pages during/after forks. 3) Add user map simplification.
# 324e9ed2	26-Jan-1996	Bruce Evans <bde@FreeBSD.org>	Added a `boundary' arg to vm_alloc_page_contig(). Previously the only way to avoid crossing a 64K DMA boundary was to specify an alignment greater than the size even when the alignment didn't matter, and for sizes larger than a page, this reduced the chance of finding enough contiguous pages. E.g., allocations of 8K not crossing a 64K boundary previously had to be allocated on 8K boundaries; now they can be allocated on any 4K boundary except (64 * n + 60)K. Fixed bugs in vm_alloc_page_contig(): - the last page wasn't allocated for sizes smaller than a page. - failures of kmem_alloc_pageable() weren't handled. Mutated vm_page_alloc_contig() to create a more convenient interface named contigmalloc(). This is the same as the one in 1.1.5 except it has `low' and `high' args, and the `alignment' and `boundary' args are multipliers instead of masks.
# bd7e5f99	18-Jan-1996	John Dyson <dyson@FreeBSD.org>	Eliminated many redundant vm_map_lookup operations for vm_mmap. Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish overhead for merged cache. Efficiency improvement for vfs_cluster. It used to do alot of redundant calls to cluster_rbuild. Correct the ordering for vrele of .text and release of credentials. Use the selective tlb update for 486/586/P6. Numerous fixes to the size of objects allocated for files. Additionally, fixes in the various pagers. Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs. Fixes in the swap pager for exhausted resources. The pageout code will not as readily thrash. Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE), thereby improving efficiency of several routines. Eliminate even more unnecessary vm_page_protect operations. Significantly speed up process forks. Make vm_object_page_clean more efficient, thereby eliminating the pause that happens every 30seconds. Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the case of filesystems mounted async. Fix a panic with busy pages when write clustering is done for non-VMIO buffers.
# 0e41ee30	04-Jan-1996	Garrett Wollman <wollman@FreeBSD.org>	Convert DDB to new-style option.
# f2c6b65b	17-Dec-1995	Bruce Evans <bde@FreeBSD.org>	Fixed 1TB filesize changes. Some pindexes had bogus names and types but worked because vm_pindex_t is indistinuishable from vm_offset_t.
# f708ef1b	14-Dec-1995	Poul-Henning Kamp <phk@FreeBSD.org>	Another mega commit to staticize things.
# ec07c60c	11-Dec-1995	John Dyson <dyson@FreeBSD.org>	Some DIAGNOSTIC code was enabled all of the time in error. The diagnostic code is now conditional on #ifdef DIAGNOSTIC again.
# a316d390	10-Dec-1995	John Dyson <dyson@FreeBSD.org>	Changes to support 1Tb filesizes. Pages are now named by an (object,index) pair instead of (object,offset) pair.
# efeaf95a	06-Dec-1995	David Greenman <dg@FreeBSD.org>	Untangled the vm.h include file spaghetti.
# cac597e4	02-Dec-1995	Bruce Evans <bde@FreeBSD.org>	Completed function declarations and/or added prototypes. Staticized some functions. __purified some functions. Some functions were bogusly declared as returning `const'. This hasn't done anything since gcc-2.5. For later versions of gcc, the equivalent is __attribute__((const)) at the end of function declarations.
# 3af76890	19-Nov-1995	Poul-Henning Kamp <phk@FreeBSD.org>	Remove unused vars & funcs, make things static, protoize a little bit.
# a91c5a7e	22-Oct-1995	John Dyson <dyson@FreeBSD.org>	Get rid of machine-dependent NBPG and replace with PAGE_SIZE.
# f70f05f2	03-Sep-1995	John Dyson <dyson@FreeBSD.org>	Machine independent changes to support pre-zeroed free pages. This significantly improves demand-zero performance.
# 4589a4b5	03-Sep-1995	John Dyson <dyson@FreeBSD.org>	New subroutine "vm_page_set_validclean" for a vfs_bio improvement.
# b367ddb1	19-Jul-1995	David Greenman <dg@FreeBSD.org>	#if 0'd one of the DIAGNOSTIC checks in vm_page_alloc(). It was too expensive for "normal" use.
# 24a1cce3	13-Jul-1995	David Greenman <dg@FreeBSD.org>	NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct proc or any VM system structure will have to be rebuilt!!! Much needed overhaul of the VM system. Included in this first round of changes: 1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages, haspage, and sync operations are supported. The haspage interface now provides information about clusterability. All pager routines now take struct vm_object's instead of "pagers". 2) Improved data structures. In the previous paradigm, there is constant confusion caused by pagers being both a data structure ("allocate a pager") and a collection of routines. The idea of a pager structure has escentially been eliminated. Objects now have types, and this type is used to index the appropriate pager. In most cases, items in the pager structure were duplicated in the object data structure and thus were unnecessary. In the few cases that remained, a un_pager structure union was created in the object to contain these items. 3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now be removed. For instance, vm_object_enter(), vm_object_lookup(), vm_object_remove(), and the associated object hash list were some of the things that were removed. 4) simple_lock's removed. Discussion with several people reveals that the SMP locking primitives used in the VM system aren't likely the mechanism that we'll be adopting. Even if it were, the locking that was in the code was very inadequate and would have to be mostly re-done anyway. The locking in a uni-processor kernel was a no-op but went a long way toward making the code difficult to read and debug. 5) Places that attempted to kludge-up the fact that we don't have kernel thread support have been fixed to reflect the reality that we are really dealing with processes, not threads. The VM system didn't have complete thread support, so the comments and mis-named routines were just wrong. We now use tsleep and wakeup directly in the lock routines, for instance. 6) Where appropriate, the pagers have been improved, especially in the pager_alloc routines. Most of the pager_allocs have been rewritten and are now faster and easier to maintain. 7) The pagedaemon pageout clustering algorithm has been rewritten and now tries harder to output an even number of pages before and after the requested page. This is sort of the reverse of the ideal pagein algorithm and should provide better overall performance. 8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup have been removed. Some other unnecessary casts have also been removed. 9) Some almost useless debugging code removed. 10) Terminology of shadow objects vs. backing objects straightened out. The fact that the vm_object data structure escentially had this backwards really confused things. The use of "shadow" and "backing object" throughout the code is now internally consistent and correct in the Mach terminology. 11) Several minor bug fixes, including one in the vm daemon that caused 0 RSS objects to not get purged as intended. 12) A "default pager" has now been created which cleans up the transition of objects to the "swap" type. The previous checks throughout the code for swp->pg_data != NULL were really ugly. This change also provides the rudiments for future backing of "anonymous" memory by something other than the swap pager (via the vnode pager, for example), and it allows the decision about which of these pagers to use to be made dynamically (although will need some additional decision code to do this, of course). 13) (dyson) MAP_COPY has been deprecated and the corresponding "copy object" code has been removed. MAP_COPY was undocumented and non- standard. It was furthermore broken in several ways which caused its behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will continue to work correctly, but via the slightly different semantics of MAP_PRIVATE. 14) (dyson) Sharing maps have been removed. It's marginal usefulness in a threads design can be worked around in other ways. Both #12 and #13 were done to simplify the code and improve readability and maintain- ability. (As were most all of these changes) TODO: 1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing this will reduce the vnode pager to a mere fraction of its current size. 2) Rewrite vm_fault and the swap/vnode pagers to use the clustering information provided by the new haspage pager interface. This will substantially reduce the overhead by eliminating a large number of VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be improved to provide both a "behind" and "ahead" indication of contiguousness. 3) Implement the extended features of pager_haspage in swap_pager_haspage(). It currently just says 0 pages ahead/behind. 4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps via a much more general mechanism that could also be used for disk striping of regular filesystems. 5) Do something to improve the architecture of vm_object_collapse(). The fact that it makes calls into the swap pager and knows too much about how the swap pager operates really bothers me. It also doesn't allow for collapsing of non-swap pager objects ("unnamed" objects backed by other pagers).
# 9b2e5354	30-May-1995	Rodney W. Grimes <rgrimes@FreeBSD.org>	Remove trailing whitespace.
# c3cb3e12	15-Apr-1995	David Greenman <dg@FreeBSD.org>	Moved some zero-initialized variables into .bss. Made code intended to be called only from DDB #ifdef DDB. Removed some completely unused globals.
# 8c3d9c40	16-Apr-1995	David Greenman <dg@FreeBSD.org>	Removed gratuitous m->blah=0 assignments when initializing the vm_page structs in vm_page_startup(). The vm_page structs are already completely zeroed.
# 2fdccd5e	16-Apr-1995	David Greenman <dg@FreeBSD.org>	Make "print_page_info" #ifdef DDB.
# f6b04d2b	09-Apr-1995	David Greenman <dg@FreeBSD.org>	Changes from John Dyson and myself: Fixed remaining known bugs in the buffer IO and VM system. vfs_bio.c: Fixed some race conditions and locking bugs. Improved performance by removing some (now) unnecessary code and fixing some broken logic. Fixed process accounting of # of FS outputs. Properly handle NFS interrupts (B_EINTR). (various) Replaced calls to clrbuf() with calls to an optimized routine called vfs_bio_clrbuf(). (various FS sync) Sync out modified vnode_pager backed pages. ffs_vnops.c: Do two passes: Sync out file data first, then indirect blocks. vm_fault.c: Fixed deadly embrace caused by acquiring locks in the wrong order. vnode_pager.c: Changed to use buffer I/O system for writing out modified pages. This should fix the problem with the modification date previous not getting updated. Also dramatically simplifies the code. Note that this is going to change in the future and be implemented via VOP_PUTPAGES(). vm_object.c: Fixed a pile of bugs related to cleaning (vnode) objects. The performance of vm_object_page_clean() is terrible when dealing with huge objects, but this will change when we implement a binary tree to keep the object pages sorted. vm_pageout.c: Fixed broken clustering of pageouts. Fixed race conditions and other lockup style bugs in the scanning of pages. Improved performance.
# 7fd9a8b1	25-Mar-1995	David Greenman <dg@FreeBSD.org>	Implemented cnt.v_reactivated and moved vm_page_activate() routine to before vm_page_deactivate().
# edf8a815	19-Mar-1995	David Greenman <dg@FreeBSD.org>	Removed redundant newlines that were in some panic strings.
# 806e3860	17-Mar-1995	David Greenman <dg@FreeBSD.org>	In vm_page_alloc_contig: Removed a redundant semicolon and used 'm' instead of &pga[i] in one place.
# b5e8ce9f	16-Mar-1995	Bruce Evans <bde@FreeBSD.org>	Add and move declarations to fix all of the warnings from `gcc -Wimplicit' (except in netccitt, netiso and netns) and most of the warnings from `gcc -Wnested-externs'. Fix all the bugs found. There were no serious ones.
# f919ebde	01-Mar-1995	David Greenman <dg@FreeBSD.org>	Various changes from John and myself that do the following: New functions create - vm_object_pip_wakeup and pagedaemon_wakeup that are used to reduce the actual number of wakeups. New function vm_page_protect which is used in conjuction with some new page flags to reduce the number of calls to pmap_page_protect. Minor changes to reduce unnecessary spl nesting. Rewrote vm_page_alloc() to improve readability. Various other mostly cosmetic changes.
# 6f2b142e	22-Feb-1995	David Greenman <dg@FreeBSD.org>	vm_page.c: Use request==VM_ALLOC_NORMAL rather than object!=kmem_object in deciding if the caller is "important" in vm_page_alloc(). Also established a new low threshold for non-interrupt allocations via cnt.v_interrupt_free_min. vm_pageout.c: Various algorithmic cleanup. Some calculations simplified. Initialize cnt.v_interrupt_free_min to 2 pages. Submitted by: John Dyson
# 5e716206	22-Feb-1995	David Greenman <dg@FreeBSD.org>	Just return in the case of a page not on any queue in vm_page_unqueue(). Return VM_PAGE_BITS_ALL even if size > PAGE_SIZE in vm_page_bits(). Submitted by: John Dyson
# d0686727	20-Feb-1995	David Greenman <dg@FreeBSD.org>	Don't allow act_count to exceed ACT_MAX when bumping it up. Small optimization to vm_page_bits(). Submitted by: John Dyson
# d89ced81	20-Feb-1995	David Greenman <dg@FreeBSD.org>	Fully initialize pages returned via vm_page_alloc_contig() so that the memory can be later freed.
# a1f6d91c	02-Feb-1995	David Greenman <dg@FreeBSD.org>	swap_pager.c: Fixed long standing bug in freeing swap space during object collapses. Fixed 'out of space' messages from printing out too often. Modified to use new kmem_malloc() calling convention. Implemented an additional stat in the swap pager struct to count the amount of space allocated to that pager. This may be removed at some point in the future. Minimized unnecessary wakeups. vm_fault.c: Don't try to collect fault stats on 'swapped' processes - there aren't any upages to store the stats in. Changed read-ahead policy (again!). vm_glue.c: Be sure to gain a reference to the process's map before swapping. Be sure to lose it when done. kern_malloc.c: Added the ability to specify if allocations are at interrupt time or are 'safe'; this affects what types of pages can be allocated. vm_map.c: Fixed a variety of map lock problems; there's still a lurking bug that will eventually bite. vm_object.c: Explicitly initialize the object fields rather than bzeroing the struct. Eliminated the 'rcollapse' code and folded it's functionality into the "real" collapse routine. Moved an object_unlock() so that the backing_object is protected in the qcollapse routine. Make sure nobody fools with the backing_object when we're destroying it. Added some diagnostic code which can be called from the debugger that looks through all the internal objects and makes certain that they all belong to someone. vm_page.c: Fixed a rather serious logic bug that would result in random system crashes. Changed pagedaemon wakeup policy (again!). vm_pageout.c: Removed unnecessary page rotations on the inactive queue. Changed the number of pages to explicitly free to just free_reserved level. Submitted by: John Dyson
# 6d40c3d3	24-Jan-1995	David Greenman <dg@FreeBSD.org>	Added ability to detect sequential faults and DTRT. (swap_pager.c) Added hook for pmap_prefault() and use symbolic constant for new third argument to vm_page_alloc() (vm_fault.c, various) Changed the way that upages and page tables are held. (vm_glue.c) Fixed architectural flaw in allocating pages at interrupt time that was introduced with the merged cache changes. (vm_page.c, various) Adjusted some algorithms to acheive better paging performance and to accomodate the fix for the architectural flaw mentioned above. (vm_pageout.c) Fixed pbuf handling problem, changed policy on handling read-behind page. (vnode_pager.c) Submitted by: John Dyson
# edfab85b	15-Jan-1995	David Greenman <dg@FreeBSD.org>	Moved some splx's down a few lines in vm_page_insert and vm_page_remove to make the locking a bit more clear - this change is currently a NOP as the calls to those routines are already at splhigh().
# a776a317	10-Jan-1995	David Greenman <dg@FreeBSD.org>	Kill VM_PAGE_INIT macro as it is only used once and makes the code more difficult to understand. Got rid of unused vm_page flags.
# 480dff54	10-Jan-1995	David Greenman <dg@FreeBSD.org>	Fixed some formatting weirdness that I overlooked in the previous commit.
# 0d94caff	09-Jan-1995	David Greenman <dg@FreeBSD.org>	These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
# 47c9acfd	23-Oct-1994	David Greenman <dg@FreeBSD.org>	Changed a thread_sleep into an spl protected tsleep. A deadlock can occur otherwise. Minor efficiency improvement in vm_page_free(). Submitted by: John Dyson
# a58d1fa1	18-Oct-1994	David Greenman <dg@FreeBSD.org>	Fix the remaining vmmeter counters. They all now work correctly.
# 05f0fdd2	08-Oct-1994	Poul-Henning Kamp <phk@FreeBSD.org>	Cosmetics: unused vars, ()'s, #include's &c &c to silence gcc. Reviewed by: davidg
# 1ffd2a2c	27-Sep-1994	David Greenman <dg@FreeBSD.org>	Previous commit should have read ...in vm_page_alloc_contig(). ...(this commit): moved initialization of 'start' to make it more clear that it is initialized properly (also in vm_page_alloc_contig). Reviewed by: Submitted by: Obtained from:
# 5992708a	27-Sep-1994	David Greenman <dg@FreeBSD.org>	Fixed another bug, and cleaned up the code.
# 0d040c7e	27-Sep-1994	David Greenman <dg@FreeBSD.org>	Fixed multiple bugs in previous version of vm_page_alloc_contig.
# d3c2cf7a	27-Sep-1994	David Greenman <dg@FreeBSD.org>	1) New "vm_page_alloc_contig" routine by me. 2) Created a new vm_page flag "PG_FREE" to help track free pages. 3) Use PG_FREE flag to detect inconsistencies in a few places.
# 28b5c68f	09-Aug-1994	David Greenman <dg@FreeBSD.org>	Fixed vm_page_deactivate to deal with getting called with a page that's not on any queue. This is an old patchkit days fix. Reviewed by: John Dyson and David Greenman Submitted by: originally by Paul Mackerras
# a481f200	07-Aug-1994	David Greenman <dg@FreeBSD.org>	Provide support for upcoming merged VM/buffer cache, and fixed a few bugs that haven't appeared to manifest themselves (yet). Submitted by: John Dyson
# 03e6c253	01-Aug-1994	David Greenman <dg@FreeBSD.org>	Removed all code related to the pagescan daemon, and changed 'act_count' adjustments to compensate for a world without the pagescan daemon.
# 26f9a767	25-May-1994	Rodney W. Grimes <rgrimes@FreeBSD.org>	The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch. Reviewed by: Rodney W. Grimes Submitted by: John Dyson and David Greenman
# df8bae1d	24-May-1994	Rodney W. Grimes <rgrimes@FreeBSD.org>	BSD 4.4 Lite Kernel Sources