290362 |
04-Nov-2015 |
glebius |
o Fix regressions related to the SA-15:25 upgrade of NTP. [1]
o Fix kqueue write events never being fired for files greater than 2GB. [2]
o Fix applications exiting due to segmentation violation on a correct memory address. [3]
PR: 204046 [1] PR: 204203 [1] Errata Notice: FreeBSD-EN-15:19.kqueue [2] Errata Notice: FreeBSD-EN-15:20.vm [3] Approved by: so |
273099 |
14-Oct-2014 |
kib |
MFC r272907: Make MAP_NOSYNC handling in the vm_fault() read-locked object path compatible with write-locked path.
Approved by: re (marius) |
273007 |
12-Oct-2014 |
alc |
MFS: r272543 (r271351 on HEAD) Fix a boundary case error in vm_reserv_alloc_contig().
Approved by: re (kib) |
272461 |
03-Oct-2014 |
gjb |
Copy stable/10@r272459 to releng/10.1 as part of the 10.1-RELEASE process.
Approved by: re (implicit) Sponsored by: The FreeBSD Foundation
|
272221 |
27-Sep-2014 |
smh |
MFC r272071: Fix ticks wrap issue of lowmem test in vm_pageout_scan
Approved by: re (kib) Sponsored by: Multiplay
|
272202 |
27-Sep-2014 |
kib |
MFC r272036: Avoid calling vm_map_pmap_enter() for the MADV_WILLNEED on the wired entry, the pages must be already mapped.
Approved by: re (gjb)
|
271925 |
21-Sep-2014 |
kib |
MFC r271586: Fix mis-spelling of bits and types names in the vnode_pager_putpages().
Approved by: re (delphij)
|
270996 |
03-Sep-2014 |
alc |
This is a direct commit to account for the renaming of 'cnt' to 'vm_cnt' in HEAD but not stable/10.
|
270995 |
03-Sep-2014 |
alc |
MFC r270666 Back in the days when the kernel was single threaded, testing "vm_paging_target() > 0" was a reasonable way of determining if the inactive queue scan met its target. However, now that other threads can be allocating pages while the inactive queue scan is running, it's an unreliable method. The effect of it being unreliable is that we can start swapping out processes when we didn't intend to.
This issue has existed since the kernel was multithreaded, but the changes to the inactive queue target in 10.0-RELEASE have made its effects visible.
This change introduces a more direct method for determining if the inactive queue scan met its target that is not affected by the actions of other threads.
|
270920 |
01-Sep-2014 |
kib |
Fix a leak of wired pages when unwiring a PROT_NONE-mapped wired region. Rework the handling of unwiring to do it in batches, both at the pmap and object level.
All commits below are by alc.
MFC r268327: Introduce pmap_unwire().
MFC r268591: Implement pmap_unwire() for powerpc.
MFC r268776: Implement pmap_unwire() for arm.
MFC r268806: pmap_unwire(9) man page.
MFC r269134: When unwiring a region of an address space, do not assume that the underlying physical pages are mapped by the pmap. This fixes a leak of the wired pages on the unwiring of the region mapped with no access allowed.
MFC r269339: In the implementation of the new function pmap_unwire(), the call to MOEA64_PVO_TO_PTE() must be performed before any changes are made to the PVO. Otherwise, MOEA64_PVO_TO_PTE() will panic.
MFC r269365: Correct a long-standing problem in moea{,64}_pvo_enter() that was revealed by the combination of r268591 and r269134: When we attempt to add the wired attribute to an existing mapping, moea{,64}_pvo_enter() do nothing. (They only set the wired attribute on newly created mappings.)
MFC r269433: Handle wiring failures in vm_map_wire() with the new functions pmap_unwire() and vm_object_unwire(). Retire vm_fault_{un,}wire(), since they are no longer used.
MFC r269438: Rewrite a loop in vm_map_wire() so that gcc doesn't think that the variable "rv" is uninitialized.
MFC r269485: Retire pmap_change_wiring().
Reviewed by: alc
|
270630 |
25-Aug-2014 |
kib |
MFC r270011: Implement 'fast path' for the vm page fault handler.
MFC r270387 (by alc): Relax one of the conditions for mapping a page on the fast path.
Approved by: re (gjb)
|
270629 |
25-Aug-2014 |
kib |
MFC r261647 (by alc): Don't call vm_fault_prefault() on zero-fill faults.
|
270628 |
25-Aug-2014 |
kib |
MFC r261412 (by alc): Make prefaulting more aggressive on hard faults.
|
270627 |
25-Aug-2014 |
kib |
MFC r269978 (by alc): Avoid pointless (but harmless) actions on unmanaged pages.
|
270440 |
24-Aug-2014 |
kib |
MFC r269746: Adapt vm_page_aflag_set(PGA_WRITEABLE) to the locking of pmap_enter(PMAP_ENTER_NOSLEEP).
|
270439 |
24-Aug-2014 |
kib |
Merge the changes to pmap_enter(9) for sleep-less operation (requested by flag). The ia64 pmap.c changes are direct commit, since ia64 is removed on head.
MFC r269368 (by alc): Retire PVO_EXECUTABLE.
MFC r269728: Change pmap_enter(9) interface to take flags parameter and superpage mapping size (currently unused).
MFC r269759 (by alc): Update the text of a KASSERT() to reflect the changes in r269728.
MFC r269822 (by alc): Change {_,}pmap_allocpte() so that they look for the flag PMAP_ENTER_NOSLEEP instead of M_NOWAIT/M_WAITOK when deciding whether to sleep on page table page allocation.
MFC r270151 (by alc): Replace KASSERT that no PV list locks are held with a conditional unlock.
Reviewed by: alc Approved by: re (gjb) Sponsored by: The FreeBSD Foundation
|
270205 |
20-Aug-2014 |
kib |
MFC r269907: Fix leaks of unqueued unwired pages.
|
269915 |
13-Aug-2014 |
kib |
MFC r269643: Weaken the requirement for the vm object lock by only asserting locked object in vm_pager_page_unswapped(), instead of locked exclusively.
|
269914 |
13-Aug-2014 |
kib |
MFC r269642: Add wrappers to assert that the vm object is unlocked, and for try-upgrade.
|
269174 |
28-Jul-2014 |
kib |
MFC r268615: Add OBJ_TMPFS_NODE flag.
MFC r268616: Set the OBJ_TMPFS_NODE flag for vm_object of VREG tmpfs node.
MFC r269053: Correct assertion. tmpfs vm object is always at the bottom of the shadow chain.
|
269072 |
24-Jul-2014 |
kib |
MFC r267213 (by alc): Add a page size field to struct vm_page.
Approved by: alc
|
267956 |
27-Jun-2014 |
kib |
MFC r267664: Assert that the new entry is inserted into the right location in the map entries list, and that it does not overlap with the previous and next entries.
|
267901 |
26-Jun-2014 |
kib |
MFC r267630: Add MAP_EXCL flag for mmap(2).
|
267899 |
26-Jun-2014 |
kib |
MFC r267766: Use correct names for the flags.
|
267772 |
23-Jun-2014 |
kib |
MFC r267254: Make mmap(MAP_STACK) search for the available address space.
MFC r267497 (by alc): Use local variable instead of sgrowsiz.
|
267751 |
22-Jun-2014 |
mav |
MFC r267391: Introduce new "256 Bucket" zone to split requests and reduce congestion on "128 Bucket" zone lock.
|
267750 |
22-Jun-2014 |
mav |
MFC r267387: When allocating a new bucket for a bucket zone, never take it from the zone itself, since that will almost certainly fail. Take the next bigger zone instead.
This situation should not happen with the original bucket zone configuration: the "32 Bucket" zone uses "64 Bucket" and vice versa. But if the "64 Bucket" zone lock is congested, the zone may grow its bucket size and start biting itself.
|
267059 |
04-Jun-2014 |
kib |
MFC r266780: Remove an assert which can be triggered by userspace.
|
266607 |
24-May-2014 |
kib |
MFC r266491: Remove a redundant loop.
|
266591 |
23-May-2014 |
alc |
MFC r259107 Eliminate a redundant parameter to vm_radix_replace().
Improve the wording of the comment describing vm_radix_replace().
|
266589 |
23-May-2014 |
alc |
MFC r265886, r265948 With the new-and-improved vm_fault_copy_entry() (r265843), we can always avoid soft page faults when adding write access to user wired entries in vm_map_protect(). Previously, we only avoided the soft page fault when the underlying pages were copy-on-write. In other words, we avoided the page faults that might sleep on page allocation, but not the trivial page faults to update the physical map.
On a fork allow read-only wired pages to be copy-on-write shared between the parent and child processes. Previously, we copied these pages even though they are read only. However, the reason for copying them is historical and no longer exists. In recent times, vm_map_protect() has developed the ability to copy pages when write access is added to wired copy-on-write pages. So, in this case, copy-on-write sharing of wired pages is not to be feared. It is not going to lead to copy-on-write faults on wired memory.
|
266582 |
23-May-2014 |
kib |
MFC r266464: In execve(2), postpone the free of old vmspace until the threads are resumed and exited.
|
266492 |
21-May-2014 |
pho |
MFC r265534:
msync(2) must return ENOMEM, not EINVAL, when the address is outside the allowed range or when one or more pages are not mapped. This is according to The Open Group Base Specifications Issue 7.
Sponsored by: EMC / Isilon storage division
|
266315 |
17-May-2014 |
alc |
MFC r265850 About 9% of the pmap_protect() calls being performed by vm_map_copy_entry() are unnecessary. Eliminate the unnecessary calls.
|
266304 |
17-May-2014 |
kib |
MFC r265843: For the upgrade case in vm_fault_copy_entry(), when the entry does not need COW and is writeable, do not create a new backing object for the entry.
MFC r265887: Fix locking.
|
266302 |
17-May-2014 |
kib |
MFC r265825: When printing the map with the ddb 'show procvm' command, do not dump page queues for the backing objects.
|
266299 |
17-May-2014 |
kib |
MFC r265824: Print the entry address in addition to the object.
|
265945 |
13-May-2014 |
alc |
MFC r265418 Prior to r254304, a separate function, vm_pageout_page_stats(), was used to periodically update the reference status of the active pages. This function was called, instead of vm_pageout_scan(), when memory was not scarce. The objective was to provide up to date reference status for active pages in case memory did become scarce and active pages needed to be deactivated.
The active page queue scan performed by vm_pageout_page_stats() was virtually identical to that performed by vm_pageout_scan(), and so r254304 eliminated vm_pageout_page_stats(). Instead, vm_pageout_scan() is called with the parameter "pass" set to zero. The intention was that when pass is zero, vm_pageout_scan() would only scan the active queue. However, the variable page_shortage can still be greater than zero when memory is not scarce and vm_pageout_scan() is called with pass equal to zero. Consequently, the inactive queue may be scanned and dirty pages laundered even though that was not intended by r254304. This revision fixes that.
|
265944 |
13-May-2014 |
alc |
MFC r260567 Correctly update the count of stuck pages, "addl_page_shortage", in vm_pageout_scan(). There were missing increments in two less common cases.
Don't conflate the count of stuck pages and the pageout deficit provided by vm_page_alloc{,_contig}().
Handle held pages consistently in the inactive queue scan. In the more common case, we did not move the page to the tail of the queue. Whereas, in the less common case, we did. There's no particular reason to move the page in the less common case, so remove it.
Perform the calculation of the page shortage for the active queue scan a little earlier, before the active queue lock is acquired. The correctness of this calculation doesn't depend on the active queue lock being held.
Eliminate a redundant variable, "pcount". Use the more descriptive variable, "maxscan", in its place.
Apply a few nearby style fixes, e.g., eliminate stray whitespace and excess parentheses.
|
265932 |
12-May-2014 |
des |
MFH (r264966): add sysctl OIDs for actual swap zone size and capacity
|
265435 |
06-May-2014 |
kib |
MFC r265100: Fix the comparison for the end of range in vm_phys_fictitious_reg_range().
|
265311 |
04-May-2014 |
kib |
MFC r265002: Fix vm_fault_copy_entry() operation on upgrade; allow it to find the pages in the shadowed objects.
|
263875 |
28-Mar-2014 |
kib |
MFC r263475: Fix two issues with /dev/mem access on amd64, both causing kernel page faults.
First, accesses to the direct map region should check against the limit up to which the direct map is instantiated.
Second, for accesses to the kernel map, use a new thread-private flag TDP_DEVMEMIO, which instructs vm_fault() to return an error when a fault happens on a MAP_ENTRY_NOFAULT entry, instead of panicking.
MFC r263498: Add change forgotten in r263475. Make dmaplimit accessible outside amd64/pmap.c.
|
263684 |
24-Mar-2014 |
kib |
MFC r263471: Initialize vm_map_entry member wiring_thread on the map entry creation.
|
263360 |
19-Mar-2014 |
kib |
MFC r263095: Initialize paddr to handle the case of zero size.
|
263359 |
19-Mar-2014 |
kib |
MFC r263092: Do not vdrop() the tmpfs vnode until it is unlocked. The hold reference might be the last, and then vdrop() would free the vnode.
|
262739 |
04-Mar-2014 |
glebius |
Merge r261722, r261723, r261724, r261725 from head: several minor improvements for UMA_ZPCPU_ZONE zones.
|
262737 |
04-Mar-2014 |
glebius |
Merge 261593 from head: Provide macros that allow easy export of uma(9) zone limits and current usage via sysctl(9).
|
262291 |
21-Feb-2014 |
attilio |
MFC r261867: Use the right index to free swapspace after vm_page_rename().
|
262127 |
17-Feb-2014 |
dim |
MFC r261896:
After r251709, avoid a clang 3.4 warning about an unused static const variable (uma_max_ipers), when asserts are disabled.
Reviewed by: glebius
|
261999 |
16-Feb-2014 |
marcel |
MFC r259908: For ia64, use pmap_remove_pages() and not pmap_remove().
|
260306 |
04-Jan-2014 |
mav |
MFC r258716:
- Add a bucket size column to the `show uma` DDB command.
- Add a `show umacache` command to show similar stats for cache-only UMA zones.
|
260305 |
04-Jan-2014 |
mav |
MFC r258693: Make UMA not blindly force offpage slab header allocation for large (> PAGE_SIZE) zones. If the zone size is not a multiple of PAGE_SIZE, there may be enough space for the header in the last page, so we may avoid the extra header memory allocation and hash table update/lookup.
ZFS creates a bunch of odd-sized UMA zones (5120, 6144, 7168, 10240, 14336). This change puts at least some of the otherwise lost memory there to good use.
|
260304 |
04-Jan-2014 |
mav |
MFC r258691: Don't count bucket allocation failures for UMA zones as their own failures. There are good reasons for such failures to happen, such as recursion prevention, etc., and they are not fatal since buckets are just an optimization mechanism. Real bucket allocation failures are in any case counted by the bucket zones themselves, and we don't need double accounting there.
|
260303 |
04-Jan-2014 |
mav |
MFC r258340, r258497: Implement mechanism to safely but slowly purge UMA per-CPU caches.
This is a last resort for very low memory conditions, in case other measures to free memory were ineffective. Sequentially cycle through all CPUs and extract per-CPU cache buckets into the zone cache, from where they can be freed.
|
260302 |
04-Jan-2014 |
mav |
MFC r258338: Grow UMA zone bucket size also on lock congestion during item free.
Lock congestion is the same whether it happens on alloc or free, so handle it equally. Now that we have back pressure, there is no problem with growing buckets a bit faster. In any case, growth is much slower than in 9.x.
|
260301 |
04-Jan-2014 |
mav |
MFC r258337: Add two new UMA bucket zones to store 3 and 9 items per bucket.
These new buckets make bucket size self-tuning softer and more precise. Without them, there are buckets for 1, 5, 13, 29, ... items. While at bigger sizes a difference of about 2x is fine, at the smallest sizes it is 5x and 2.6x respectively. The new buckets make that line look like 1, 3, 5, 9, 13, 29, reducing the jumps between steps and making the algorithm work softer, allocating and freeing memory in better-fitting chunks. Otherwise there is quite a big gap between allocating 128K and 5x128K of RAM at once.
|
260300 |
04-Jan-2014 |
mav |
MFC r258336: Implement soft pressure on UMA cache bucket sizes.
Every time the system detects a low memory condition, decrease the bucket size for each zone by one item. As a result, higher memory pressure will push toward smaller bucket sizes, and so smaller per-CPU caches and more efficient memory use.
Before this change there was no force opposing bucket growth resulting from the practically inevitable zone lock conflicts, and after some run time per-CPU caches could consume enough RAM to kill the system.
|
260280 |
04-Jan-2014 |
glebius |
Merge r258690 by mav from head: Fix a bug introduced in r252226, where the udata argument passed to bucket_alloc() was used without first making sure that it was really intended for us.
On some of my systems this bug made the user argument passed by ZFS code to uma_zalloc_arg() unexpectedly block UMA per-CPU caches for those zones.
|
260081 |
30-Dec-2013 |
kib |
MFC r259951: Do not coalesce stack entries. Pass the MAP_STACK_GROWS_DOWN and MAP_STACK_GROWS_UP flags to vm_map_insert() from vm_map_stack().
|
259991 |
28-Dec-2013 |
dim |
MFC r259893:
In sys/vm/vm_pageout.c, since vm_pageout_worker() takes a void * as argument, cast the incoming 0 argument to void *, to silence a warning from clang 3.4 ("expression which evaluates to zero treated as a null pointer constant of type 'void *' [-Wnon-literal-null-conversion]").
|
259499 |
17-Dec-2013 |
kib |
MFC r258039: Avoid overflow for the page counts.
MFC r258365: Revert back to using int for the page counts. Rearrange the checks to correctly handle overflowing address arithmetic.
|
259299 |
13-Dec-2013 |
kib |
MFC r258367: Check for zero-length requests and act as if the request is always successful, without performing any action on the address space.
|
259297 |
13-Dec-2013 |
kib |
MFC r258366: Add assertions to cover all places in the wiring and unwiring code where MAP_ENTRY_IN_TRANSITION is set or cleared.
|
259296 |
13-Dec-2013 |
kib |
MFC r257899: If filesystem declares that it supports shared locking for writes, use shared vnode lock for VOP_PUTPAGES() as well.
|
258911 |
04-Dec-2013 |
rodrigc |
MFC r258737
In keg_dtor(), print out the keg name in the "Freed UMA keg was not empty" message printed to the console. This makes it easier to track down the source of certain memory leaks.
Suggested by: adrian Approved by: re (gjb)
|
258037 |
12-Nov-2013 |
kib |
MFC r257680: Do not coalesce if the swap object belongs to tmpfs vnode.
Approved by: re (glebius)
|
256281 |
10-Oct-2013 |
gjb |
Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.
Approved by: re (implicit) Sponsored by: The FreeBSD Foundation
|
256275 |
10-Oct-2013 |
alc |
Tidy up the output of "sysctl vm.phys_free".
Approved by: re (glebius) Sponsored by: EMC / Isilon Storage Division
|
255793 |
22-Sep-2013 |
alc |
Both the vm_map and vmspace zones are defined as "no free". So, there is no point in defining a fini function for these zones.
Reviewed by: kib Approved by: re (glebius) Sponsored by: EMC / Isilon Storage Division
|
255732 |
20-Sep-2013 |
neel |
Merge the following changes from projects/bhyve_npt_pmap:
- add fields to 'struct pmap' that are required to manage nested page tables.
- add a parameter to 'vmspace_alloc()' that can be used to override the default pmap initialization routine 'pmap_pinit()'.
These changes are pushed ahead of the remaining changes in 'bhyve_npt_pmap' in anticipation of the upcoming KBI freeze for 10.0.
Reviewed by: kib@, alc@ Approved by: re (glebius)
|
255724 |
20-Sep-2013 |
alc |
The pmap function pmap_clear_reference() is no longer used. Remove it.
pmap_clear_reference() has had exactly one caller in the kernel for several years, more precisely, since FreeBSD 8. Now, that call no longer exists.
Approved by: re (kib) Sponsored by: EMC / Isilon Storage Division
|
255708 |
19-Sep-2013 |
jhb |
Extend the support for exempting processes from being killed when swap is exhausted.
- Add a new protect(1) command that can be used to set or revoke protection from arbitrary processes. Similar to ktrace it can apply a change to all existing descendants of a process as well as future descendants.
- Add a new procctl(2) system call that provides a generic interface for control operations on processes (as opposed to the debugger-specific operations provided by ptrace(2)). procctl(2) uses a combination of idtype_t and an id to identify the set of processes on which to operate, similar to wait6().
- Add a PROC_SPROTECT control operation to manage the protection status of a set of processes. MADV_PROTECT still works for backwards compatibility.
- Add a p_flag2 to struct proc (and a corresponding ki_flag2 to kinfo_proc), the first bit of which is used to track whether P_PROTECT should be inherited by new child processes.
Reviewed by: kib, jilles (earlier version) Approved by: re (delphij) MFC after: 1 month
|
255626 |
17-Sep-2013 |
kib |
PG_SLAB no longer serves a useful purpose, since m->object is no longer abused to store pointer to slab. Remove it.
Reviewed by: alc Sponsored by: The FreeBSD Foundation Approved by: re (hrs)
|
255608 |
16-Sep-2013 |
kib |
Remove zero-copy sockets code. It only worked for anonymous memory, and the equivalent functionality is now provided by sendfile(2) over a POSIX shared memory file descriptor.
Remove the cow member of struct vm_page, and rearrange the remaining members. While there, make hold_count unsigned.
Requested and reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation Approved by: re (delphij)
|
255566 |
14-Sep-2013 |
kib |
If the last page of the file is partially full and the whole valid portion is invalidated, invalidate the whole page. Otherwise, a partially valid page appears on a page queue, which is wrong. This could only happen for the last page, because only then could the buffer which triggered the invalidation not cover the whole page.
Reported and tested by: pho (previous version) Reviewed by: alc Sponsored by: The FreeBSD Foundation Approved by: re (delphij) MFC after: 2 weeks
|
255497 |
12-Sep-2013 |
jhb |
Fix an off-by-one error when populating mincore(2) entries for skipped entries. lastvecindex references the last valid byte, so the new bytes should come after it.
Approved by: re (kib) MFC after: 1 week
|
255426 |
09-Sep-2013 |
jhb |
Add a mmap flag (MAP_32BIT) on 64-bit platforms to request that a mapping use an address in the first 2GB of the process's address space. This flag should have the same semantics as the same flag on Linux.
To facilitate this, add a new parameter to vm_map_find() that specifies an optional maximum virtual address. While here, fix several callers of vm_map_find() to use a VMFS_* constant for the findspace argument instead of TRUE and FALSE.
Reviewed by: alc Approved by: re (kib)
|
255396 |
08-Sep-2013 |
kib |
Drain the xbusy state at two places which potentially do pmap_remove_all(). Not doing the drain allows pmap_enter() to proceed in parallel, voiding the effects of pmap_remove_all().
The race results in an invalidated page mapped wired by usermode.
Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation Approved by: re (glebius)
|
255244 |
05-Sep-2013 |
kib |
vm_page_trysbusy() should not fail when the shared busy counter or the VPB_BIT_WAITERS flag was changed between the read of busy_lock and the cas. vm_page_sbusy(), which is the only user of vm_page_trysbusy() in the tree, panics on failure, which in these cases is transient and does not mean that the current page state prevents sbusying.
Retry the operation inside vm_page_trysbusy() if the cas failed; only return failure when VPB_BIT_SHARED is cleared.
Reported and tested by: pho Reviewed by: attilio Sponsored by: The FreeBSD Foundation
|
255219 |
05-Sep-2013 |
pjd |
Change the cap_rights_t type from uint64_t to a structure that we can extend in the future in a backward compatible (API and ABI) way.
The cap_rights_t represents capability rights. We used to use one bit to represent one right, but we are running out of spare bits. Currently the new structure provides room for 114 rights (so 50 more than the previous cap_rights_t), but it is possible to grow the structure to hold at least 285 rights, although we can make it even larger if 285 rights won't be enough.
The structure definition looks like this:
struct cap_rights {
	uint64_t	cr_rights[CAP_RIGHTS_VERSION + 2];
};
The initial CAP_RIGHTS_VERSION is 0.
The top two bits in the first element of the cr_rights[] array contain total number of elements in the array - 2. This means if those two bits are equal to 0, we have 2 array elements.
The top two bits in all remaining array elements should be 0. The next five bits in all array elements contain the array index. Only one bit is set, and its position within this five-bit range defines the array index. This means there can be at most five array elements in the future.
To define new right the CAPRIGHT() macro must be used. The macro takes two arguments - an array index and a bit to set, eg.
#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)
We still support aliases that combine few rights, but the rights have to belong to the same array element, eg:
#define CAP_LOOKUP	CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD	CAPRIGHT(0, 0x0000000000002000ULL)
#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)
There is new API to manage the new cap_rights_t structure:
cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);
bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);
Capability rights to the cap_rights_init(), cap_rights_set(), cap_rights_clear() and cap_rights_is_set() functions are provided by separating them with commas, eg:
cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);
There is no need to terminate the list of rights, as those functions are actually macros that take care of the termination, eg:
#define cap_rights_set(rights, ...) \
	__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);
Thanks to using one bit as an array index we can assert in those functions that there are no two rights belonging to different array elements provided together. For example this is illegal and will be detected, because CAP_LOOKUP belongs to element 0 and CAP_PDKILL to element 1:
cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);
Providing several rights that belong to the same array element this way is correct, but not advised. It should only be used for alias definitions.
This commit also breaks compatibility with some existing Capsicum system calls, but I see no other way to do that. This should be fine as Capsicum is still experimental and this change is not going to 9.x.
Sponsored by: The FreeBSD Foundation
|
255097 |
31-Aug-2013 |
mckusick |
Fix bug introduced in the rewrite of keg_free_slab in -r251894. The consequence of the bug is that fini calls are not done when a slab is freed by a callback from the page daemon. It went unnoticed for two months because fini is little used.
I spotted the bug while reading the code to learn how it works so I could write it up for the next edition of the Design and Implementation of FreeBSD book.
No MFC needed as this code exists only in HEAD.
Reviewed by: kib, jeff Tested by: pho
|
255028 |
29-Aug-2013 |
alc |
Significantly reduce the cost, i.e., run time, of calls to madvise(..., MADV_DONTNEED) and madvise(..., MADV_FREE). Specifically, introduce a new pmap function, pmap_advise(), that operates on a range of virtual addresses within the specified pmap, allowing for a more efficient implementation of MADV_DONTNEED and MADV_FREE. Previously, the implementation of MADV_DONTNEED and MADV_FREE relied on per-page pmap operations, such as pmap_clear_reference(). Intuitively, the problem with this implementation is that the pmap-level locks are acquired and released and the page table traversed repeatedly, once for each resident page in the range that was specified to madvise(2). A more subtle flaw with the previous implementation is that pmap_clear_reference() would clear the reference bit on all mappings to the specified page, not just the mapping in the range specified to madvise(2).
Since our malloc(3) makes heavy use of madvise(2), this change can have a measurable impact. For example, the system time for completing a parallel "buildworld" on a 6-core amd64 machine was reduced by about 1.5% to 2.0%.
Note: This change only contains pmap_advise() implementations for a subset of our supported architectures. I will commit implementations for the remaining architectures after further testing. For now, a stub function is sufficient because of the advisory nature of pmap_advise().
Discussed with: jeff, jhb, kib Tested by: pho (i386), marcel (ia64) Sponsored by: EMC / Isilon Storage Division
|
254911 |
26-Aug-2013 |
glebius |
Remove comment that is no longer relevant since r254182.
|
254719 |
23-Aug-2013 |
alc |
Addendum to r254141: The call to vm_radix_insert() in vm_page_cache() can reclaim the last preexisting cached page in the object, resulting in a call to vdrop(). Detect this scenario so that the vnode's hold count is correctly maintained. Otherwise, we panic.
Reported by: scottl Tested by: pho Discussed with: attilio, jeff, kib
|
254667 |
22-Aug-2013 |
kib |
Revert r254501. Instead, reuse the type stability of the struct pmap which is the part of struct vmspace, allocated from UMA_ZONE_NOFREE zone. Initialize the pmap lock in the vmspace zone init function, and remove pmap lock initialization and destruction from pmap_pinit() and pmap_release().
Suggested and reviewed by: alc (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation
|
254649 |
22-Aug-2013 |
kib |
Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9). The flag was mandatory since r209792, where vm_page_grab(9) was changed to only support the alloc retry semantic.
Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation
|
254622 |
21-Aug-2013 |
jeff |
- Eliminate the vm object lock from the active queue scan. It is not necessary since we do not free or cache the page from active anymore. Document the one possible race that is harmless.
Sponsored by: EMC / Isilon Storage Division Discussed with: alc
|
254599 |
21-Aug-2013 |
alc |
Addendum to r254141: Allow recursion on the free pages queues lock in vm_page_alloc_freelist().
Reported and tested by: sbruno Sponsored by: EMC / Isilon Storage Division
|
254544 |
20-Aug-2013 |
jeff |
- Increase the active lru refresh interval to 10 minutes. This has been shown to negatively impact some workloads and the goal is only to eliminate worst case behaviors for very long periods of paging inactivity. Eventually we should determine a more complex scaling factor for this feature.
- Rate limit low memory callback handlers to limit thrashing. Set the default to 10 seconds.
Sponsored by: EMC / Isilon Storage Division
|
254543 |
19-Aug-2013 |
jeff |
- Use an arbitrary but reasonably large import size for kva on architectures that don't support superpages. This keeps the number of spans and internal fragmentation lower.
- When the user asks for alignment from vmem_xalloc, adjust the imported size by 2*align to be certain we can satisfy the allocation. This comes at the expense of potential failures when the backend can't supply enough memory but could supply the requested size and alignment.
Sponsored by: EMC / Isilon Storage Division
|
254439 |
17-Aug-2013 |
kib |
Remove the arbitrary binding of the pagedaemon threads to the domains, update the comment accordingly and make it more precise.
Requested and reviewed by: jeff (previous version)
|
254430 |
16-Aug-2013 |
jhb |
Add new mmap(2) flags to permit applications to request specific virtual address alignment of mappings.
- MAP_ALIGNED(n) requests a mapping aligned on a boundary of (1 << n). Requests for n >= number of bits in a pointer or less than the size of a page fail with EINVAL. This matches the API provided by NetBSD.
- MAP_ALIGNED_SUPER is a special case of MAP_ALIGNED. It can be used to optimize the chances of using large pages. By default it will align the mapping on a large page boundary (the system is free to choose any large page size to align to that seems best for the mapping request). However, if the object being mapped is already using large pages, then it will align the virtual mapping to match the existing large pages in the object instead.
- Internally, VMFS_ALIGNED_SPACE is now renamed to VMFS_SUPER_SPACE, and VMFS_ALIGNED_SPACE(n) is repurposed for specifying a specific alignment. MAP_ALIGNED(n) maps to using VMFS_ALIGNED_SPACE(n), while MAP_ALIGNED_SUPER maps to VMFS_SUPER_SPACE.
- mmap() of a device object now uses VMFS_OPTIMAL_SPACE rather than explicitly using VMFS_SUPER_SPACE. All device objects are forced to use a specific color on creation, so VMFS_OPTIMAL_SPACE is effectively equivalent.
Reviewed by: alc MFC after: 1 month
|
254387 |
15-Aug-2013 |
jeff |
- Fix bug in r254304. Use the ACTIVE pq count for the active list processing, not inactive. This was the result of a bad merge.
Reported by: pho Sponsored by: EMC / Isilon Storage Division
|
254362 |
15-Aug-2013 |
attilio |
On the recovery path for vm_page_alloc(), if a page was requested wired, unwind the wiring bits; otherwise we can end up freeing a page that is considered wired.
Sponsored by: EMC / Isilon storage division Reported by: alc
|
254307 |
13-Aug-2013 |
jeff |
- Add a statically allocated memguard arena since it is needed very early on.
- Pass the appropriate flags to vmem_xalloc() when allocating space for the arena from kmem_arena.
Sponsored by: EMC / Isilon Storage Division
|
254304 |
13-Aug-2013 |
jeff |
Improve pageout flow control to wakeup more frequently and do less work while maintaining better LRU of active pages.
- Change v_free_target to include the quantity previously represented by v_cache_min so we don't need to add them together everywhere we use them.
- Add a pageout_wakeup_thresh that sets the free page count trigger for waking the page daemon. Set this 10% above v_free_min so we wakeup before any phase transitions in vm users.
- Adjust down v_free_target now that we're willing to accept more pagedaemon wakeups. This means we process fewer pages in one iteration as well, leading to shorter lock hold times and less overall disruption.
- Eliminate vm_pageout_page_stats(). This was a minor variation on the PQ_ACTIVE segment of the normal pageout daemon. Instead we now process 1 / vm_pageout_update_period pages every second. This causes us to visit the whole active list every 60 seconds. Previously we would only maintain the active LRU when we were short on pages, which would mean it could be woefully out of date.
Reviewed by: alc (slight variant of this) Discussed with: alc, kib, jhb Sponsored by: EMC / Isilon Storage Division
|
254228 |
11-Aug-2013 |
attilio |
Correct the recovery logic in vm_page_alloc_contig: what is really needed in this code snippet is that all the pages that are already fully inserted get fully freed, while for the others the object removal itself might be skipped, hence the object might be set to NULL.
Sponsored by: EMC / Isilon storage division Reported by: alc, kib Reviewed by: alc
|
254182 |
10-Aug-2013 |
kib |
Different consumers of the struct vm_page abuse pageq member to keep additional information, when the page is guaranteed to not belong to a paging queue. Usually, this results in a lot of type casts which make reasoning about the code correctness harder.
Sometimes m->object is used instead of pageq, which could cause real and confusing bugs if non-NULL m->object is leaked. See r141955 and r253140 for examples.
Change the pageq member into a union containing explicitly-typed members. Use them instead of type-punning or abusing m->object in x86 pmaps, uma and vm_page_alloc_contig().
Requested and reviewed by: alc Sponsored by: The FreeBSD Foundation
|
254168 |
09-Aug-2013 |
zont |
Remove unused definition for CTL_VM_NAMES.
Suggested by: bde
|
254163 |
09-Aug-2013 |
jhb |
Revert the addition of VPO_BUSY and instead update vm_page_replace() to properly unbusy the page.
Submitted by: alc
|
254150 |
09-Aug-2013 |
obrien |
Add missing 'VPO_BUSY' from r254141 to fix kernel build break.
|
254141 |
09-Aug-2013 |
attilio |
On all architectures, avoid preallocating the physical memory for nodes used in vm_radix. On architectures supporting direct mapping, also avoid preallocating the KVA for such nodes.
In order to do so, allow the operations derived from vm_radix_insert() to fail, and handle all the resulting failures.
On the vm_radix side, introduce a new function, vm_radix_replace(), which can replace an already present leaf node with a new one, and take into account the possibility that operations on the radix trie can recurse during vm_radix_insert() allocation. This means that if operations in vm_radix_insert() recursed, vm_radix_insert() will start from scratch again.
Sponsored by: EMC / Isilon storage division Reviewed by: alc (older version) Reviewed by: jeff Tested by: pho, scottl
|
254138 |
09-Aug-2013 |
attilio |
The soft and hard busy mechanisms rely on the vm object lock to work. Unify the two concepts into a real, minimal sxlock where the shared acquisition represents the soft busy and the exclusive acquisition represents the hard busy. The old VPO_WANTED mechanism becomes the hard path for this new lock and it becomes per-page rather than per-object. The vm_object lock becomes an interlock for this functionality: it can be held in either read or write mode. However, if the vm_object lock is held in read mode while acquiring or releasing the busy state, the thread owner cannot make any assumption on the busy state unless it is also busying it.
Also:
- Add a new flag to directly shared-busy pages while vm_page_alloc() and vm_page_grab() are being executed. This will be very helpful once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag.
The KPI is heavily changed; this is why the version is bumped. It is very likely that some VM ports users will need to change their own code.
Sponsored by: EMC / Isilon storage division Discussed with: alc Reviewed by: jeff, kib Tested by: gavin, bapt (older version) Tested by: pho, scottl
|
254065 |
07-Aug-2013 |
kib |
Split the pagequeues per NUMA domain, and split the pagedaemon process into threads, each processing the queue in a single domain. The structure of the pagedaemons and queues is kept intact; most of the changes come from the need for code to find the owning page queue for a given page, calculated from the segment containing the page.
The tie between NUMA domain and pagedaemon thread/pagequeue split is rather arbitrary; the multithreaded daemon could be allowed for single-domain machines, or one domain might be split into several page domains, to further increase concurrency.
Right now, each pagedaemon thread tries to reach the global target, precalculated at the start of the pass. This is not optimal, since it could cause excessive page deactivation and freeing. The code should be changed to re-check the global page deficit state in the loop after some number of iterations.
The pagedaemons reach a quorum before starting the OOM, since one thread's inability to meet the target is normal for split queues. Only when all pagedaemons fail to produce enough reusable pages is the OOM started, by a single selected thread.
Launder is modified to take into account the segments layout with regard to the region for which cleaning is performed.
Based on the preliminary patch by jeff, sponsored by EMC / Isilon Storage Division.
Reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation
|
254025 |
07-Aug-2013 |
jeff |
Replace kernel virtual address space allocation with vmem. This provides transparent layering and better fragmentation.
- Normalize functions that allocate memory to use kmem_*
- Those that allocate address space are named kva_*
- Those that operate on maps are named kmap_*
- Implement recursive allocation handling for kmem_arena in vmem.
Reviewed by: alc Tested by: pho Sponsored by: EMC / Isilon Storage Division
|
254017 |
07-Aug-2013 |
markj |
Fill in the description fields for M_FICT_PAGES.
Reviewed by: kib MFC after: 3 days
|
253953 |
05-Aug-2013 |
attilio |
Revert r253939: We cannot busy a page before doing pagefaults. In fact, it can deadlock against the vnode lock, as it tries to vget(). Other functions currently have the opposite lock ordering, like vm_object_sync(), which acquires the vnode lock first and then sleeps on the busy mechanism.
Before this patch is reinserted we need to break this ordering.
Sponsored by: EMC / Isilon storage division Reported by: kib
|
253939 |
04-Aug-2013 |
attilio |
The page hold mechanism is fast but it has a couple of drawbacks:
- It does not let pages respect the LRU policy
- It bloats the active/inactive queues with a few pages
Try to avoid it as much as possible, with the long-term goal of removing it completely. Use the soft-busy mechanism to protect page content accesses during short-term operations (like uiomove_fromphys()).
After this change only vm_fault_quick_hold_pages() is still using the hold mechanism for page content access. There is an additional complexity there, as the quick path cannot immediately access the page object to busy the page, and the slow path cannot busy more than one page at a time (to avoid deadlocks).
Fixing that primitive would allow complete removal of the page hold mechanism.
Sponsored by: EMC / Isilon storage division Discussed with: alc Reviewed by: jeff Tested by: pho
|
253775 |
29-Jul-2013 |
zont |
Unbreak sysctl ABI changes introduced in r253662
Requested by: bde
|
253697 |
26-Jul-2013 |
jeff |
Improve page LRU quality and simplify the logic.
- Don't short-circuit aging tests for unmapped objects. This biases against unmapped file pages and transient mappings.
- Always honor PGA_REFERENCED. We can now use this after soft busying to lazily restart the LRU.
- Don't transition directly from active to cached, bypassing the inactive queue. This frees recently used data much too early.
- Rename actcount to act_delta to be more consistent with use and meaning.
Reviewed by: kib, alc Sponsored by: EMC / Isilon Storage Division
|
253662 |
26-Jul-2013 |
zont |
Remove the define and documentation for vm_pageout_algorithm, missed in r253587
|
253636 |
25-Jul-2013 |
kientzle |
Clear the entire map structure, including locks, so that the locks don't accidentally appear to have been already initialized.
In particular, this fixes a consistent kernel crash on armv6 with: panic: lock "vm map (user)" 0xc09cc050 already initialized that appeared with r251709.
PR: arm/180820
|
253604 |
24-Jul-2013 |
avg |
rename scheduler->swapper and SI_SUB_RUN_SCHEDULER->SI_SUB_LAST
Also directly call swapper() at the end of mi_startup instead of relying on swapper being the last thing in sysinits order.
Rationale:
- "RUN_SCHEDULER" was misleading, scheduling already takes place at that stage
- "scheduler" was misleading, the function swaps in the swapped out processes
- another SYSINIT(SI_SUB_RUN_SCHEDULER, SI_ORDER_ANY) could never be invoked depending on its relative order with scheduler; this was not obvious and the bug actually used to exist
Reviewed by: kib (earlier version) MFC after: 14 days
|
253591 |
24-Jul-2013 |
glebius |
Since r251709 a slab no longer uses 8-bit indices to manage items, so remove a stale comment.
Reviewed by: jeff
|
253587 |
24-Jul-2013 |
jeff |
- Remove the long obsolete 'vm_pageout_algorithm' experiment.
Discussed with: alc Sponsored by: EMC / Isilon Storage Division
|
253583 |
23-Jul-2013 |
jeff |
- Correct a stale comment. We don't have vclean() anymore. The work is done by vgonel() and destroy_vobject() should only be called once from VOP_INACTIVE().
Sponsored by: EMC / Isilon Storage Division
|
253565 |
23-Jul-2013 |
glebius |
Revert r249590 and, in case mp_ncpus isn't initialized, use MAXCPU. This allows us to init the counter zone at an early stage of boot.
Reviewed by: kib Tested by: Lytochkin Boris <lytboris gmail.com>
|
253556 |
22-Jul-2013 |
jlh |
Fix previous commit when option RACCT is not used.
MFC after: 7 days
|
253554 |
22-Jul-2013 |
jlh |
Fix a panic in the racct code when munlock(2) is called with incorrect values.
The racct code in sys_munlock() assumed that the boundaries provided by userland were correct as long as vm_map_unwire() returned successfully. However, the latter contains its own logic and sometimes manages to do something outside those boundaries, even if they are buggy. This change makes the racct code use the accounting done by the vm layer, as is done in other places such as vm_mlock().
Despite fixing the panic, Alan Cox pointed out that this code is still racy: two simultaneous callers will produce incorrect values.
Reviewed by: alc MFC after: 7 days
|
253471 |
19-Jul-2013 |
jhb |
Be more aggressive in using superpages in all mappings of objects:
- Add a new address space allocation method (VMFS_OPTIMAL_SPACE) for vm_map_find() that will try to alter the alignment of a mapping to match any existing superpage mappings of the object being mapped. If no suitable address range is found with the necessary alignment, vm_map_find() will fall back to using the simple first-fit strategy (VMFS_ANY_SPACE).
- Change mmap() without MAP_FIXED, shmat(), and the GEM mapping ioctl to use VMFS_OPTIMAL_SPACE instead of VMFS_ANY_SPACE.
Reviewed by: alc (earlier version) MFC after: 2 weeks
|
253221 |
11-Jul-2013 |
kib |
When swap pager allocates metadata in the pagedaemon context, allow it to drain the reserve. This was broken in r243040, causing deadlock. Note that VM_WAIT call in case of uma_zalloc() failure from pagedaemon would only wait for the v_pageout_free_min anyway.
Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation
|
253191 |
11-Jul-2013 |
kib |
vm_fault() should not be allowed to proceed on a map entry which is being wired now. The entry's wired count is changed to non-zero in advance, before the map lock is dropped. This makes vm_fault() perceive the entry as wired, and breaks the code fragment that moves the wire count from the shadowed page to the upper page, causing it to unwire a non-wired page.
On the other hand, the vm_fault() calls from vm_fault_wire() should be allowed to proceed, so only drain MAP_ENTRY_IN_TRANSITION from vm_fault() when wiring_thread is not current.
Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
|
253190 |
11-Jul-2013 |
kib |
mlockall() or VM_MAP_WIRE_HOLESOK does not interact properly with parallel creation of map entries, e.g. by mmap() or stack growing. It also breaks when another entry is wired in parallel.
vm_map_wire() iterates over the map entries in the region, and assumes that the map entries it finds were marked as in transition earlier, and also that any entry marked as in transition was marked by the current invocation of vm_map_wire(). This is not true for new entries in the holes.
Add the thread owner of the MAP_ENTRY_IN_TRANSITION flag to struct vm_map_entry. In vm_map_wire() and vm_map_unwire(), only process the entries whose transition owner is the current thread.
Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
|
253189 |
11-Jul-2013 |
kib |
Never remove user-wired pages from an object when doing msync(MS_INVALIDATE). vm_fault_copy_entry() requires that the object range which corresponds to the user-wired vm_map_entry is always fully populated.
Add the OBJPR_NOTWIRED flag for vm_object_page_remove() to request the preserving behaviour; use it when calling vm_object_page_remove() from vm_object_sync().
Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
|
253188 |
11-Jul-2013 |
kib |
In the vm_page_set_invalid() function, do not assert that the page is not busy, since its only caller, brelse(), can legitimately call it on a busy page. This happens for VOP_PUTPAGES() on filesystems that use buffers and whose VOP_WRITE() method marked the buffer containing the page as non-cacheable.
Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
|
253095 |
09-Jul-2013 |
kib |
Fix typo in comment.
MFC after: 3 days
|
252653 |
03-Jul-2013 |
neel |
vm_phys_fictitious_reg_range() was losing the 'memattr' because it would be reset by pmap_page_init() right after being initialized in vm_page_initfake().
The statement above is with reference to the amd64 implementation of pmap_page_init().
Fix this by calling 'pmap_page_init()' in 'vm_page_initfake()' before changing the 'memattr'.
Reviewed by: kib MFC after: 2 weeks
|
252358 |
28-Jun-2013 |
davide |
Remove a spurious keg lock acquisition.
|
252330 |
28-Jun-2013 |
jeff |
- Add a general purpose resource allocator, vmem, from NetBSD. It was originally inspired by the Solaris vmem detailed in the proceedings of USENIX 2001. The NetBSD version was heavily refactored for bugs and simplicity.
- Use this resource allocator to allocate the buffer and transient maps. Buffer cache defrags are reduced by 25% when used by filesystems with mixed block sizes. Ultimately this may permit dynamic buffer cache sizing on low KVA machines.
Discussed with: alc, kib, attilio Tested by: pho Sponsored by: EMC / Isilon Storage Division
|
252226 |
26-Jun-2013 |
jeff |
- Resolve bucket recursion issues by passing a cookie with zone flags through bucket_alloc() to uma_zalloc_arg() and uma_zfree_arg().
- Make some smaller buckets for large zones to further reduce memory waste.
- Implement uma_zone_reserve(). This holds aside a number of items only for callers who specify M_USE_RESERVE. Buckets will never be filled from reserve allocations.
Sponsored by: EMC / Isilon Storage Division
|
252161 |
24-Jun-2013 |
glebius |
Typo in comment.
|
252040 |
20-Jun-2013 |
jeff |
- Add a per-zone lock for zones without kegs.
- Be more explicit about zone vs keg locking. This functionally changes almost nothing.
- Add a size parameter to uma_zcache_create() so we can size the buckets.
- Pass the zone to bucket_alloc() so it can modify allocation flags as appropriate.
- Fix a bug in zone_alloc_bucket() where I missed an address-of operator in a failure case. (Found by pho)
Sponsored by: EMC / Isilon Storage Division
|
251983 |
19-Jun-2013 |
jeff |
- Persist the caller's flags in the bucket allocation flags so we don't lose a M_NOVM when we recurse into a bucket allocation.
Sponsored by: EMC / Isilon Storage Division
|
251901 |
18-Jun-2013 |
des |
Fix a bug that allowed a tracing process (e.g. gdb) to write to a memory-mapped file in the traced process's address space even if neither the traced process nor the tracing process had write access to that file.
Security: CVE-2013-2171 Security: FreeBSD-SA-13:06.mmap Approved by: so
|
251894 |
18-Jun-2013 |
jeff |
Refine UMA bucket allocation to reduce space consumption and improve performance.
- Always free to the alloc bucket if there is space. This gives LIFO allocation order to improve hot-cache performance. This also allows for zones with a single bucket per-cpu rather than a pair if the entire working set fits in one bucket.
- Enable per-cpu caches of buckets. To prevent recursive bucket allocation one bucket zone still has per-cpu caches disabled.
- Pick the initial bucket size based on a table driven maximum size per-bucket rather than the number of items per-page. This gives more sane initial sizes.
- Only grow the bucket size when we face contention on the zone lock; this causes bucket sizes to grow more slowly.
- Adjust the number of items per-bucket to account for the header space. This packs the buckets more efficiently per-page while making them not quite powers of two.
- Eliminate the per-zone free bucket list. Always return buckets back to the bucket zone. This ensures that as zones grow into larger bucket sizes they eventually discard the smaller sizes. It persists fewer buckets in the system. The locking is slightly trickier.
- Only switch buckets in zalloc, not zfree; this eliminates pathological cases where we ping-pong between two buckets.
- Ensure that the thread that fills a new bucket gets to allocate from it to give a better upper bound on allocation time.
Sponsored by: EMC / Isilon Storage Division
|
251826 |
17-Jun-2013 |
jeff |
- Add a new UMA API: uma_zcache_create(). This makes a zone without any backing memory that is only a container for per-cpu caches of arbitrary pointer items. These zones have no kegs.
- Convert the regular keg based allocator to use the new import/release functions.
- Move some stats to be atomics since they would require excessive zone locking/unlocking with the new import/release paradigm. Make zone_free_item simpler now that callers can manage more stats.
- Check for these cache-only zones in the public APIs and debugging code by checking zone_first_keg() against NULL.
Sponsored by: EMC / Isilon Storage Division
|
251709 |
13-Jun-2013 |
jeff |
- Convert the slab free item list from a linked array of indices to a bitmap using sys/bitset. This is much simpler, has lower space overhead and is cheaper in most cases.
- Use a second bitmap for invariants asserts and improve the quality of the asserts as well as the number of erroneous conditions that we will catch.
- Drastically simplify sizing code. Special case refcnt zones since they will be going away.
- Update stale comments.
Sponsored by: EMC / Isilon Storage Division
|
251591 |
10-Jun-2013 |
alc |
Revise the interface between vm_object_madvise() and vm_page_dontneed() so that pointless calls to pmap_is_modified() can be easily avoided when performing madvise(..., MADV_FREE).
Sponsored by: EMC / Isilon Storage Division
|
251523 |
08-Jun-2013 |
glebius |
Make sys_mlock() function just a wrapper around vm_mlock() function that does all the job.
Reviewed by: kib, jilles Sponsored by: Nginx, Inc.
|
251471 |
06-Jun-2013 |
attilio |
Complete r251452: Avoid busying/unbusying a page in cases where there is no need to drop the vm_obj lock, notably when the page is fully valid after vm_page_grab().
Sponsored by: EMC / Isilon storage division Reviewed by: alc
|
251397 |
04-Jun-2013 |
attilio |
In vm_object_split(), busy and consequently unbusy the pages only when swap_pager_copy() is invoked, otherwise there is no reason to do so. This will eliminate the necessity to busy pages most of the times.
Sponsored by: EMC / Isilon storage division Reviewed by: alc
|
251367 |
04-Jun-2013 |
alc |
Update a comment.
|
251359 |
04-Jun-2013 |
alc |
Relax the object locking in vm_pageout_map_deactivate_pages() and vm_pageout_object_deactivate_pages(). A read lock suffices.
Sponsored by: EMC / Isilon Storage Division
|
251318 |
03-Jun-2013 |
kib |
Remove irrelevant comments.
Discussed with: alc MFC after: 3 days
|
251280 |
03-Jun-2013 |
alc |
Require that the page lock is held, instead of the object lock, when clearing the page's PGA_REFERENCED flag. Since we are typically manipulating the page's act_count field when we are clearing its PGA_REFERENCED flag, the page lock is already held everywhere that we clear the PGA_REFERENCED flag. So, in fact, this revision only changes some comments and an assertion. Nonetheless, it will enable later changes to object locking in the pageout code.
Introduce vm_page_assert_locked(), which completely hides the implementation details of the page lock from the caller, and use it in vm_page_aflag_clear(). (The existing vm_page_lock_assert() could not be used in vm_page_aflag_clear().) Over the coming weeks, I expect that we'll either eliminate or replace the various uses of vm_page_lock_assert() with vm_page_assert_locked().
Reviewed by: attilio Sponsored by: EMC / Isilon Storage Division
|
251229 |
01-Jun-2013 |
alc |
Now that access to the page's "act_count" field is synchronized by the page lock instead of the object lock, there is no reason for vm_page_activate() to assert that the object is locked for either read or write access. (The "VPO_UNMANAGED" flag never changes after page allocation.)
Sponsored by: EMC / Isilon Storage Division
|
251183 |
31-May-2013 |
alc |
Simplify the definition of vm_page_lock_assert(). There is no compelling reason to inline the implementation of vm_page_lock_assert() in the !KLD_MODULES case. Use the same implementation for both KLD_MODULES and !KLD_MODULES.
Reviewed by: kib
|
251151 |
30-May-2013 |
kib |
After the object lock was dropped, the object's reference count could change. Retest the ref_count and return from the function, so as not to execute the further code which assumes that ref_count == 1, if it is not. Also, do not leak the vnode lock if another thread cleared the OBJ_TMPFS flag in the meantime.
Reported by: bdrewery Tested by: bdrewery, pho Sponsored by: The FreeBSD Foundation
|
251150 |
30-May-2013 |
kib |
Remove the capitalization in the assertion message. Print the address of the object to get useful information from optimized kernel dumps.
|
251077 |
28-May-2013 |
attilio |
o Change the locking scheme for swp_bcount. It can now be accessed with a write lock on the object containing it OR with a read lock on the object containing it along with the swhash_mtx.
o Remove some duplicate assertions for swap_pager_freespace() and swap_pager_unswapped(), but keep the object locking references for documentation.
Sponsored by: EMC / Isilon storage division Reviewed by: alc
|
250909 |
22-May-2013 |
attilio |
Acquire read lock on the src object for vm_fault_copy_entry().
Sponsored by: EMC / Isilon storage division Reviewed by: alc
|
250884 |
21-May-2013 |
attilio |
o Relax locking assertions for vm_page_find_least().
o Relax locking assertions for pmap_enter_object() and add them also to architectures that currently don't have any.
o Introduce VM_OBJECT_LOCK_DOWNGRADE(), which is basically a downgrade operation on the per-object rwlock.
o Use all the mechanisms above to make vm_map_pmap_enter() work most of the time with only read locks.
Sponsored by: EMC / Isilon storage division Reviewed by: alc
|
250849 |
21-May-2013 |
kib |
Add ddb command 'show pginfo' which provides useful information about a vm page, denoted either by an address of the struct vm_page, or, if the '/p' modifier is specified, by a physical address of the corresponding frame.
Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week
|
250748 |
17-May-2013 |
alc |
Relax the object locking in vm_fault_prefault(). A read lock suffices.
Reviewed by: attilio Sponsored by: EMC / Isilon Storage Division
|
250745 |
17-May-2013 |
alc |
Relax the object locking assertion in vm_page_lookup(). Now that a radix tree is used to maintain the object's collection of resident pages, vm_page_lookup() no longer needs an exclusive lock.
Reviewed by: attilio Sponsored by: EMC / Isilon Storage Division
|
250601 |
13-May-2013 |
attilio |
o Add accessor functions to add and remove pages from a specific freelist.
o Split the pool of free page queues really by domain and do not rely on the definition of VM_RAW_NFREELIST.
o For MAXMEMDOM > 1, wrap the RR allocation logic into a specific function that is called when calculating the allocation domain. The RR counter is kept, currently, per-thread. In the future it is expected that such a function evolves into a real policy decision referee, based on specific information retrieved from per-thread and per-vm_object attributes.
o Add the concept of "probed domains" in the form of vm_ndomains. It is the responsibility of every architecture willing to support multiple memory domains to correctly probe vm_ndomains along with the mem_affinity segments attributes. Those two values are supposed to remain always consistent. Please also note that vm_ndomains and td_dom_rr_idx are both int because segments already store domains as int. Ideally u_int would make much more sense. Probably this should be cleaned up in the future.
o Apply RR domain selection also to vm_phys_zero_pages_idle().
Sponsored by: EMC / Isilon storage division Partly obtained from: jeff Reviewed by: alc Tested by: jeff
|
250594 |
13-May-2013 |
peter |
Bandaid for compiling with gcc, which happens to be the default compiler for a number of platforms still.
|
250577 |
12-May-2013 |
alc |
Refactor vm_page_alloc()'s interactions with vm_reserv_alloc_page() and vm_page_insert() so that (1) vm_radix_lookup_le() is never called while the free page queues lock is held and (2) vm_radix_lookup_le() is called at most once. This change reduces the average time that the free page queues lock is held by vm_page_alloc() as well as vm_page_alloc()'s average overall running time.
Sponsored by: EMC / Isilon Storage Division
|
250520 |
11-May-2013 |
alc |
To reduce the amount of arithmetic performed in the various radix tree functions, reverse the numbering scheme for the levels. The highest numbered level in the tree now appears near the root instead of the leaves.
Sponsored by: EMC / Isilon Storage Division
|
250361 |
08-May-2013 |
attilio |
Fix up r250338 by completing the removal of VM_NDOMAIN in favor of MAXMEMDOM. This unbreaks the build.
Sponsored by: EMC / Isilon storage division Reported by: adrian, jeli
|
250338 |
07-May-2013 |
attilio |
Rename VM_NDOMAIN into MAXMEMDOM and move it into machine/param.h in order to match the MAXCPU concept. The change should also be useful for consolidation and consistency.
Sponsored by: EMC / Isilon storage division Obtained from: jeff Reviewed by: alc
|
250334 |
07-May-2013 |
alc |
Remove a redundant call to panic() from vm_radix_keydiff(). The assertion before the loop accomplishes the same thing.
Sponsored by: EMC / Isilon Storage Division
|
250259 |
04-May-2013 |
alc |
Optimize vm_radix_lookup_ge() and vm_radix_lookup_le(). Specifically, change the way that these functions ascend the tree when the search for a matching leaf fails at an interior node. Rather than returning to the root of the tree and repeating the lookup with an updated key, maintain a stack of interior nodes that were visited during the descent and use that stack to resume the lookup at the closest ancestor that might have a matching descendant.
Sponsored by: EMC / Isilon Storage Division Reviewed by: attilio Tested by: pho
|
250219 |
03-May-2013 |
jhb |
Fix two bugs in the current NUMA-aware allocation code:
- vm_phys_alloc_freelist_pages() can be called by vm_page_alloc_freelist() to allocate a page from a specific freelist. In the NUMA case it did not properly map the public VM_FREELIST_* constants to the correct backing freelists, nor did it try all NUMA domains for allocations from VM_FREELIST_DEFAULT.
- vm_phys_alloc_pages() did not pin the thread, and each call to vm_phys_alloc_freelist_pages() fetched the current domain to choose which freelist to use. If a thread migrated domains during the loop in vm_phys_alloc_pages() it could skip one of the freelists. If the other freelists were out of memory then it is possible that vm_phys_alloc_pages() would fail to allocate a page even though pages were available, resulting in a panic in vm_page_alloc().
Reviewed by: alc MFC after: 1 week
|
250187 |
02-May-2013 |
kib |
Add a hint suggesting why tmpfs does not need a special case there.
|
250030 |
28-Apr-2013 |
kib |
Rework the handling of the tmpfs node backing swap object and tmpfs vnode v_object to avoid double-buffering. Use the same object both as the backing store for tmpfs node and as the v_object.
Besides reducing memory use by up to 2x for the situation of mapping files from tmpfs, it also makes tmpfs read and write operations copy half as many bytes.
The VM subsystem was already slightly adapted to tolerate an OBJT_SWAP object as v_object. Now vm_object_deallocate() is modified to not reinstantiate the OBJ_ONEMAPPING flag and to help the VFS correctly handle the VV_TEXT flag on the last dereference of the tmpfs backing object.
Reviewed by: alc Tested by: pho, bf MFC after: 1 month
|
250029 |
28-Apr-2013 |
kib |
Make vm_object_page_clean() and vm_mmap_vnode() tolerate a vnode's v_object of non-OBJT_VNODE type.
For vm_object_page_clean(), simply do not assert that the object type must be OBJT_VNODE, and add a comment explaining how the check for OBJ_MIGHTBEDIRTY prevents the rest of the function from operating on such objects.
For vm_mmap_vnode(), if the object type is not OBJT_VNODE, require it to be for the swap pager (or default), handle the bypass filesystems, and correctly acquire the object reference in this case.
Reviewed by: alc Tested by: pho, bf MFC after: 1 week
|
250028 |
28-Apr-2013 |
kib |
Assert that the object type for the vnode' non-NULL v_object, passed to vnode_pager_setsize(), is either OBJT_VNODE, or, if vnode was already reclaimed, OBJT_DEAD. Note that the later is only possible due to some filesystems, in particular, nfsiods from nfs clients, call vnode_pager_setsize() with unlocked vnode.
Moreover, if the object is terminated, do not perform the resizing operation.
Reviewed by: alc Tested by: pho, bf MFC after: 1 week
|
250026 |
28-Apr-2013 |
kib |
Convert panic() into KASSERT().
Reviewed by: alc MFC after: 1 week
|
250018 |
28-Apr-2013 |
alc |
Eliminate an unneeded call to vm_radix_trimkey() from vm_radix_lookup_le(). This call is clearing bits from the key that will be set again by the next line.
Sponsored by: EMC / Isilon Storage Division
|
249986 |
27-Apr-2013 |
alc |
Avoid some lookup restarts in vm_radix_lookup_{ge,le}().
Sponsored by: EMC / Isilon Storage Division
|
249763 |
22-Apr-2013 |
glebius |
Panic if a UMA_ZONE_PCPU zone is created at the early stages of boot, when mp_ncpus isn't yet initialized. Otherwise we would panic later at the first allocation.
Sponsored by: Nginx, Inc.
|
249745 |
22-Apr-2013 |
alc |
Simplify vm_radix_{add,dec}lev().
Sponsored by: EMC / Isilon Storage Division
|
249605 |
18-Apr-2013 |
alc |
When calculating the number of reserved nodes, discount the pages that will be used to store the nodes.
Sponsored by: EMC / Isilon Storage Division
|
249502 |
15-Apr-2013 |
alc |
Although we perform path compression to reduce the height of the trie and the number of interior nodes, we have previously created a level zero interior node at the root of every non-empty trie, even when that node is not strictly necessary, i.e., it has only one child. This change is the second (and final) step in eliminating those unnecessary level zero interior nodes. Specifically, it updates the deletion and insertion functions so that they do not require a level zero interior node at the root of the trie. For a "buildworld" workload, this change results in a 16.8% reduction in the number of interior nodes allocated and a similar reduction in the average execution time for lookup functions. For example, the average execution time for a call to vm_radix_lookup_ge() is reduced by 22.9%.
Reviewed by: attilio, jeff (an earlier version) Sponsored by: EMC / Isilon Storage Division
|
249427 |
12-Apr-2013 |
alc |
Although we perform path compression to reduce the height of the trie and the number of interior nodes, we always create a level zero interior node at the root of every non-empty trie, even when that node is not strictly necessary, i.e., it has only one child. This change is the first step in eliminating those unnecessary level zero interior nodes. Specifically, it updates all of the lookup functions so that they do not require a level zero interior node at the root.
Reviewed by: attilio, jeff (an earlier version) Sponsored by: EMC / Isilon Storage Division
|
249313 |
09-Apr-2013 |
glebius |
Convert UMA code to C99 uintXX_t types.
|
249312 |
09-Apr-2013 |
glebius |
Swap us_freecount and us_flags, achieving same structure size as before previous commit.
Submitted by: alc
|
249309 |
09-Apr-2013 |
glebius |
Since now we support 256 items per slab, we need more bits for us_freecount.
This grows uma_slab_head on 32-bit arches, but the growth isn't significant. Taking the kmem zones as an example, only the 32-byte zone is affected: ipers is reduced from 113 to 112.
In collaboration with: kib
|
249305 |
09-Apr-2013 |
glebius |
Fix KASSERTs: maximum number of items per slab is 256.
|
249303 |
09-Apr-2013 |
kib |
Fix the assertions for the state of the object under the map entry with the MAP_ENTRY_VN_WRITECNT flag: - Move the assertion that verifies the state of the v_writecount and vnp.writecount, under the block where the object is locked. - Check that the object type is OBJT_VNODE before asserting.
Reported by: avg Reviewed by: alc MFC after: 1 week
|
249278 |
08-Apr-2013 |
attilio |
The per-page act_count can very easily be protected by the per-page lock rather than the vm_object lock, without any further overhead. Make the formal switch.
Sponsored by: EMC / Isilon storage division Reviewed by: alc Tested by: pho
|
249264 |
08-Apr-2013 |
glebius |
Merge from projects/counters: UMA_ZONE_PCPU zones.
These zones have slab size == sizeof(struct pcpu), but request enough pages from the VM to fit (uk_slabsize * mp_ncpus). An item allocated from such a zone has a separate twin for each CPU in the system, and these twins are spaced sizeof(struct pcpu) apart. This magic distance value will allow us to make some optimizations later.
To address the private item for a CPU, simple arithmetic is used:
item = (type *)((char *)base + sizeof(struct pcpu) * curcpu)
This arithmetic is available as the zpcpu_get() macro in pcpu.h.
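The twin-addressing arithmetic above can be sketched in userspace. Here PCPU_STRIDE stands in for sizeof(struct pcpu) and NCPUS for mp_ncpus; both values, and the helper name, are illustrative rather than the kernel's:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define PCPU_STRIDE 4096	/* stands in for sizeof(struct pcpu) */
#define NCPUS 4			/* stands in for mp_ncpus */

/* Equivalent of the zpcpu_get() arithmetic: a CPU's private twin of an
 * item lives at a fixed stride from the base allocation. */
static void *
pcpu_twin(void *base, int cpu)
{
	return ((char *)base + (size_t)PCPU_STRIDE * cpu);
}
```

CPU 0's twin coincides with the base item, and each subsequent CPU's twin is exactly one stride further along, which is the layout the slab-sizing change arranges.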
To introduce non-page-size slabs, a new field, uk_slabsize, was added to uma_keg. This shifted some frequently used fields of uma_keg to the fourth cache line on amd64. To mitigate this pessimization, the uma_keg fields were rearranged a bit and the least frequently used, uk_name and uk_link, were moved down to the fourth cache line. All other frequently dereferenced fields fit into the first three cache lines.
Sponsored by: Nginx, Inc.
|
249221 |
07-Apr-2013 |
alc |
Micro-optimize the order of struct vm_radix_node's fields. Specifically, arrange for all of the fields to start at a short offset from the beginning of the structure.
Eliminate unnecessary masking of VM_RADIX_FLAGS from the root pointer in vm_radix_getroot().
Sponsored by: EMC / Isilon Storage Division
|
249218 |
06-Apr-2013 |
jeff |
Prepare to replace the buf splay with a trie:
- Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists. No consumers need to find them there and it complicates the tree. These flags are all FFS-specific and could be moved out of the buf cache. - Use pbgetvp() and pbrelvp() to associate the background and journal bufs with the vp. Not only is this much cheaper, it makes more sense for these transient bufs. - Fix the assertions in pbget* and pbrel*. It's not safe to check list pointers which were never initialized. Use the BX flags instead. We also check B_PAGING in reassignbuf(), so this should cover all cases.
Discussed with: kib, mckusick, attilio Sponsored by: EMC / Isilon Storage Division
|
249211 |
06-Apr-2013 |
alc |
Simplify vm_radix_keybarr().
Sponsored by: EMC / Isilon Storage Division
|
249182 |
06-Apr-2013 |
alc |
Simplify vm_radix_insert().
Reviewed by: attilio Tested by: pho Sponsored by: EMC / Isilon Storage Division
|
249038 |
03-Apr-2013 |
alc |
Replace the remaining uses of vm_radix_node_page() by vm_radix_isleaf() and vm_radix_topage(). This transformation eliminates some unnecessary conditional branches from the inner loops of vm_radix_insert(), vm_radix_lookup{,_ge,_le}(), and vm_radix_remove().
Simplify the control flow of vm_radix_lookup_{ge,le}().
Reviewed by: attilio (an earlier version) Tested by: pho Sponsored by: EMC / Isilon Storage Division
|
248815 |
28-Mar-2013 |
kib |
Release the v_writecount reference on the vnode in case of error, before the vnode is vput() in vm_mmap_vnode(). An error return means that there is no use reference on the vnode from the vm object reference, and failing to restore v_writecount breaks the invariant that v_writecount is less than or equal to the usecount.
The situation was observed when an NFS client returned ESTALE for VOP_GETATTR() after the open.
In collaboration with: pho MFC after: 1 week
|
248728 |
26-Mar-2013 |
alc |
Introduce vm_radix_isleaf() and use it in a couple places. As compared to using vm_radix_node_page() == NULL, the compiler is able to generate one less conditional branch when vm_radix_isleaf() is used. More use cases involving the inner loops of vm_radix_insert(), vm_radix_lookup{,_ge,_le}(), and vm_radix_remove() will follow.
Reviewed by: attilio Sponsored by: EMC / Isilon Storage Division
|
248684 |
24-Mar-2013 |
alc |
Micro-optimize the control flow in a few places. Eliminate a panic call that could never be reached in vm_radix_insert(). (If the pointer being checked by the panic call were ever NULL, the immediately preceding loop would have already crashed on a NULL pointer dereference.)
Reviewed by: attilio (an earlier version) Sponsored by: EMC / Isilon Storage Division
|
248569 |
21-Mar-2013 |
kib |
Only size and create the bio_transient_map when unmapped buffers are enabled. Now, disabling the unmapped buffers should result in the kernel memory map identical to pre-r248550.
Sponsored by: The FreeBSD Foundation
|
248550 |
20-Mar-2013 |
kib |
Fix the logic inversion in the r248512.
Noted by: mckay
|
248514 |
19-Mar-2013 |
kib |
Do not map the swap i/o pbufs if the geom provider for the swap partition accepts unmapped requests.
Sponsored by: The FreeBSD Foundation Tested by: pho
|
248512 |
19-Mar-2013 |
kib |
Pass unmapped buffers for page in requests if the filesystem indicated support for the unmapped i/o.
Sponsored by: The FreeBSD Foundation Tested by: pho
|
248508 |
19-Mar-2013 |
kib |
Implement the concept of unmapped VMIO buffers, i.e., buffers which do not map their b_pages pages into buffer_map KVA. The use of unmapped buffers eliminates the need to perform TLB shootdowns for mappings on buffer creation and reuse, greatly reducing the number of IPIs for shootdown on big-SMP machines and eliminating up to 25-30% of the system time on i/o intensive workloads.
An unmapped buffer must be explicitly requested by the consumer with the GB_UNMAPPED flag. For an unmapped buffer, no KVA reservation is performed at all. With the GB_KVAALLOC flag, the consumer might request an unmapped buffer that does have a KVA reservation, in order to map it manually without recursing into the buffer cache and blocking.
When a mapped buffer is requested and an unmapped buffer already exists, the cache performs an upgrade, possibly reusing the KVA reservation.
An unmapped buffer is translated into an unmapped bio in g_vfs_strategy(). An unmapped bio carries a pointer to the vm_page_t array, an offset, and a length instead of a data pointer. The provider that processes the bio must explicitly declare its readiness to accept unmapped bios; otherwise the g_down geom thread performs a transient upgrade of the bio request by mapping the pages into the new bio_transient_map KVA submap.
The bio_transient_map submap claims up to 10% of the buffer map, so the total buffer_map + bio_transient_map KVA usage stays the same. Still, it can be tuned manually via the kern.bio_transient_maxcnt tunable, in units of transient mappings. Eventually, bio_transient_map could be removed once all geom classes and drivers can accept unmapped i/o requests.
Unmapped support can be turned off with the vfs.unmapped_buf_allowed tunable; disabling it makes buffer (or cluster) creation requests ignore the GB_UNMAPPED and GB_KVAALLOC flags. Unmapped buffers are only enabled by default on architectures where pmap_copy_page() was implemented and tested.
In the rework, filesystem metadata is no longer subject to the maxbufspace limit. Since the metadata buffers are always mapped, the buffers still have to fit into the buffer map, which provides a reasonable (but practically unreachable) upper bound on it. The non-metadata buffer allocations, both mapped and unmapped, are accounted against maxbufspace, as before. Effectively, this means that maxbufspace is enforced on mapped and unmapped buffers separately. The pre-patch bufspace limiting code did not work, because buffer_map fragmentation does not allow the limit to be reached.
At Jeff Roberson's request, the getnewbuf() function was split into smaller single-purpose functions.
Sponsored by: The FreeBSD Foundation Discussed with: jeff (previous version) Tested by: pho, scottl (previous version), jhb, bf MFC after: 2 weeks
|
248449 |
18-Mar-2013 |
attilio |
Sync back vmcontention branch into HEAD: Replace the per-object resident and cached pages splay tree with a path-compressed multi-digit radix trie. Along with this, switch also the x86-specific handling of idle page tables to using the radix trie.
This change is supposed to do the following: - Allow the acquisition of read locks for lookup operations on the resident/cached page collections, as the per-vm_page_t splay iterators are now removed. - Increase the scalability of operations on the page collections.
The radix trie does rely on the consumers' locking to ensure the atomicity of its operations. In order to avoid deadlocks, the bisection nodes are pre-allocated in the UMA zone. This can be done safely because the algorithm needs at most one new node per insert, which means the maximum number of nodes needed is bounded by the number of available physical frames. However, a new bisection node is not always needed.
The radix trie implements path compression because UFS indirect blocks can lead to several objects with a very sparse trie, which would increase the number of levels to scan. It also helps node pre-fetching by introducing the single-node-per-insert property.
This code is not generalized (yet) because of the possible loss of performance from making many of the sizes in play configurable. However, efforts to make this code more general, and thus reusable by other consumers, may follow.
The only KPI change is the removal of the function vm_page_splay() which is now reaped. The only KBI change, instead, is the removal of the left/right iterators from struct vm_page, which are now reaped.
Further technical notes, broken into smaller pieces, can be retrieved from the svn branch: http://svn.freebsd.org/base/user/attilio/vmcontention/
Sponsored by: EMC / Isilon storage division In collaboration with: alc, jeff Tested by: flo, pho, jhb, davide Tested by: ian (arm) Tested by: andreast (powerpc)
|
248283 |
14-Mar-2013 |
kib |
Some style fixes.
Sponsored by: The FreeBSD Foundation
|
248280 |
14-Mar-2013 |
kib |
Add pmap function pmap_copy_pages(), which copies the content of the pages around, taking array of vm_page_t both for source and destination. Starting offsets and total transfer size are specified.
The function implements an optimal algorithm for copying, using platform-specific optimizations. For instance, on architectures where the direct map is available, no transient mappings are created; for i386, the per-cpu ephemeral page frame is used. The code was typically borrowed from pmap_copy_page() for the same architecture.
Only the i386/amd64, powerpc AIM, and arm/arm-v6 implementations were tested at the time of commit. High-level code, not yet committed to the tree, ensures that use of the function is only allowed after explicit enablement.
For sparc64, the existing code has known issues, and a stub is added instead, to allow the kernel to link.
Sponsored by: The FreeBSD Foundation Tested by: pho (i386, amd64), scottl (amd64), ian (arm and arm-v6) MFC after: 2 weeks
|
248277 |
14-Mar-2013 |
kib |
Remove excessive and inconsistent initializers for the various kernel maps and submaps.
MFC after: 2 weeks
|
248197 |
12-Mar-2013 |
attilio |
Simplify vm_page_is_valid().
Sponsored by: EMC / Isilon storage division Reviewed by: alc
|
248117 |
09-Mar-2013 |
alc |
Update a comment: The object lock is no longer a mutex.
|
248084 |
09-Mar-2013 |
attilio |
Switch the vm_object mutex to be a rwlock. This will enable further optimizations in the future, where the vm_object lock will be held in read mode most of the time that the page cache's resident pool of pages is accessed for reading purposes.
The change is mostly mechanical, but a few notes are worth reporting: * The KPI changes as follows: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (which forced consumers to include sys/mutex.h directly to serve its inline functions using VM_OBJECT_LOCK()) means that all vm/vm_pager.h consumers must now also include sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer, because the name clash between the FreeBSD and Solaris versions must be avoided. For this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs.
The KPI is heavily broken by this commit. Third-party ports must be updated accordingly (I can think off-hand of VirtualBox, for example).
Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
|
248082 |
09-Mar-2013 |
attilio |
Merge from vmc-playground: Introduce a new KPI that verifies if the page cache is empty for a specified vm_object. This KPI does not make assumptions about the locking in order to be used also for building assertions at init and destroy time. It is mostly used to hide implementation details of the page cache.
Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: alc (vm_radix based version) Tested by: flo, pho, jhb, davide
|
248032 |
08-Mar-2013 |
andre |
Move the callout subsystem initialization to its own SYSINIT() from being indirectly called via cpu_startup()+vm_ksubmap_init(). The boot order position remains the same at SI_SUB_CPU.
Allocation of the callout array is changed to standard kernel malloc from a slightly obscure direct kernel_map allocation.
kern_timeout_callwheel_alloc() is renamed to callout_callwheel_init() to better describe its purpose. kern_timeout_callwheel_init() is removed simplifying the per-cpu initialization.
Reviewed by: davide
|
247788 |
04-Mar-2013 |
attilio |
Merge from vmcontention: As vm objects are type-stable, there is no need to initialize the resident splay tree pointer and the cache splay tree pointer in _vm_object_allocate(); this can instead be done in the UMA zone init handler.
The UMA zone destructor handler will further check that the condition holds at every destruction and catch bugs.
Sponsored by: EMC / Isilon storage division Submitted by: alc
|
247659 |
02-Mar-2013 |
alc |
The value held by the vm object's field pg_color is only considered valid if the flag OBJ_COLORED is set. Since _vm_object_allocate() doesn't set this flag, it needn't initialize pg_color.
Sponsored by: EMC / Isilon Storage Division
|
247602 |
02-Mar-2013 |
pjd |
Merge Capsicum overhaul:
- Capability is no longer separate descriptor type. Now every descriptor has set of its own capability rights.
- The cap_new(2) system call is left, but it is no longer documented and should not be used in new code.
- The new syscall cap_rights_limit(2) should be used instead of cap_new(2), which limits capability rights of the given descriptor without creating a new one.
- The cap_getrights(2) syscall is renamed to cap_rights_get(2).
- If the CAP_IOCTL capability right is present, we can further reduce the allowed ioctls list with the new cap_ioctls_limit(2) syscall. The list of allowed ioctls can be retrieved with the cap_ioctls_get(2) syscall.
- If the CAP_FCNTL capability right is present, we can further reduce the fcntls that can be used with the new cap_fcntls_limit(2) syscall and retrieve them with cap_fcntls_get(2).
- To support ioctl and fcntl white-listing, the filedesc structure was heavily modified.
- The audit subsystem, kdump and procstat tools were updated to recognize new syscalls.
- Capability rights were revised, and even though I tried hard to provide backward API and ABI compatibility, there are some incompatible changes that are described in detail below:
CAP_CREATE old behaviour: - Allow for openat(2)+O_CREAT. - Allow for linkat(2). - Allow for symlinkat(2). CAP_CREATE new behaviour: - Allow for openat(2)+O_CREAT.
Added CAP_LINKAT: - Allow for linkat(2). ABI: Reuses CAP_RMDIR bit. - Allow to be target for renameat(2).
Added CAP_SYMLINKAT: - Allow for symlinkat(2).
Removed CAP_DELETE. Old behaviour: - Allow for unlinkat(2) when removing non-directory object. - Allow to be source for renameat(2).
Removed CAP_RMDIR. Old behaviour: - Allow for unlinkat(2) when removing directory.
Added CAP_RENAMEAT: - Required for source directory for the renameat(2) syscall.
Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR): - Allow for unlinkat(2) on any object. - Required if target of renameat(2) exists and will be removed by this call.
Removed CAP_MAPEXEC.
CAP_MMAP old behaviour: - Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and PROT_WRITE. CAP_MMAP new behaviour: - Allow for mmap(2)+PROT_NONE.
Added CAP_MMAP_R: - Allow for mmap(PROT_READ). Added CAP_MMAP_W: - Allow for mmap(PROT_WRITE). Added CAP_MMAP_X: - Allow for mmap(PROT_EXEC). Added CAP_MMAP_RW: - Allow for mmap(PROT_READ | PROT_WRITE). Added CAP_MMAP_RX: - Allow for mmap(PROT_READ | PROT_EXEC). Added CAP_MMAP_WX: - Allow for mmap(PROT_WRITE | PROT_EXEC). Added CAP_MMAP_RWX: - Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).
Renamed CAP_MKDIR to CAP_MKDIRAT. Renamed CAP_MKFIFO to CAP_MKFIFOAT. Renamed CAP_MKNODE to CAP_MKNODEAT.
CAP_READ old behaviour: - Allow pread(2). - Disallow read(2), readv(2) (if there is no CAP_SEEK). CAP_READ new behaviour: - Allow read(2), readv(2). - Disallow pread(2) (CAP_SEEK was also required).
CAP_WRITE old behaviour: - Allow pwrite(2). - Disallow write(2), writev(2) (if there is no CAP_SEEK). CAP_WRITE new behaviour: - Allow write(2), writev(2). - Disallow pwrite(2) (CAP_SEEK was also required).
Added convenient defines:
#define CAP_PREAD (CAP_SEEK | CAP_READ) #define CAP_PWRITE (CAP_SEEK | CAP_WRITE) #define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ) #define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE) #define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL) #define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W) #define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X) #define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X) #define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X) #define CAP_RECV CAP_READ #define CAP_SEND CAP_WRITE
#define CAP_SOCK_CLIENT \ (CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \ CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN) #define CAP_SOCK_SERVER \ (CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \ CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \ CAP_SETSOCKOPT | CAP_SHUTDOWN)
Added defines for backward API compatibility:
#define CAP_MAPEXEC CAP_MMAP_X #define CAP_DELETE CAP_UNLINKAT #define CAP_MKDIR CAP_MKDIRAT #define CAP_RMDIR CAP_UNLINKAT #define CAP_MKFIFO CAP_MKFIFOAT #define CAP_MKNOD CAP_MKNODAT #define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)
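The composite defines above are plain bitwise unions of rights, and a descriptor is allowed an operation only when its rights mask contains every bit the operation needs. A small sketch of that containment check; the CAP_* bit values here are illustrative placeholders, not FreeBSD's actual encoding, and only the compositions mirror the commit:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t cap_rights;

/* Illustrative bit assignments, not the real ABI values. */
#define CAP_READ   0x1ULL
#define CAP_WRITE  0x2ULL
#define CAP_SEEK   0x4ULL
#define CAP_MMAP   0x8ULL

/* Compositions as described in the commit message. */
#define CAP_PREAD  (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)

/* A descriptor may perform an operation iff its rights contain all the
 * bits the operation needs. */
static int
cap_allows(cap_rights have, cap_rights need)
{
	return ((have & need) == need);
}
```

Under this model, a descriptor limited to CAP_PREAD can read(2) (CAP_READ is contained), but a descriptor with only CAP_READ cannot pread(2), matching the new CAP_READ/CAP_PREAD semantics described above.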
Sponsored by: The FreeBSD Foundation Reviewed by: Christoph Mallon <christoph.mallon@gmx.de> Many aspects discussed with: rwatson, benl, jonathan ABI compatibility discussed with: kib
|
247400 |
27-Feb-2013 |
attilio |
Merge from vmobj-rwlock: VM_OBJECT_LOCKED() macro is only used to implement a custom version of lock assertions right now (which likely spread out thanks to copy and paste). Remove it and implement actual assertions.
Sponsored by: EMC / Isilon storage division Reviewed by: alc Tested by: pho
|
247360 |
26-Feb-2013 |
attilio |
Merge from vmc-playground branch: Replace the sub-optimal uma_zone_set_obj() primitive with the more modern uma_zone_reserve_kva(). The new primitive reserves beforehand the necessary KVA space to serve the zone allocations and allocates pages with ALLOC_NOOBJ. More specifically: - uma_zone_reserve_kva() does not need an object to serve the backend allocator. - uma_zone_reserve_kva() can serve M_WAITOK requests, in order to support zones which need to do uma_prealloc() too. - When possible, uma_zone_reserve_kva() uses the direct mapping via uma_small_alloc() rather than relying on the KVA / offset combination.
The removal of the object attribute allows 2 further changes: 1) _vm_object_allocate() becomes static within vm_object.c 2) VM_OBJECT_LOCK_INIT() is removed. This function is replaced by direct calls to mtx_init() as there is no need to export it anymore and the calls aren't either homogeneous anymore: there are now small differences between arguments passed to mtx_init().
Sponsored by: EMC / Isilon storage division Reviewed by: alc (which also offered almost all the comments) Tested by: pho, jhb, davide
|
247346 |
26-Feb-2013 |
attilio |
Remove white spaces.
Sponsored by: EMC / Isilon storage division
|
247323 |
26-Feb-2013 |
attilio |
Wrap the sleeps synchronized by the vm_object lock in the specific macro VM_OBJECT_SLEEP(). This hides some implementation details, like the usage of the msleep() primitive and the necessity of accessing the lock address directly. For this reason the VM_OBJECT_MTX() macro is now retired.
Sponsored by: EMC / Isilon storage division Reviewed by: alc Tested by: pho
|
246926 |
18-Feb-2013 |
alc |
On arm, like sparc64, the end of the kernel map varies from one type of machine to another. Therefore, VM_MAX_KERNEL_ADDRESS can't be a constant. Instead, #define it to be a variable, vm_max_kernel_address, just like we do on sparc64.
Reviewed by: kib Tested by: ian
|
246805 |
14-Feb-2013 |
jhb |
Make VM_NDOMAIN a kernel option so that it can be enabled from a kernel config file.
Requested by: phk (ages ago) MFC after: 1 month
|
246316 |
04-Feb-2013 |
marius |
Try to improve r242655 take III: move these SYSCTLs describing the kernel map, which is defined and initialized in vm/vm_kern.c, to the latter.
Submitted by: alc
|
246087 |
29-Jan-2013 |
glebius |
Fix typo in debug printf.
|
246032 |
28-Jan-2013 |
zont |
- Add system wide page faults requiring I/O counter.
Reviewed by: alc MFC after: 2 weeks
|
246030 |
28-Jan-2013 |
zont |
- Add sysctls to show number of stats scans.
MFC after: 2 weeks
|
246029 |
28-Jan-2013 |
zont |
- Style.
MFC after: 2 weeks
|
245421 |
14-Jan-2013 |
zont |
- Get rid of unused function vmspace_wired_count().
Reviewed by: alc Approved by: kib (mentor) MFC after: 1 week
|
245296 |
11-Jan-2013 |
zont |
- Improve readability of sys_obreak().
Suggested by: alc Reviewed by: alc Approved by: kib (mentor) MFC after: 1 week
|
245255 |
10-Jan-2013 |
zont |
- Reduce kernel size by removing unnecessary pointer indirections.
The GENERIC kernel size is reduced by 16 bytes, and the RACCT kernel by 336 bytes.
Suggested by: alc Reviewed by: alc Approved by: kib (mentor) MFC after: 1 week
|
245226 |
09-Jan-2013 |
ken |
Fix a bug in the device pager code that can trigger an assertion in devfs if a particular race condition is hit in the device pager code.
This was a side effect of change 227530 which changed the device pager interface to call a new destructor routine for the cdev. That destructor routine, old_dev_pager_dtor(), takes a VM object handle.
The object handle is cast to a struct cdev *, and passed into dev_rel().
That works in most cases, except the case in cdev_pager_allocate() where there is a race condition between two threads allocating an object backed by the same device. The loser of the race deallocates its object at the end of the function.
The problem is that before inserting the object into the dev_pager_object_list, the object's handle is changed from the struct cdev pointer to the object's own address. This is to avoid conflicts with the winner of the race, which already inserted an object in the list with a handle that is a pointer to the same cdev structure.
The object is then passed to vm_object_deallocate(), and eventually makes its way down to old_dev_pager_dtor(). That function passes the handle pointer (which is actually a VM object, not a struct cdev as usual) into dev_rel(). dev_rel() decrements the reference count in the assumed struct cdev (which happens to be 0), and that triggers the assertion in dev_rel() that the reference count is greater than or equal to 0.
The fix is to add a cdev pointer to the VM object, and use that pointer when calling the cdev_pg_dtor() routine.
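The shape of this fix can be sketched in userspace: the destructor dereferences a dedicated cdev pointer instead of trusting the opaque handle, which the race loser may have rewritten to point at the object itself. All types and names below are illustrative stand-ins for the kernel's:

```c
#include <assert.h>

struct cdev {
	int si_refcount;
};

struct vm_object_sketch {
	void *handle;		/* may be the cdev OR the object itself */
	struct cdev *un_cdev;	/* always the real cdev */
};

/* The destructor path uses the dedicated pointer, so it stays correct
 * even when the race loser rewrote 'handle' to point at the object. */
static void
dev_pager_dealloc(struct vm_object_sketch *obj)
{
	obj->un_cdev->si_refcount--;
}
```

The pre-fix bug corresponds to casting `obj->handle` to a `struct cdev *` here: for the race loser, that handle is the object itself, so the refcount decrement scribbles on the wrong memory and trips the dev_rel() assertion.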
vm_object.h: Add a struct cdev pointer to the VM object structure.
device_pager.c: In cdev_pager_allocate(), populate the new cdev pointer.
In dev_pager_dealloc(), use the new cdev pointer when calling the object's cdev_pg_dtor() routine.
Reviewed by: kib Sponsored by: Spectra Logic Corporation MFC after: 1 week
|
244532 |
21-Dec-2012 |
glebius |
Comment fix: there is no ub_ptr; instead, explain the meaning of the uz_count field verbally.
|
244384 |
18-Dec-2012 |
zont |
- Fix locked memory accounting for maps with MAP_WIREFUTURE flag. - Add sysctl vm.old_mlock which may turn such accounting off.
Reviewed by: avg, trasz Approved by: kib (mentor) MFC after: 1 week
|
244043 |
09-Dec-2012 |
alc |
In the past four years, we've added two new vm object types. Each time, similar changes had to be made in various places throughout the machine- independent virtual memory layer to support the new vm object type. However, in most of these places, it's actually not the type of the vm object that matters to us but instead certain attributes of its pages. For example, OBJT_DEVICE, OBJT_MGTDEVICE, and OBJT_SG objects contain fictitious pages. In other words, in most of these places, we were testing the vm object's type to determine if it contained fictitious (or unmanaged) pages.
To both simplify the code in these places and make the addition of future vm object types easier, this change introduces two new vm object flags that describe attributes of the vm object's pages, specifically, whether they are fictitious or unmanaged.
Reviewed and tested by: kib
|
244024 |
08-Dec-2012 |
pjd |
White-space cleanups.
|
243998 |
07-Dec-2012 |
pjd |
Implemented uma_zone_set_warning(9) function that sets a warning, which will be printed once the given zone becomes full and cannot allocate an item. The warning will not be printed more often than every five minutes.
All UMA warnings can be globally turned off by setting sysctl/tunable vm.zone_warnings to 0.
Discussed on: arch Obtained from: WHEEL Systems MFC after: 2 weeks
|
243659 |
28-Nov-2012 |
alc |
Add support for the (relatively) new object type OBJT_MGTDEVICE to vm_object_set_memattr(). Also, add a "safety belt" so that vm_object_set_memattr() doesn't silently modify undefined object types.
Reviewed by: kib MFC after: 10 days
|
243529 |
25-Nov-2012 |
alc |
Make a few small changes to vm_map_pmap_enter():
Add detail to the comment describing this function. In particular, describe what MAP_PREFAULT_PARTIAL does.
Eliminate the abrupt change in behavior when the specified address range grows from MAX_INIT_PT pages to MAX_INIT_PT plus one pages. Instead of doing nothing, i.e., preloading no mappings whatsoever, map any resident pages that fall within the start of the specified address range, i.e., [addr, addr + ulmin(size, ptoa(MAX_INIT_PT))).
Long ago, the vm object's list of resident pages was not ordered, so this function had to choose between probing the global hash table of all resident pages and iterating over the vm object's unordered list of resident pages. Now that the list is ordered, there is no reason for MAP_PREFAULT_PARTIAL to be concerned with the vm object's count of resident pages.
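The clamping described above replaces an all-or-nothing cutoff with a bounded prefix: the preloaded range is always [addr, addr + ulmin(size, ptoa(MAX_INIT_PT))). A sketch of that bound, with illustrative PAGE_SIZE and MAX_INIT_PT values rather than the kernel's:

```c
#include <assert.h>

#define PAGE_SIZE   4096UL	/* illustrative */
#define MAX_INIT_PT 96UL	/* illustrative page cap */
#define ptoa(x)     ((x) * PAGE_SIZE)

static unsigned long
ulmin(unsigned long a, unsigned long b)
{
	return (a < b ? a : b);
}

/* End of the address range [addr, end) that will be preloaded: the
 * whole range when it is small, the first MAX_INIT_PT pages when it is
 * large, never nothing. */
static unsigned long
prefault_end(unsigned long addr, unsigned long size)
{
	return (addr + ulmin(size, ptoa(MAX_INIT_PT)));
}
```

The old behavior was a step function (preload everything up to MAX_INIT_PT pages, then abruptly nothing); this bound makes the preloaded span grow monotonically and then saturate.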
MFC after: 14 days
|
243366 |
21-Nov-2012 |
alc |
Correct an error in r230623. When both VM_ALLOC_NODUMP and VM_ALLOC_ZERO were specified to vm_page_alloc(), PG_NODUMP wasn't being set on the allocated page when it happened to be pre-zeroed.
MFC after: 5 days
|
243333 |
20-Nov-2012 |
jh |
- Don't pass geom and provider names as format strings. - Add __printflike() attributes. - Remove an extra argument for the g_new_geomf() call in swapongeom_ev().
Reviewed by: pjd
|
243176 |
17-Nov-2012 |
alc |
Update a comment to reflect the elimination of the hold queue in r242300.
|
243132 |
16-Nov-2012 |
kib |
Move the declaration of vm_phys_paddr_to_vm_page() from vm/vm_page.h to vm/vm_phys.h, where it belongs.
Requested and reviewed by: alc MFC after: 2 weeks
|
243131 |
16-Nov-2012 |
kib |
Explicitly state, using an assertion, that M_USE_RESERVE requires M_NOWAIT.
Reviewed by: alc MFC after: 2 weeks
|
243040 |
14-Nov-2012 |
kib |
Flip the semantic of M_NOWAIT to only require the allocation to not sleep, and perform the page allocations with VM_ALLOC_SYSTEM class. Previously, the allocation was also allowed to completely drain the reserve of the free pages, being translated to VM_ALLOC_INTERRUPT request class for vm_page_alloc() and similar functions.
Allow the caller of malloc* to request the 'deep drain' semantic by providing M_USE_RESERVE flag, now translated to VM_ALLOC_INTERRUPT class. Previously, it resulted in less aggressive VM_ALLOC_SYSTEM allocation class.
Centralize the translation of the M_* malloc(9) flags in the single inline function malloc2vm_flags().
Discussion started by: "Sears, Steven" <Steven.Sears@netapp.com> Reviewed by: alc, mdf (previous version) Tested by: pho (previous version) MFC after: 2 weeks
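The translation described above can be sketched as a single inline function. This is a minimal illustration only: the flag values below are hypothetical, and the real malloc2vm_flags() in sys/malloc.h handles more flags and asserts that M_USE_RESERVE is accompanied by M_NOWAIT.

```c
#include <assert.h>

/* Hypothetical flag values for illustration; the real definitions live
 * in sys/malloc.h and vm/vm_page.h. */
#define M_NOWAIT        0x0001
#define M_WAITOK        0x0002
#define M_ZERO          0x0100
#define M_USE_RESERVE   0x0200

#define VM_ALLOC_SYSTEM    0x0020
#define VM_ALLOC_INTERRUPT 0x0040
#define VM_ALLOC_WAITOK    0x0080
#define VM_ALLOC_ZERO      0x1000

/* Sketch of the centralized translation: M_NOWAIT now maps to the
 * VM_ALLOC_SYSTEM class, and only M_USE_RESERVE escalates to the
 * deep-drain VM_ALLOC_INTERRUPT class. */
static inline int
malloc2vm_flags(int malloc_flags)
{
	int pflags = 0;

	if (malloc_flags & M_USE_RESERVE)
		pflags |= VM_ALLOC_INTERRUPT;	/* may drain the reserve */
	else if (malloc_flags & M_NOWAIT)
		pflags |= VM_ALLOC_SYSTEM;	/* no sleep, keep reserve */
	else
		pflags |= VM_ALLOC_WAITOK;
	if (malloc_flags & M_ZERO)
		pflags |= VM_ALLOC_ZERO;
	return (pflags);
}
```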
|
242941 |
13-Nov-2012 |
alc |
Replace the single, global page queues lock with per-queue locks on the active and inactive paging queues.
Reviewed by: kib
|
242903 |
12-Nov-2012 |
attilio |
Fix DDB command "show map XXX": - Check that an argument is always available; otherwise the current map printed before recursing is garbage. - Print a message if an argument is not provided. - Remove the unread nlines variable. - Use an explicit recursive function, disassociated from the DB_SHOW_COMMAND() body, in order to make the prototype and recursion of the above mentioned function clear. The code is now much less obscure.
Submitted by: gianni
|
242476 |
02-Nov-2012 |
kib |
r241025 fixed the case when a binary, executed from a nullfs mount, was still possible to open for write from the lower filesystem. There is a symmetric situation where the binary could already have file descriptors opened for write, and then be executed from the nullfs overlay.
Handle the issue by passing one v_writecount reference to the lower vnode if nullfs vnode has non-zero v_writecount. Note that only one write reference can be donated, since nullfs only keeps one use reference on the lower vnode. Always use the lower vnode v_writecount for the checks.
Introduce the VOP_GET_WRITECOUNT to read v_writecount, which is currently always bypassed to the lower vnode, and VOP_ADD_WRITECOUNT to manipulate the v_writecount value, which manages a single bypass reference to the lower vnode. Calling the VOPs instead of directly accessing v_writecount provides the fix described in the previous paragraph.
Tested by: pho MFC after: 3 weeks
|
242434 |
01-Nov-2012 |
alc |
In general, we call pmap_remove_all() before calling vm_page_cache(). So, the call to pmap_remove_all() within vm_page_cache() is usually redundant. This change eliminates that call to pmap_remove_all() and introduces a call to pmap_remove_all() before vm_page_cache() in the one place where it didn't already exist.
When iterating over a paging queue, if the object containing the current page has a zero reference count, then the page can't have any managed mappings. So, a call to pmap_remove_all() is pointless.
Change a panic() call in vm_page_cache() to a KASSERT().
MFC after: 6 weeks
|
242402 |
31-Oct-2012 |
attilio |
Rework the known mutexes to benefit from staying on their own cache line, using struct mtx_padalign instead of manual frobbing.
The sole exceptions are the nvme and sfxge drivers, where the author redefined CACHE_LINE_SIZE manually, so they need to be analyzed and dealt with separately.
Reviewed by: jimharris, alc
|
242300 |
29-Oct-2012 |
alc |
Replace the page hold queue, PQ_HOLD, by a new page flag, PG_UNHOLDFREE, because the queue itself serves no purpose. When a held page is freed, inserting the page into the hold queue has the side effect of setting the page's "queue" field to PQ_HOLD. Later, when the page is unheld, it will be freed because the "queue" field is PQ_HOLD. In other words, PQ_HOLD is used as a flag, not a queue. So, this change replaces it with a flag.
To accommodate the new page flag, make the page's "flags" field wider and "oflags" field narrower.
Reviewed by: kib
|
242268 |
28-Oct-2012 |
trasz |
Remove useless check; vm_pindex_t is unsigned on all architectures.
CID: 3701 Found with: Coverity Prevent
|
242152 |
26-Oct-2012 |
mdf |
Const-ify the zone name argument to uma_zcreate(9).
MFC after: 3 days
|
242151 |
26-Oct-2012 |
andre |
Move the corresponding MTX_SYSINIT() next to their struct mtx declaration to make their relationship more obvious, as done with the other such mutexes.
|
242012 |
24-Oct-2012 |
kib |
Commit the actual text provided by Alan, instead of the wrong update in r242011.
MFC after: 1 week
|
242011 |
24-Oct-2012 |
kib |
Dirty the newly copied anonymous pages after the wired region is forked. Otherwise, pagedaemon might reclaim the page without saving its content into the swap file, resulting in the valid content replaced by zeroes.
Reported and tested by: pho Reviewed and comment update by: alc MFC after: 1 week
|
241896 |
22-Oct-2012 |
kib |
Remove the support for using non-mpsafe filesystem modules.
In particular, do not lock Giant conditionally when calling into the filesystem module, remove the VFS_LOCK_GIANT() and related macros. Stop handling buffers belonging to non-mpsafe filesystems.
The VFS_VERSION is bumped to indicate the interface change which does not result in the interface signatures changes.
Conducted and reviewed by: attilio Tested by: pho
|
241825 |
22-Oct-2012 |
eadler |
Print flags as hex instead of an integer.
PR: kern/168210 Submitted by: linimon Reviewed by: alc Approved by: cperciva MFC after: 3 days
|
241517 |
13-Oct-2012 |
alc |
Move vm_page_requeue() to the only file that uses it.
MFC after: 3 weeks
|
241512 |
13-Oct-2012 |
alc |
Eliminate the conditional for releasing the page queues lock in vm_page_sleep(). vm_page_sleep() is no longer called with this lock held.
Eliminate assertions that the page queues lock is NOT held. These assertions won't translate well to having distinct locks on the active and inactive page queues, and they really aren't that useful.
MFC after: 3 weeks
|
241155 |
03-Oct-2012 |
alc |
Tidy up a bit:
Update some of the comments. In particular, use "sleep" in preference to "block" where appropriate.
Eliminate some unnecessary casts.
Make a few whitespace changes for consistency.
Reviewed by: kib MFC after: 3 days
|
241025 |
28-Sep-2012 |
kib |
Fix the mis-handling of the VV_TEXT on the nullfs vnodes.
If you have a binary on a filesystem which is also mounted over by nullfs, you could execute the binary from the lower filesystem, or from the nullfs mount. When executed from the lower filesystem, the lower vnode gets the VV_TEXT flag set, and the file cannot be modified while the binary is active. But, if executed as the nullfs alias, only the nullfs vnode gets VV_TEXT set, and you can still open the lower vnode for write.
Add a set of VOPs for the VV_TEXT query, set and clear operations, which are correctly bypassed to lower vnode.
Tested by: pho (previous version) MFC after: 2 weeks
|
240862 |
23-Sep-2012 |
alc |
Address a race condition that was introduced in r238212. Unless the page queues lock is acquired before the page lock is released, there is no guarantee that the page will still be in that same page queue when vm_page_requeue() is called.
Reported by: pho In collaboration with: kib MFC after: 3 days
|
240741 |
20-Sep-2012 |
kib |
Plug the accounting leak for the wired pages when msync(MS_INVALIDATE) is performed on a vnode mapping which is wired in another address space.
While there, explicitly assert that the page is unwired and zero the wire_count instead of subtracting. The condition is rechecked later in vm_page_free(_toq) anyway.
Reported and tested by: zont Reviewed by: alc (previous version) MFC after: 1 week
|
240676 |
18-Sep-2012 |
glebius |
If caller specifies UMA_ZONE_OFFPAGE explicitly, then do not waste memory in an allocation for a slab.
Reviewed by: jeff
|
240518 |
14-Sep-2012 |
eadler |
Correct double "the the"
Approved by: cperciva MFC after: 3 days
|
240145 |
05-Sep-2012 |
zont |
- Simplify VM code by using vmspace_wired_count() for counting wired memory of a process.
Reviewed by: avg Approved by: kib (mentor) MFC after: 2 weeks
|
240134 |
05-Sep-2012 |
des |
Whitespace cleanup.
|
240113 |
04-Sep-2012 |
des |
No memory barrier is required. This was pointed out by kib@ a while ago, but I got distracted by other matters.
(for real this time)
|
240105 |
04-Sep-2012 |
des |
Revert previous commit, which was performed in the wrong tree.
|
240096 |
04-Sep-2012 |
des |
No memory barrier is required. This was pointed out by kib@ a while ago, but I got distracted by other matters.
|
240069 |
03-Sep-2012 |
zont |
- After r240026, sgrowsiz should be used in a safer manner.
Approved by: kib (mentor) MFC after: 1 week
|
239895 |
30-Aug-2012 |
zont |
- Remove accounting of locked memory from vsunlock(9) that I missed in r239818.
Approved by: kib (mentor)
|
239818 |
29-Aug-2012 |
zont |
- Don't take an account of locked memory for current process in vslock(9).
There are two consumers of vslock(9): sysctl code and drm driver. These consumers are using locked memory as transient memory, it doesn't belong to a process's memory.
Suggested by: avg Reviewed by: alc Approved by: kib (mentor) MFC after: 2 weeks
|
239723 |
27-Aug-2012 |
pluknet |
Fix typo in previous change: print half the theoretical maximum as the maximum recommended amount.
Reported by: <site freebsd at orientalsensation com> Reviewed by: des
|
239710 |
26-Aug-2012 |
glebius |
Fix function name in keg_cachespread_init() assert.
|
239327 |
16-Aug-2012 |
des |
- When running out of swzone, instead of spewing an error message every tick until the situation is resolved (if ever), just print a single message when running out and another when space becomes available.
- When adding more swap, warn if the total amount exceeds half the theoretical maximum we can handle.
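The first change above is edge-triggered logging: report state transitions, not every failed attempt. A minimal sketch of the idea, with a counter standing in for the kernel's printf and all names illustrative rather than taken from the actual swap_pager code:

```c
#include <assert.h>

/* Edge-triggered warning: one message when the zone runs out, one more
 * when space becomes available again, instead of a message per tick. */
static int swzone_exhausted = 0;
static int messages_printed = 0;

static void
swap_zone_alloc_failed(void)
{
	if (!swzone_exhausted) {
		swzone_exhausted = 1;
		messages_printed++;	/* "swap zone exhausted" */
	}
}

static void
swap_zone_freed(void)
{
	if (swzone_exhausted) {
		swzone_exhausted = 0;
		messages_printed++;	/* "swap zone ok" */
	}
}
```

Repeated failures while already exhausted print nothing further, which is exactly what stops the per-tick spew.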
|
239250 |
14-Aug-2012 |
kib |
For the old mmap syscall, when executing on amd64 or ia64, enforce PROT_EXEC if prot is non-zero, the process is 32bit and the kern.elf32.i386_read_exec sysctl is enabled. This workaround is needed for old i386 a.out binaries, where the dynamic linker did not specify PROT_EXEC for the mapping of the text.
The kern.elf32.i386_read_exec MIB name looks weird for a.out binaries, but I reused the existing knob which already has the needed semantic.
MFC after: 1 week
|
239247 |
14-Aug-2012 |
kib |
Adjust the r205536, by allowing a non-zero offset for anonymous mappings for a.out binaries. Apparently, a.out ld.so from FreeBSD 1.1.5.1 can issue such requests.
Reported and tested by: Dan Plassche <dplassche@gmail.com> MFC after: 1 week
|
239246 |
14-Aug-2012 |
kib |
Do not leave invalid pages in the object after a short read on network file systems (not only NFS proper). Short reads cause pages other than the requested one, which were not filled by the read response, to stay invalid.
Change the vm_page_readahead_finish() interface to not take the error code, but instead to decide whether to free or to (de)activate the page only by its validity. As a result, invalid pages that were not requested are freed even if the read RPC indicated success.
Noted and reviewed by: alc MFC after: 1 week
|
239121 |
07-Aug-2012 |
alc |
Never sleep on busy pages in vm_pageout_launder(), always skip them. Long ago, sleeping on busy pages in vm_pageout_launder() made sense. The call to vm_pageout_flush() specified asynchronous I/O and sleeping on busy pages blocked vm_pageout_launder() until the flush had completed. However, in CVS revision 1.35 of vm/vm_contig.c, the call to vm_pageout_flush() was changed to request synchronous I/O, but the sleep on busy pages was not removed.
|
239065 |
05-Aug-2012 |
kib |
After the PHYS_TO_VM_PAGE() function was de-inlined, the main reason to pull in vm_param.h was removed. The other big dependency of vm_page.h on vm_param.h is the PA_LOCK* definitions, which are only needed for in-kernel code, because modules use KBI-safe functions to lock the pages.
Stop including vm_param.h into vm_page.h. Include vm_param.h explicitly for the kernel code which needs it.
Suggested and reviewed by: alc MFC after: 2 weeks
|
239040 |
04-Aug-2012 |
kib |
Reduce code duplication and exposure of direct access to struct vm_page oflags by providing the helper function vm_page_readahead_finish(), which handles completed reads for pages with indexes other than the requested one, for VOP_GETPAGES().
Reviewed by: alc MFC after: 1 week
|
238998 |
03-Aug-2012 |
alc |
Inline vm_page_aflags_clear() and vm_page_aflags_set().
Add comments stating that neither these functions nor the flags that they are used to manipulate are part of the KBI.
|
238915 |
30-Jul-2012 |
alc |
Eliminate an unneeded declaration. (I should have removed this as part of r227568.)
|
238791 |
26-Jul-2012 |
kib |
Do not requeue held page or page for which locking failed, just leave them alone.
Process the act_count updates for the held pages in the vm_pageout loop over the inactive queue, instead of refusing to do anything with such page.
Clarify the intent of the addl_page_shortage counter and change its use for pages which are not processed in the loop according to the description.
Reviewed by: alc MFC after: 2 weeks
|
238732 |
24-Jul-2012 |
alc |
Addendum to r238604. If the inactive queue scan isn't restarted, then the variable "addl_page_shortage_init" isn't needed.
X-MFC after: r238604
|
238604 |
18-Jul-2012 |
kib |
Do not restart the scan of the inactive queue when a non-inactive page is found. Rather, we should not find such pages on the inactive queue at all.
Requested and reviewed by: alc MFC after: 2 weeks
|
238561 |
18-Jul-2012 |
alc |
Move what remains of vm/vm_contig.c into vm/vm_pageout.c, where similar code resides. Rename vm_contig_grow_cache() to vm_pageout_grow_cache().
Reviewed by: kib
|
238543 |
17-Jul-2012 |
alc |
Correct vm_page_alloc_contig()'s implementation of VM_ALLOC_NODUMP.
|
238536 |
16-Jul-2012 |
alc |
Various improvements to vm_contig_grow_cache(). Most notably, even when it can't sleep, it can still move clean pages from the inactive queue to the cache. Also, when a page is cached, there is no need to restart the scan. The "next" page pointer held by vm_contig_launder() is still valid. Finally, add a comment summarizing what vm_contig_grow_cache() does based upon the value of "tries".
MFC after: 3 weeks
|
238510 |
15-Jul-2012 |
alc |
Correct an off-by-one error in vm_reserv_alloc_contig() that resulted in the last reservation of a multi-reservation allocation not being initialized.
|
238502 |
15-Jul-2012 |
mdf |
Fix a bug with memguard(9) on 32-bit architectures without a VM_KMEM_MAX_SIZE.
The code was not taking into account the size of the kernel_map, which the kmem_map is allocated from, so it could produce a sub-map size too large to fit. The simplest solution is to ignore VM_KMEM_MAX entirely and base the memguard map's size off the kernel_map's size, since this is always relevant and always smaller.
Found by: Justin Hibbits
|
238456 |
14-Jul-2012 |
alc |
If vm_contig_grow_cache() is allowed to sleep, then invoke the vm_lowmem handlers.
|
238452 |
14-Jul-2012 |
alc |
Move kmem_alloc_{attr,contig}() to vm/vm_kern.c, where similarly named functions reside. Correct the comment describing kmem_alloc_contig().
|
238359 |
11-Jul-2012 |
attilio |
Document the object type movements, related to swp_pager_copy(), in vm_object_collapse() and vm_object_split().
In collaboration with: alc MFC after: 3 days
|
238258 |
08-Jul-2012 |
kib |
Avoid vm page queues lock leak after r238212.
Reported and tested by: Michael Butler <imb protected-networks net> Reviewed by: alc Pointy hat to: kib MFC after: 20 days
|
238212 |
07-Jul-2012 |
kib |
Drop page queues mutex on each iteration of vm_pageout_scan over the inactive queue, unless busy page is found.
Dropping the mutex often should allow other lock acquires to proceed without waiting for the whole inactive scan to finish. On machines with a lot of physical memory the scan often needs to iterate a lot before it finishes or finds a page which requires laundering, causing high latency for other lock waiters.
Suggested and reviewed by: alc MFC after: 3 weeks
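The locking pattern described above can be modeled in miniature. This is a toy sketch, not the vm_pageout code: a stub lock that merely counts acquisitions stands in for the page queues mutex, and the point is the release/reacquire on every iteration so other waiters can interleave.

```c
#include <assert.h>

/* Stub lock: counts acquires so the test can observe the pattern. */
static int acquires = 0;
static void mtx_lock_stub(void)   { acquires++; }
static void mtx_unlock_stub(void) { }

/* Scan a queue of npages entries, dropping the mutex between
 * iterations (the r238212 behavior) instead of holding it across
 * the entire scan. */
static int
scan_queue(int npages)
{
	int processed = 0;

	mtx_lock_stub();
	for (int i = 0; i < npages; i++) {
		processed++;
		mtx_unlock_stub();	/* let other waiters in */
		mtx_lock_stub();
	}
	mtx_unlock_stub();
	return (processed);
}
```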
|
238206 |
07-Jul-2012 |
eadler |
Add missing sleep stat increase
PR: kern/168211 Submitted by: linimon Reviewed by: alc Approved by: cperciva MFC after: 3 days
|
238180 |
06-Jul-2012 |
kib |
Style.
Reviewed by: alc (previous version) MFC after: 1 week
|
238000 |
02-Jul-2012 |
jhb |
Honor db_pager_quit in 'show uma' and 'show malloc'.
MFC after: 1 month
|
237623 |
27-Jun-2012 |
alc |
Add new pmap layer locks to the predefined lock order. Change the names of a few existing VM locks to follow a consistent naming scheme.
|
237451 |
22-Jun-2012 |
attilio |
- Add a comment explaining the locking of the cached pages pool held by vm_objects. - Add flags for the per-object lock and free pages queue mutex lock. Use the newly added flags to mark the cache root within the vm_object structure.
Please note that other vm_object members should be marked with correct locking but they are left for other commits.
In collaboration with: alc
MFC after: 3 days
|
237346 |
20-Jun-2012 |
alc |
Selectively inline vm_page_dirty().
|
237334 |
20-Jun-2012 |
jhb |
Move the per-thread deferred user map entries list into a private list in vm_map_process_deferred() which is then iterated to release map entries. This avoids having a nested vm map unlock operation called from the loop body attempt to recurse into vm_map_process_deferred(). This can happen if the vm_map_remove() triggers the OOM killer.
Reviewed by: alc, kib MFC after: 1 week
|
237172 |
16-Jun-2012 |
attilio |
Do a more targeted check on the page cache and avoid checking the cache pointer directly in vnode_pager_setsize() by using the newly introduced vm_page_is_cached() function.
Reviewed by: alc MFC after: 2 weeks X-MFC: r234039,234064
|
237168 |
16-Jun-2012 |
alc |
The page flag PGA_WRITEABLE is set and cleared exclusively by the pmap layer, but it is read directly by the MI VM layer. This change introduces pmap_page_is_write_mapped() in order to completely encapsulate all direct access to PGA_WRITEABLE in the pmap layer.
Aesthetics aside, I am making this change because amd64 will likely begin using an alternative method to track write mappings, and having pmap_page_is_write_mapped() in place allows me to make such a change without further modification to the MI VM layer.
As an added bonus, tidy up some nearby comments concerning page flags.
Reviewed by: kib MFC after: 6 weeks
|
236848 |
10-Jun-2012 |
kib |
Use the previous stack entry protection and max protection to correctly propagate the stack execution permissions when stack is grown down.
First, curproc->p_sysent->sv_stackprot specifies maximum allowed stack protection for current ABI, so the new stack entry was typically marked executable always. Second, for non-main stack MAP_STACK mapping, the PROT_ flags should be used which were specified at the mmap(2) call time, and not sv_stackprot.
MFC after: 1 week
|
236417 |
01-Jun-2012 |
eadler |
Revert r236380
PR: kern/166780 Requested by: many Approved by: cperciva (implicit)
|
236380 |
01-Jun-2012 |
eadler |
Add sysctl to query amount of swap space free
PR: kern/166780 Submitted by: Radim Kolar <hsn@sendmail.cz> Approved by: cperciva MFC after: 1 week
|
235854 |
23-May-2012 |
emax |
Tweak the condition for disabling allocation from per-CPU buckets in a low memory situation. I've observed a situation where per-CPU allocations were disabled while there were enough free cached pages. Basically, cnt.v_free_count was sitting stable at a value lower than cnt.v_free_min, and that caused a massive performance drop.
Reviewed by: alc MFC after: 1 week
|
235850 |
23-May-2012 |
kib |
Calculate the count of per-process cow faults. Export the count to userspace using the obscure spare int field in struct kinfo_proc.
Submitted by: Andrey Zonov <andrey zonov org> MFC after: 1 week
|
235829 |
23-May-2012 |
avg |
vm_pager_object_lookup: small performance optimization
do not needlessly lock an object if its handle doesn't match
Reviewed by: kib, alc MFC after: 1 week
|
235776 |
22-May-2012 |
andrew |
Fix booting on ARM.
In PHYS_TO_VM_PAGE() when VM_PHYSSEG_DENSE is set the check if we are past the end of vm_page_array was incorrect causing it to return NULL. This value is then used in vm_phys_add_page causing a data abort.
Reviewed by: alc, kib, imp Tested by: stas
|
235689 |
20-May-2012 |
nwhitehorn |
Replace the list of PVOs owned by each PMAP with an RB tree. This simplifies range operations like pmap_remove() and pmap_protect() as well as allowing simple operations like pmap_extract() not to involve any global state. This substantially reduces lock coverages for the global table lock and improves concurrency.
|
235603 |
18-May-2012 |
kib |
Do not double-reference the found vm object in cdev_pager_lookup(). vm_pager_object_lookup() already referenced the object.
Note that there are no in-tree consumers of cdev_pager_lookup(). The only known user of the function is the i915 gem driver, which is not yet imported. This should make the KPI change minor.
Submitted by: avg MFC after: 1 week
|
235375 |
12-May-2012 |
kib |
Add a new pager type, OBJT_MGTDEVICE. It provides the device pager which carries fictitious managed pages. In particular, the consumers of the new object type can remove all mappings of the device page with pmap_remove_all().
The range of physical addresses used for fake page allocation shall be registered with vm_phys_fictitious_reg_range() interface to allow the PHYS_TO_VM_PAGE() to work in pmap.
Most likely, only i386 and amd64 pmaps can handle fictitious managed pages right now.
Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month
|
235372 |
12-May-2012 |
kib |
Add a facility to register a range of physical addresses to be used for allocation of fictitious pages, for which PHYS_TO_VM_PAGE() returns a proper fictitious vm_page_t. The range should be de-registered after the consumer has stopped using it.
De-inline the PHYS_TO_VM_PAGE() since it now carries code to iterate over registered ranges.
A hash container might be developed instead of the range registration interface, and fake pages could be put automatically into the hash, where PHYS_TO_VM_PAGE() could look them up later. This should be considered before the MFC of the commit is done.
Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month
|
235366 |
12-May-2012 |
kib |
Split the code from vm_page_getfake() to initialize the fake page struct vm_page into new interface vm_page_initfake(). Handle the case of fake page re-initialization with changed memattr.
Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month
|
235365 |
12-May-2012 |
kib |
Assert that the page passed to vm_page_putfake() is unmanaged.
Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month
|
235362 |
12-May-2012 |
kib |
Assert that fictitious or unmanaged pages do not appear on active/inactive lists.
Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month
|
235359 |
12-May-2012 |
kib |
Commit the change forgotten in r235356.
Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month
|
235356 |
12-May-2012 |
kib |
Make vm_page_array_size long. Remove redundant zero initialization for vm_page_array_size and nearby variables.
Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month
|
235230 |
10-May-2012 |
alc |
Give vm_fault()'s sequential access optimization a makeover.
There are two aspects to the sequential access optimization: (1) read ahead of pages that are expected to be accessed in the near future and (2) unmap and cache behind of pages that are not expected to be accessed again. This revision changes both aspects.
The read ahead optimization is now more effective. It starts with the same initial read window as before, but arithmetically grows the window on sequential page faults. This can yield increased read bandwidth. For example, on one of my machines, a program using mmap() to read a file that is several times larger than the machine's physical memory takes about 17% less time to complete.
The unmap and cache behind optimization is now more selectively applied. The read ahead window must grow to its maximum size before unmap and cache behind is performed. This significantly reduces the number of times that pages are unmapped and cached only to be reactivated a short time later.
The unmap and cache behind optimization now clears each page's referenced flag. Previously, in the case of dirty pages, if the containing file was still mapped at the time that the page daemon examined the dirty pages, they would be reactivated.
From a stylistic standpoint, this revision also cleanly separates the implementation of the read ahead and unmap/cache behind optimizations.
Glanced at: kib MFC after: 2 weeks
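The arithmetic growth of the read-ahead window can be sketched as follows. The constants and function names here are illustrative stand-ins, not the actual identifiers in vm/vm_fault.c; the point is the shape of the policy: grow by a fixed step on sequential faults, cap at a maximum, and only consider unmap/cache behind once the cap is reached.

```c
#include <assert.h>

#define FAULT_READ_INIT 8	/* initial window, in pages (illustrative) */
#define FAULT_READ_MAX  64	/* cap; cache-behind starts only here */

/* Widen the read-ahead window arithmetically on a sequential fault. */
static int
grow_read_ahead(int window)
{
	int next = window + FAULT_READ_INIT;	/* arithmetic, not geometric */

	return (next > FAULT_READ_MAX ? FAULT_READ_MAX : next);
}

/* Unmap and cache behind only once the window is saturated, which is
 * what keeps briefly-idle pages from being unmapped prematurely. */
static int
should_cache_behind(int window)
{
	return (window == FAULT_READ_MAX);
}
```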
|
234576 |
22-Apr-2012 |
nwhitehorn |
Avoid a lock order reversal in pmap_extract_and_hold() from relocking the page. This PMAP requires an additional lock besides the PMAP lock in pmap_extract_and_hold(), which vm_page_pa_tryrelock() did not release.
Suggested by: kib MFC after: 4 days
|
234556 |
21-Apr-2012 |
kib |
When MAP_STACK mapping is created, the map entry is created only to cover the initial stack size. For MCL_WIREFUTURE maps, the subsequent call to vm_map_wire() to wire the whole stack region fails due to VM_MAP_WIRE_NOHOLES flag.
Use the VM_MAP_WIRE_HOLESOK to only wire mapped part of the stack.
Reported and tested by: Sushanth Rai <sushanth_rai yahoo com> Reviewed by: alc MFC after: 1 week
|
234554 |
21-Apr-2012 |
alc |
As documented in vm_page.h, updates to the vm_page's flags no longer require the page queues lock.
MFC after: 1 week
|
234064 |
09-Apr-2012 |
attilio |
- Introduce a cache-miss optimization for consistency with other accesses of the cache member of vm_object objects. - Use novel vm_page_is_cached() for checks outside of the vm subsystem.
Reviewed by: alc MFC after: 2 weeks X-MFC: r234039
|
234039 |
08-Apr-2012 |
alc |
Fix mincore(2) so that it reports PG_CACHED pages as resident.
MFC after: 2 weeks
|
234038 |
08-Apr-2012 |
alc |
If a page belonging to a reservation is cached, then mark the reservation so that it will be freed to the cache pool rather than the default pool. Otherwise, the cached pages within the reservation may be recycled sooner than necessary.
Reported by: Andrey Zonov
|
233960 |
06-Apr-2012 |
attilio |
Staticize vm_page_cache_remove().
Reviewed by: alc
|
233949 |
06-Apr-2012 |
nwhitehorn |
Reduce the frequency that the PowerPC/AIM pmaps invalidate instruction caches, by invalidating kernel icaches only when needed and not flushing user caches for shared pages.
Suggested by: kib MFC after: 2 weeks
|
233925 |
05-Apr-2012 |
jhb |
Add new ktrace records for the start and end of VM faults. This gives a pair of records similar to syscall entry and return that a user can use to determine how long page faults take. The new ktrace records are enabled via the 'p' trace type, and are enabled in the default set of trace points.
Reviewed by: kib MFC after: 2 weeks
|
233627 |
28-Mar-2012 |
mckusick |
Keep track of the mount point associated with a special device to enable the collection of counts of synchronous and asynchronous reads and writes for its associated filesystem. The counts are displayed using `mount -v'.
Ensure that buffers used for paging indicate the vnode from which they are operating so that counts of paging I/O operations from the filesystem are collected.
This checkin only adds the setting of the mount point for the UFS/FFS filesystem, but it would be trivial to add the setting and clearing of the mount point at filesystem mount/unmount time for other filesystems too.
Reviewed by: kib
|
233291 |
22-Mar-2012 |
alc |
Handle spurious page faults that may occur in no-fault sections of the kernel.
When access restrictions are added to a page table entry, we flush the corresponding virtual address mapping from the TLB. In contrast, when access restrictions are removed from a page table entry, we do not flush the virtual address mapping from the TLB. This is exactly as recommended in AMD's documentation. In effect, when access restrictions are removed from a page table entry, AMD's MMUs will transparently refresh a stale TLB entry. In short, this saves us from having to perform potentially costly TLB flushes. In contrast, Intel's MMUs are allowed to generate a spurious page fault based upon the stale TLB entry. Usually, such spurious page faults are handled by vm_fault() without incident. However, when we are executing no-fault sections of the kernel, we are not allowed to execute vm_fault(). This change introduces special-case handling for spurious page faults that occur in no-fault sections of the kernel.
In collaboration with: kib Tested by: gibbs (an earlier version)
I would also like to acknowledge Hiroki Sato's assistance in diagnosing this problem.
MFC after: 1 week
|
233194 |
19-Mar-2012 |
jhb |
Bah, just revert my earlier change entirely. (Missed alc's request to do this earlier.)
Requested by: alc
|
233191 |
19-Mar-2012 |
jhb |
Fix madvise(MADV_WILLNEED) to properly handle individual mappings larger than 4GB. Specifically, the inlined version of 'ptoa' of the 'int' count of pages overflowed on 64-bit platforms. While here, change vm_object_madvise() to accept two vm_pindex_t parameters (start and end) rather than a (start, count) tuple to match other VM APIs as suggested by alc@.
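The overflow can be reproduced in a few lines. This sketch is illustrative: the narrow version models the 32-bit truncation with uint32_t (keeping the wraparound well-defined rather than undefined signed overflow), and the wide version widens the count before shifting, as the commit does by switching to vm_pindex_t-sized parameters.

```c
#include <stdint.h>
#include <assert.h>

#define PAGE_SHIFT 12	/* 4 KB pages, for illustration */

/* The buggy pattern: computing ptoa() on an 'int' page count keeps the
 * intermediate result in 32 bits, so mappings of 4 GB and beyond wrap. */
static int64_t
ptoa_narrow(int npages)
{
	return ((uint32_t)npages << PAGE_SHIFT);	/* truncates mod 2^32 */
}

/* The fix: widen the count before shifting. */
static int64_t
ptoa_wide(int64_t npages)
{
	return (npages << PAGE_SHIFT);
}
```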
|
233190 |
19-Mar-2012 |
jhb |
Alter the previous commit to use vm_size_t instead of vm_pindex_t. vm_pindex_t is not a count of pages per se, it is more like vm_ooffset_t, but a page index instead of a byte offset.
|
233100 |
17-Mar-2012 |
kib |
In vm_object_page_clean(), do not clear the OBJ_MIGHTBEDIRTY object flag if the filesystem performed a short write and we are skipping the page due to this.
Propagate write errors from the pager back to the callers of vm_pageout_flush(). Report the failure to write a page from the requested range as the FALSE return value from vm_object_page_clean(), and propagate it back to msync(2) to return EIO to usermode.
While there, convert the clearobjflags variable in the vm_object_page_clean() and arguments of the helper functions to boolean.
PR: kern/165927 Reviewed by: alc MFC after: 2 weeks
|
232984 |
14-Mar-2012 |
jhb |
Pedantic nit: use vm_pindex_t instead of long for a count of pages.
|
232701 |
08-Mar-2012 |
jhb |
Add KTR_VFS traces to track modifications to a vnode's writecount.
|
232399 |
02-Mar-2012 |
alc |
Eliminate stale incorrect ARGSUSED comments.
Submitted by: bde
|
232288 |
29-Feb-2012 |
alc |
Simplify kmem_alloc() by eliminating code that existed on account of external pagers in Mach. FreeBSD doesn't implement external pagers. Moreover, it doesn't page out the kernel object, so the reasons for having this code no longer hold.
Reviewed by: kib MFC after: 6 weeks
|
232166 |
25-Feb-2012 |
alc |
Simplify vm_mmap()'s control flow.
Add a comment describing what vm_mmap_to_errno() does.
Reviewed by: kib MFC after: 3 weeks X-MFC after: r232071
|
232160 |
25-Feb-2012 |
alc |
Simplify vmspace_fork()'s control flow by copying immutable data before the vm map locks are acquired. Also, eliminate redundant initialization of the new vm map's timestamp.
Reviewed by: kib MFC after: 3 weeks
|
232103 |
24-Feb-2012 |
kib |
Place the if() at the right location, to activate the v_writecount accounting for shared writeable mappings for all filesystems, not only for the bypass layers.
Submitted by: alc Pointy hat to: kib MFC after: 20 days
|
232071 |
23-Feb-2012 |
kib |
Account the writeable shared mappings backed by file in the vnode v_writecount. Keep the amount of the virtual address space used by the mappings in the new vm_object un_pager.vnp.writemappings counter. The vnode v_writecount is incremented when writemappings gets non-zero value, and decremented when writemappings is returned to zero.
Writeable shared vnode-backed mappings are accounted for in vm_mmap(), and vm_map_insert() is instructed to set MAP_ENTRY_VN_WRITECNT flag on the created map entry. During deferred map entry deallocation, vm_map_process_deferred() checks for MAP_ENTRY_VN_WRITECOUNT and decrements writemappings for the vm object.
Now, the writeable mount cannot be demoted to read-only while writeable shared mappings of the vnodes from the mount point exist. Also, execve(2) fails for such files with ETXTBUSY, as it should be.
Noted by: tegge Reviewed by: tegge (long time ago, early version), alc Tested by: pho MFC after: 3 weeks
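The transition rule described above (v_writecount tracks whether writemappings is non-zero) can be sketched in userspace. The struct and helper names below are illustrative stand-ins, not the kernel's actual types:

```c
#include <assert.h>

/* Hypothetical model of the r232071 bookkeeping: the vnode's
 * v_writecount is incremented when the object's writemappings counter
 * leaves zero, and decremented when it returns to zero. */
struct vnode_model { int v_writecount; };
struct object_model { long writemappings; struct vnode_model *vp; };

static void
writemappings_adjust(struct object_model *obj, long delta)
{
	long old = obj->writemappings;

	obj->writemappings += delta;
	if (old == 0 && obj->writemappings != 0)
		obj->vp->v_writecount++;	/* first writeable mapping */
	else if (old != 0 && obj->writemappings == 0)
		obj->vp->v_writecount--;	/* last writeable mapping gone */
}
```

With this invariant, "any writeable shared mapping exists" is answerable from v_writecount alone, which is what lets the mount demotion and ETXTBUSY checks work.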
|
232002 |
22-Feb-2012 |
kib |
Remove wrong comment.
Discussed with: alc MFC after: 3 days
|
231819 |
16-Feb-2012 |
alc |
When vm_mmap() is used to map a vm object into a kernel vm_map, it makes no sense to check the size of the kernel vm_map against the user-level resource limits for the calling process.
Reviewed by: kib
|
231526 |
11-Feb-2012 |
kib |
Close a race due to dropping of the map lock between creating a map entry for a shared mapping and marking the entry for inheritance. Another thread might execute vmspace_fork() in between (e.g. by fork(2)), resulting in the mapping becoming private.
Noted and reviewed by: alc MFC after: 1 week
|
231378 |
10-Feb-2012 |
ed |
Remove direct access to si_name.
Code should just use the devtoname() function to obtain the name of a character device. Also add const keywords to pieces of code that need it to build properly.
MFC after: 2 weeks
|
230877 |
01-Feb-2012 |
mav |
Fix NULL dereference panic on attempt to turn off (on system shutdown) disconnected swap device.
This is a quick and imperfect solution, as the swap device will still be open and GEOM will not be able to destroy it. The proper solution would be to automatically turn off and close the disconnected swap device, but with the existing code that would cause a panic if there is at least one page on the device, even if it is an unimportant page of a user-level process. It needs some work.
Reviewed by: kib@ MFC after: 1 week
|
230623 |
27-Jan-2012 |
kmacy |
Exclude kmem_alloc'ed ARC data buffers from kernel minidumps on amd64. Excluding other allocations, including UMA ones, now entails only the addition of a single flag to kmem_alloc or uma zone create.
Reviewed by: alc, avg MFC after: 2 weeks
|
230247 |
17-Jan-2012 |
nwhitehorn |
Revert r212360 now that PowerPC can handle large sparse arguments to pmap_remove() (changed in r228412).
MFC after: 2 weeks
|
229934 |
10-Jan-2012 |
kib |
Change the type of the paging_in_progress refcounter from u_short to u_int. With the auto-sized buffer cache on modern machines, UFS metadata can generate more than 65535 pages belonging to the buffers undergoing i/o, overflowing the counter.
Reported and tested by: jimharris Reviewed by: alc MFC after: 1 week
|
229495 |
04-Jan-2012 |
kib |
Do not restart the scan in vm_object_page_clean() on the object generation change if the requested mode is async. The object generation is only changed when the object is marked as OBJ_MIGHTBEDIRTY. For async mode it is enough to write each dirty page, not to guarantee that all pages are clean after vm_object_page_clean() returns.
Diagnosed by: truckman Tested by: flo Reviewed by: alc, truckman MFC after: 2 weeks
|
228936 |
28-Dec-2011 |
alc |
Optimize vm_object_split()'s handling of reservations.
|
228838 |
23-Dec-2011 |
kib |
Optimize the common case of msyncing the whole file mapping with the MS_SYNC flag. The system must guarantee that all writes are finished before the syscall returns. Schedule the writes in async mode, which is much faster and allows the clustering to occur. Wait for writes using VOP_FSYNC(), since we are syncing the whole file mapping.
Potentially, the restriction can be relaxed by not requiring that the mapping cover the whole file, as is done by other OSes.
Reported and tested by: az Reviewed by: alc MFC after: 2 weeks
|
228567 |
16-Dec-2011 |
kib |
Move kstack_cache_entry into the private header, and make the stack cache list header accessible outside vm_glue.c.
MFC after: 1 week
|
228498 |
14-Dec-2011 |
eadler |
- The previous commit (r228449) accidentally moved the vm.stats.vm.* sysctls to vm.stats.sys. Move them back.
Noticed by: pho Reviewed by: bde (earlier version) Approved by: bz MFC after: 1 week Pointy hat to: me
|
228449 |
13-Dec-2011 |
eadler |
Document a large number of currently undocumented sysctls. While here fix some style(9) issues and reduce redundancy.
PR: kern/155491 PR: kern/155490 PR: kern/155489 Submitted by: Galimov Albert <wtfcrap@mail.ru> Approved by: bde Reviewed by: jhb MFC after: 1 week
|
228432 |
12-Dec-2011 |
kib |
Fix printf.
Submitted by: az MFC after: 1 week
|
228287 |
05-Dec-2011 |
alc |
Introduce vm_reserv_alloc_contig() and teach vm_page_alloc_contig() how to use superpage reservations. So, for the first time, kernel virtual memory that is allocated by contigmalloc(), kmem_alloc_attr(), and kmem_alloc_contig() can be promoted to superpages. In fact, even a series of small contigmalloc() allocations may collectively result in a promoted superpage.
Eliminate some duplication of code in vm_reserv_alloc_page().
Change the type of vm_reserv_reclaim_contig()'s first parameter in order that it be consistent with other vm_*_contig() functions.
Tested by: marius (sparc64)
|
228156 |
30-Nov-2011 |
kib |
Rename vm_page_set_valid() to vm_page_set_valid_range(). vm_page_set_valid() is the most reasonable name for the m->valid accessor.
Reviewed by: attilio, alc
|
228133 |
29-Nov-2011 |
kib |
Hide the internals of vm_page_lock(9) from the loadable modules. Since the address of the vm_page lock mutex depends on the kernel options, it is easy for a module to get out of sync with the kernel.
No vm_page_lockptr() accessor is provided for modules. It can be added later if needed, unless proper KPI is developed to serve the needs.
Reviewed by: attilio, alc MFC after: 3 weeks
|
227788 |
21-Nov-2011 |
attilio |
Introduce the same mutex-wise fix in r227758 for sx locks.
The functions that offer file and line specifications are:
- sx_assert_
- sx_downgrade_
- sx_slock_
- sx_slock_sig_
- sx_sunlock_
- sx_try_slock_
- sx_try_xlock_
- sx_try_upgrade_
- sx_unlock_
- sx_xlock_
- sx_xlock_sig_
- sx_xunlock_
Now vm_map locking is fully converted and can avoid knowing specifics about locking procedures. Reviewed by: kib MFC after: 1 month
|
227758 |
20-Nov-2011 |
attilio |
Introduce macro stubs in the mutex implementation that will always be defined and will allow consumers, willing to provide options, file and line to locking requests, to not worry about options redefining the interfaces. This is typically useful when there is the need to build another locking interface on top of the mutex one.
The introduced functions that consumers can use are:
- mtx_lock_flags_
- mtx_unlock_flags_
- mtx_lock_spin_flags_
- mtx_unlock_spin_flags_
- mtx_assert_
- thread_lock_flags_
Spare notes:
- Likely we can get rid of all the 'INVARIANTS' specifications in the ppbus code by using the same macro as done in this patch (but this is left to the ppbus maintainer)
- All the other locking interfaces may require a similar cleanup, where the most notable case is sx, which will allow a further cleanup of the vm_map locking facilities
- The patch should be fully compatible with older branches, thus an MFC is planned (in fact, it uses all the underlying mechanisms already present).
Comments review by: eadler, Ben Kaduk Discussed with: kib, jhb MFC after: 1 month
|
227606 |
17-Nov-2011 |
alc |
Eliminate end-of-line white space.
|
227568 |
16-Nov-2011 |
alc |
Refactor the code that performs physically contiguous memory allocation, yielding a new public interface, vm_page_alloc_contig(). This new function addresses some of the limitations of the current interfaces, contigmalloc() and kmem_alloc_contig(). For example, the physically contiguous memory that is allocated with those interfaces can only be allocated to the kernel vm object and must be mapped into the kernel virtual address space. It also provides functionality that vm_phys_alloc_contig() doesn't, such as wiring the returned pages. Moreover, unlike that function, it respects the low water marks on the paging queues and wakes up the page daemon when necessary. That said, at present, this new function can't be applied to all types of vm objects. However, that restriction will be eliminated in the coming weeks.
From a design standpoint, this change also addresses an inconsistency between vm_phys_alloc_contig() and the other vm_phys_alloc*() functions. Specifically, vm_phys_alloc_contig() manipulated vm_page fields that other functions in vm/vm_phys.c didn't. Moreover, vm_phys_alloc_contig() knew about vnodes and reservations. Now, vm_page_alloc_contig() is responsible for these things.
Reviewed by: kib Discussed with: jhb
|
227530 |
15-Nov-2011 |
kib |
Update the device pager interface, while keeping the compatibility layer for old KPI and KBI. New interface should be used together with d_mmap_single cdevsw method.
Device pager can be allocated with the cdev_pager_allocate(9) function, which takes struct cdev_pager_ops, containing constructor/destructor and page fault handler methods supplied by driver.
Constructor and destructor, called at the pager allocation and deallocation time, allow the driver to handle per-object private data.
The pager handler is called to handle a page fault on the vm map entry backed by the driver pager. The driver shall return either the vm_page_t which should be mapped, or an error code (which no longer causes a kernel panic). The page handler interface has a placeholder to specify the access mode causing the fault, but currently PROT_READ is always passed there.
Sponsored by: The FreeBSD Foundation Reviewed by: alc MFC after: 1 month
|
227529 |
15-Nov-2011 |
kib |
Remove the condition that is always true.
Submitted by: alc MFC after: 1 week
|
227309 |
07-Nov-2011 |
ed |
Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.
The SYSCTL_NODE macro defines a list that stores all child-elements of that node. If there's no SYSCTL_DECL macro anywhere else, there's no reason why it shouldn't be static.
|
227127 |
06-Nov-2011 |
alc |
Wake up the page daemon in vm_page_alloc_freelist() if it couldn't allocate the requested page because too few pages are cached or free.
Document the VM_ALLOC_COUNT() option to vm_page_alloc() and vm_page_alloc_freelist().
Make style changes to vm_page_alloc() and vm_page_alloc_freelist(), such as using a variable name that more closely corresponds to the comments.
|
227103 |
05-Nov-2011 |
kib |
Remove redundant definitions. The chunk was missed from r227102.
MFC after: 2 weeks
|
227102 |
05-Nov-2011 |
kib |
Provide typedefs for the type of bit mask for the page bits. Use the defined types instead of int when manipulating masks. Supposedly, it could fix support for 32KB page size in the machine-independent VM layer.
Reviewed by: alc MFC after: 2 weeks
|
227072 |
04-Nov-2011 |
alc |
Simplify the implementation of the failure case in kmem_alloc_attr().
|
227070 |
04-Nov-2011 |
jhb |
Add the posix_fadvise(2) system call. It is somewhat similar to madvise(2) except that it operates on a file descriptor instead of a memory region. It is currently only supported on regular files.
Just as with madvise(2), the advice given to posix_fadvise(2) can be divided into two types. The first type provides hints about data access patterns and is used in the file read and write routines to modify the I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are thus filesystem independent. Note that to ease implementation (and since this API is only advisory anyway), only a single non-normal range is allowed per file descriptor.
The second type of hints is used to hint to the OS that data will or will not be used. These hints are implemented via a new VOP_ADVISE(). A default implementation is provided which does nothing for the WILLNEED request and attempts to move any clean pages to the cache page queue for the DONTNEED request. This latter case required two other changes. First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests vinvalbuf() to only flush clean buffers for the vnode from the buffer cache and to not remove any backing pages from the vnode. This is used to ensure clean pages are not wired into the buffer cache before attempting to move them to the cache page queue. The second change adds a new vm_object_page_cache() method. This method is somewhat similar to vm_object_page_remove() except that instead of freeing each page in the specified range, it attempts to move clean pages to the cache queue if possible.
To preserve the ABI of struct file, the f_cdevpriv pointer is now reused in a union to point to the currently active advice region if one is present for regular files.
Reviewed by: jilles, kib, arch@ Approved by: re (kib) MFC after: 1 month
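A minimal userland sketch of the first type of hint: advising the kernel that a file will be read sequentially. The helper name is made up; posix_fadvise(2) itself is the interface this commit adds:

```c
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: hint sequential access for a whole file. */
static int
advise_sequential(const char *path)
{
	int fd, error;

	fd = open(path, O_RDONLY | O_CREAT, 0600);
	if (fd == -1)
		return (-1);
	/* A len of 0 means "from offset to the end of the file". */
	error = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
	(void)close(fd);
	return (error);	/* 0 on success, an errno value otherwise */
}
```

Unlike madvise(2), the hint travels with the file descriptor, so the kernel can apply it in the read path rather than against a memory region.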
|
227012 |
02-Nov-2011 |
alc |
Add support for VM_ALLOC_WIRED and VM_ALLOC_ZERO to vm_page_alloc_freelist() and use these new options in the mips pmap.
Wake up the page daemon in vm_page_alloc_freelist() if the number of free and cached pages becomes too low.
Tidy up vm_page_alloc_init(). In particular, add a comment about an important restriction on its use.
Tested by: jchandra@
|
226928 |
30-Oct-2011 |
alc |
Eliminate vm_phys_bootstrap_alloc(). It was a failed attempt at eliminating duplicated code in the various pmap implementations.
Micro-optimize vm_phys_free_pages().
Introduce vm_phys_free_contig(). It is a fast routine for freeing an arbitrary number of physically contiguous pages. In particular, it doesn't require the number of pages to be a power of two.
Use "u_long" instead of "unsigned long".
Bruce Evans (bde@) has convinced me that the "boundary" parameters to kmem_alloc_contig(), vm_phys_alloc_contig(), and vm_reserv_reclaim_contig() should be of type "vm_paddr_t" and not "u_long". Make this change.
|
226891 |
28-Oct-2011 |
alc |
Use "u_long" instead of "unsigned long".
|
226848 |
27-Oct-2011 |
alc |
Tidy up the comment at the head of vm_page_alloc, and mention that the returned page has the flag VPO_BUSY set.
|
226843 |
27-Oct-2011 |
alc |
Eliminate vestiges of page coloring in VM_ALLOC_NOOBJ calls to vm_page_alloc(). While I'm here, for the sake of consistency, always specify the allocation class, such as VM_ALLOC_NORMAL, as the first of the flags.
|
226824 |
27-Oct-2011 |
alc |
contigmalloc(9) and contigfree(9) are now implemented in terms of other more general VM system interfaces. So, their implementation can now reside in kern_malloc.c alongside the other functions that are declared in malloc.h.
|
226740 |
25-Oct-2011 |
alc |
Speed up vm_page_cache() and vm_page_remove() by checking for a few common cases that can be handled in constant time. The insight being that a page's parent in the vm object's tree is very often its predecessor or successor in the vm object's ordered memq.
Tested by: jhb MFC after: 10 days
|
226642 |
22-Oct-2011 |
attilio |
VM_NRESERVLEVEL is used in this file but opt_vm.h is not included, thus the stub switch won't be correctly handled. Include opt_vm.h.
Submitted by: jeff MFC after: 3 days
|
226388 |
15-Oct-2011 |
kib |
Control the execution permission of the readable segments for i386 binaries on the amd64 and ia64 with the sysctl, instead of unconditionally enabling it.
Reviewed by: marcel
|
226366 |
14-Oct-2011 |
jhb |
Fix a typo in a comment.
|
226343 |
13-Oct-2011 |
marcel |
In sys_obreak() and when compiling for amd64 or ia64, when the process is ILP32 (i.e. i386) grant execute permissions by default. The JDK 1.4.x depends on being able to execute from the heap on i386.
|
226313 |
12-Oct-2011 |
glebius |
Make memguard(9) capable of guarding uma(9) allocations.
|
225856 |
29-Sep-2011 |
kib |
Style nit.
Submitted by: jhb MFC after: 2 weeks
|
225843 |
28-Sep-2011 |
kib |
Fix grammar.
Submitted by: bf MFC after: 2 weeks
|
225840 |
28-Sep-2011 |
kib |
Use the trick of performing the atomic operation on the contained aligned word to handle the dirty mask updates in vm_page_clear_dirty_mask(). Remove the vm page queue lock around the vm_page_dirty() call in vm_fault_hold(), the sole purpose of which was to protect dirty on architectures which do not provide short or byte-wide atomics.
Reviewed by: alc, attilio Tested by: flo (sparc64) MFC after: 2 weeks
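The trick can be demonstrated in userland: when only word-sized atomics exist, a byte is updated by compare-and-swapping the aligned 32-bit word that contains it. This sketch assumes a little-endian layout; the kernel version lives in vm_page_clear_dirty_mask(), not this hypothetical helper:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Clear "mask" bits in the byte at p, using only a 32-bit CAS on the
 * aligned word containing that byte (little-endian assumed). */
static void
clear_byte_bits_via_word(uint8_t *p, uint8_t mask)
{
	uintptr_t addr = (uintptr_t)p;
	_Atomic uint32_t *word = (_Atomic uint32_t *)(addr & ~(uintptr_t)3);
	uint32_t wmask = (uint32_t)mask << ((addr & 3) * 8);
	uint32_t old = atomic_load(word);

	/* On CAS failure "old" is refreshed with the current value; retry. */
	while (!atomic_compare_exchange_weak(word, &old, old & ~wmask))
		;
}
```

The neighboring bytes in the word are carried through the CAS unchanged, which is what makes the per-byte lock unnecessary.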
|
225838 |
28-Sep-2011 |
kib |
Use the explicitly-sized types for the dirty and valid masks.
Requested by: attilio Reviewed by: alc MFC after: 2 weeks
|
225617 |
16-Sep-2011 |
kmacy |
In order to maximize the re-usability of kernel code in user space this patch modifies makesyscalls.sh to prefix all of the non-compatibility calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel entry points and all places in the code that use them. It also fixes an additional name space collision between the kernel function psignal and the libc function of the same name by renaming the kernel psignal kern_psignal(). By introducing this change now we will ease future MFCs that change syscalls.
Reviewed by: rwatson Approved by: re (bz)
|
225418 |
06-Sep-2011 |
kib |
Split the vm_page flags PG_WRITEABLE and PG_REFERENCED into a new atomic flags field. Updates to the atomic flags are performed using atomic ops on the containing word, do not require any vm lock to be held, and are non-blocking. The vm_page_aflag_set(9) and vm_page_aflag_clear(9) functions are provided to modify the aflags.
Document the changes to flags field to only require the page lock.
Introduce vm_page_reference(9) function to provide a stable KPI and KBI for filesystems like tmpfs and zfs which need to mark a page as referenced.
Reviewed by: alc, attilio Tested by: marius, flo (sparc64); andreast (powerpc, powerpc64) Approved by: re (bz)
|
225089 |
22-Aug-2011 |
kib |
Update some comments in swap_pager.c.
Reviewed and most wording by: alc MFC after: 1 week Approved by: re (bz)
|
225076 |
22-Aug-2011 |
kib |
Apply the limit to avoid the overflows in the radix tree subr_blist.c after the conversion of the swap device size to page size units, not before. That lifts the limit on the usable swap partition size from 32GB to 256GB, which is less depressing for modern systems.
Submitted by: Alexander V. Chernikov <melifaro ipfw ru> Reviewed by: alc Approved by: re (bz) MFC after: 2 weeks
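The size change follows from simple arithmetic. Assuming the radix tree can address roughly 2^26 blocks, a 512-byte DEV_BSIZE, and 4KB pages (these constants are assumptions for illustration, not taken from the source), applying the same block-count limit in page units instead of device-block units multiplies the ceiling by 8:

```c
#include <stdint.h>

/* The addressable block count is fixed; the byte ceiling depends on
 * the unit size the limit is applied in. */
#define BLIST_MAX_BLOCKS (1ULL << 26)

static uint64_t
max_swap_bytes(uint64_t unit_size)
{
	return (BLIST_MAX_BLOCKS * unit_size);
}
```

With 512-byte units the cap is 2^26 * 512 = 32GB; with 4096-byte page units it becomes 2^26 * 4096 = 256GB.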
|
224778 |
11-Aug-2011 |
rwatson |
Second-to-last commit implementing Capsicum capabilities in the FreeBSD kernel for FreeBSD 9.0:
Add a new capability mask argument to fget(9) and friends, allowing system call code to declare what capabilities are required when an integer file descriptor is converted into an in-kernel struct file *. With options CAPABILITIES compiled into the kernel, this enforces capability protection; without, this change is effectively a no-op.
Some cases require special handling, such as mmap(2), which must preserve information about the maximum rights at the time of mapping in the memory map so that they can later be enforced in mprotect(2) -- this is done by narrowing the rights in the existing max_protection field used for similar purposes with file permissions.
In namei(9), we assert that the code is not reached from within capability mode, as we're not yet ready to enforce namespace capabilities there. This will follow in a later commit.
Update two capability names: CAP_EVENT and CAP_KEVENT become CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they represent.
Approved by: re (bz) Submitted by: jonathan Sponsored by: Google Inc
|
224746 |
09-Aug-2011 |
kib |
- Move the PG_UNMANAGED flag from m->flags to m->oflags, renaming the flag to VPO_UNMANAGED (and also making the flag protected by the vm object lock, instead of the vm page queue lock).
- Mark the fake pages with both PG_FICTITIOUS (as it is now) and VPO_UNMANAGED. As a consequence, pmap code now can use just VPO_UNMANAGED to decide whether the page is unmanaged.
Reviewed by: alc Tested by: pho (x86, previous version), marius (sparc64), marcel (arm, ia64, powerpc), ray (mips) Sponsored by: The FreeBSD Foundation Approved by: re (bz)
|
224689 |
07-Aug-2011 |
alc |
Fix an error in kmem_alloc_attr(). Unless "tries" is updated, kmem_alloc_attr() could get stuck in a loop.
Approved by: re (kib) MFC after: 3 days
|
224582 |
01-Aug-2011 |
kib |
Implement the linprocfs swaps file, providing information about the configured swap devices in the Linux-compatible format.
Based on the submission by: Robert Millan <rmh debian org> PR: kern/159281 Reviewed by: bde Approved by: re (kensmith) MFC after: 2 weeks
|
224522 |
30-Jul-2011 |
kib |
Fix a race in the device pager allocation. If another thread won and allocated the device pager for the given handle, then the object fictitious pages list and the object membership in the global object list still need to be initialized. Otherwise, dev_pager_dealloc() will traverse uninitialized pointers.
Reported and tested by: pho Reviewed by: jhb Approved by: re (kensmith) MFC after: 1 week
|
223914 |
10-Jul-2011 |
kib |
Extract the code to translate VM error into errno, into an exported function vm_mmap_to_errno(). It is useful for the drivers that implement mmap(2)-like functionality, to be able to return error codes consistent with mmap(2).
Sponsored by: The FreeBSD Foundation No objections from: alc MFC after: 1 week
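The translation is a small switch over VM-layer return codes. The sketch below is an approximation of what vm_mmap_to_errno() does, with illustrative stand-ins for the kernel's KERN_* constants; the exact set of cases in the real function may differ:

```c
#include <errno.h>

/* Illustrative stand-ins for the kernel's KERN_* return codes. */
enum { KERN_SUCCESS, KERN_INVALID_ADDRESS, KERN_PROTECTION_FAILURE,
       KERN_NO_SPACE };

/* Collapse VM-layer return codes into the errno values mmap(2)
 * documents, so drivers report errors consistently with mmap(2). */
static int
vm_error_to_errno(int rv)
{
	switch (rv) {
	case KERN_SUCCESS:
		return (0);
	case KERN_INVALID_ADDRESS:
	case KERN_NO_SPACE:
		return (ENOMEM);
	case KERN_PROTECTION_FAILURE:
		return (EACCES);
	default:
		return (EINVAL);
	}
}
```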
|
223913 |
10-Jul-2011 |
kib |
Style.
MFC after: 3 days
|
223889 |
09-Jul-2011 |
kib |
Add a facility to disable processing page faults. When activated, uiomove generates EFAULT if any accessed address is not mapped, as opposed to handling the fault.
Sponsored by: The FreeBSD Foundation Reviewed by: alc (previous version)
|
223825 |
06-Jul-2011 |
trasz |
All the racct_*() calls need to happen with the proc locked. Fixing this won't happen before 9.0. This commit adds "#ifdef RACCT" around all the "PROC_LOCK(p); racct_whatever(p, ...); PROC_UNLOCK(p)" instances, in order to avoid useless locking/unlocking in kernels built without "options RACCT".
|
223823 |
06-Jul-2011 |
attilio |
Handle a race between device_pager and devsw in a more graceful manner: return an error code rather than panic the kernel.
Sponsored by: Sandvine Incorporated Reviewed by: kib Tested by: pho MFC after: 2 weeks
|
223729 |
02-Jul-2011 |
alc |
Initialize marker pages as held rather than fictitious/wired. Marking the page as held is more useful as a safety precaution in case someone forgets to check for PG_MARKER.
Reviewed by: kib
|
223677 |
29-Jun-2011 |
alc |
Add a new option, OBJPR_NOTMAPPED, to vm_object_page_remove(). Passing this option to vm_object_page_remove() asserts that the specified range of pages is not mapped, or more precisely that none of these pages have any managed mappings. Thus, vm_object_page_remove() need not call pmap_remove_all() on the pages.
This change not only saves time by eliminating pointless calls to pmap_remove_all(), but it also eliminates an inconsistency in the use of pmap_remove_all() versus related functions, like pmap_remove_write(). It eliminates harmless but pointless calls to pmap_remove_all() that were being performed on PG_UNMANAGED pages.
Update all of the existing assertions on pmap_remove_all() to reflect this change.
Reviewed by: kib
|
223464 |
23-Jun-2011 |
alc |
Revert to using the page queues lock in vm_page_clear_dirty_mask() on MIPS. (At present, although atomic_clear_char() is defined by atomic.h on MIPS, it is not actually implemented by support.S.)
|
223307 |
19-Jun-2011 |
alc |
Precisely document the synchronization rules for the page's dirty field. (Saying that the lock on the object that the page belongs to must be held only represents one aspect of the rules.)
Eliminate the use of the page queues lock for atomically performing read- modify-write operations on the dirty field when the underlying architecture supports atomic operations on char and short types.
Document the fact that 32KB pages aren't really supported.
Reviewed by: attilio, kib
|
222992 |
11-Jun-2011 |
kib |
Assert that the page is VPO_BUSY or the page owner object is locked in vm_page_undirty(). The assert is not precise because the VPO_BUSY owner is not tracked, so the assertion does not catch the case when VPO_BUSY is owned by another thread.
Reviewed by: alc
|
222991 |
11-Jun-2011 |
kib |
Fix a bug in r222586. Lock the page owner object around the modification of the m->dirty.
Reported and tested by: nwhitehorn Reviewed by: alc
|
222586 |
01-Jun-2011 |
kib |
In the VOP_PUTPAGES() implementations, change the default error from VM_PAGER_AGAIN to VM_PAGER_ERROR for the unwritten pages. Return VM_PAGER_AGAIN for the partially written page. Always forward at least one page in the loop of vm_object_page_clean().
VM_PAGER_ERROR causes the page reactivation and does not clear the page dirty state, so the write is not lost.
The change fixes an infinite loop in vm_object_page_clean() when the filesystem returns permanent errors for some page writes.
Reported and tested by: gavin Reviewed by: alc, rmacklem MFC after: 1 week
|
222184 |
22-May-2011 |
alc |
Correct an error in r222163. Unless UMA_MD_SMALL_ALLOC is defined, startup_alloc() must be used until uma_startup2() is called.
Reported by: jh
|
222163 |
21-May-2011 |
alc |
1. Prior to r214782, UMA did not support multipage allocations before uma_startup2() was called. Thus, setting the variable "booted" to true in uma_startup() was ok on machines with UMA_MD_SMALL_ALLOC defined, because any allocations made after uma_startup() but before uma_startup2() could be satisfied by uma_small_alloc(). Now, however, some multipage allocations are necessary before uma_startup2() just to allocate zone structures on machines with a large number of processors. Thus, a Boolean can no longer effectively describe the state of the UMA allocator. Instead, make "booted" have three values to describe how far initialization has progressed. This allows multipage allocations to continue using startup_alloc() until uma_startup2(), but single-page allocations may begin using uma_small_alloc() after uma_startup().
2. With the aforementioned change, only a modest increase in boot pages is necessary to boot UMA on a large number of processors.
3. Retire UMA_MD_SMALL_ALLOC_NEEDS_VM. It has only been used between r182028 and r204128.
Reviewed by: attilio [1], nwhitehorn [3] Tested by: sbruno
|
222137 |
20-May-2011 |
alc |
Fix spelling errors.
|
222132 |
20-May-2011 |
alc |
Eliminate a redundant #include. ("vm/vm_param.h" already includes "machine/vmparam.h".)
|
221855 |
13-May-2011 |
mdf |
Move the ZERO_REGION_SIZE to a machine-dependent file, as on many architectures (i386, for example) the virtual memory space may be constrained enough that 2MB is a large chunk. Use 64K for arches other than amd64 and ia64, with special handling for sparc64 due to differing hardware.
Also commit the comment changes to kmem_init_zero_region() that I missed due to not saving the file. (Darn the unfamiliar development environment).
Arch maintainers, please feel free to adjust ZERO_REGION_SIZE as you see fit.
Requested by: alc MFC after: 1 week MFC with: r221853
|
221853 |
13-May-2011 |
mdf |
Use a globally visible region of zeros for both /dev/zero and the md device. There are likely other kernel uses of "blob of zeros" that can be converted.
Reviewed by: alc MFC after: 1 week
|
221714 |
09-May-2011 |
mlaier |
Another long standing vm bug found at Isilon: Fix a race between vm_object_collapse and vm_fault.
Reviewed by: alc@ MFC after: 3 days
|
221096 |
26-Apr-2011 |
obrien |
Reap old SPL comments.
Reviewed by: alc
|
220977 |
23-Apr-2011 |
kib |
Fix two bugs in r218670.
Hold the vnode around the region where object lock is dropped, until vnode lock is acquired.
Do not drop the vnode reference for a case when the object was deallocated during unlock. Note that in this case, VV_TEXT is cleared by vnode_pager_dealloc().
Reported and tested by: pho Reviewed by: alc MFC after: 3 days
|
220390 |
06-Apr-2011 |
jhb |
Fix several places to ignore processes that are not yet fully constructed.
MFC after: 1 week
|
220387 |
06-Apr-2011 |
trasz |
In vm_daemon(), do not skip processes stopped with SIGSTOP.
|
220386 |
06-Apr-2011 |
trasz |
Add RACCT_RSS.
Sponsored by: The FreeBSD Foundation Reviewed by: kib (earlier version)
|
220373 |
05-Apr-2011 |
trasz |
Add accounting for most of the memory-related resources.
Sponsored by: The FreeBSD Foundation Reviewed by: kib (earlier version)
|
220001 |
25-Mar-2011 |
kib |
Handle the corner case in vm_fault_quick_hold_pages().
If the supplied length is zero and the user address is invalid, the function might return -1, due to the truncation and rounding of the address. The callers interpret this situation as EFAULT. Instead of handling the zero length in the callers, filter it in vm_fault_quick_hold_pages().
Sponsored by: The FreeBSD Foundation Reviewed by: alc
|
219968 |
24-Mar-2011 |
jhb |
Fix some locking nits with the p_state field of struct proc:
- Hold the proc lock while changing the state from PRS_NEW to PRS_NORMAL in fork to honor the locking requirements. While here, expand the scope of the PROC_LOCK() on the new process (p2) to avoid some LORs. Previously the code was locking the new child process (p2) after it had locked the parent process (p1). However, when locking two processes, the safe order is to lock the child first, then the parent.
- Fix various places that were checking p_state against PRS_NEW without having the process locked to use PROC_LOCK(). Every place was already locking the process, just after the PRS_NEW check.
- Remove or reduce the use of PROC_SLOCK() for places that were checking p_state against PRS_NEW. The PROC_LOCK() alone is sufficient for reading the current state.
- Reorder fill_kinfo_proc() slightly so it only acquires PROC_SLOCK() once.
MFC after: 1 week
|
219819 |
21-Mar-2011 |
jeff |
- Merge changes to the base system to support OFED. These include a wider arg2 for sysctl, updates to vlan code, IFT_INFINIBAND, and other miscellaneous small features.
|
219727 |
18-Mar-2011 |
trasz |
In vm_daemon(), when iterating over all processes in the system, skip those which are not yet fully initialized (i.e. ones with p_state == PRS_NEW). Without it, we could panic in _thread_lock_flags().
Note that there may be other instances of FOREACH_PROC_IN_SYSTEM() that require similar fix.
Reported by: pho, keramida Discussed with: kib
|
219476 |
11-Mar-2011 |
alc |
Eliminate duplication of the fake page code and zone by the device and sg pagers.
Reviewed by: jhb
|
219124 |
01-Mar-2011 |
brucec |
Change the return type of vmspace_swap_count to a long to match the other vmspace_*_count functions.
MFC after: 3 days
|
218989 |
24-Feb-2011 |
pluknet |
Remove the sysctl vm.max_proc_mmap, used to protect from KVA space exhaustion. As pointed out by Alan Cox, it no longer serves its purpose with the modern UMA allocator, compared to the old one used in the 4.x days.
The removal of the sysctl eliminates the max_proc_mmap type overflow leading to the broken mmap(2) seen with large amounts of physical memory on arches with effectively unbounded KVA space (such as amd64). It was found that slightly less than 256GB of physmem was enough to trigger the overflow.
Reviewed by: alc, kib Approved by: avg (mentor) MFC after: 2 months
|
218966 |
23-Feb-2011 |
brucec |
Calculate and return the count in vmspace_swap_count as a vm_offset_t instead of an int to avoid overflow.
While here, clean up some style(9) issues.
PR: kern/152200 Reviewed by: kib MFC after: 2 weeks
|
218773 |
17-Feb-2011 |
alc |
Remove pmap fields that are either unused or not fully implemented.
Discussed with: kib
|
218701 |
15-Feb-2011 |
kib |
Since r218070 reenabled the call to vm_map_simplify_entry() from vm_map_insert(), the kmem_back() assumption about the newly inserted entry might be broken due to the interference of two factors. In the low memory condition, when vm_page_alloc() returns NULL, the supplied map is unlocked. If another thread performs kmem_malloc() meanwhile, and its map entry is placed right next to our thread's map entry in the map, both entries' wire counts are still 0 and the entries are coalesced by vm_map_simplify_entry().
Mark the new entry with MAP_ENTRY_IN_TRANSITION to prevent coalescing. Fix some style issues, and tighten the assertions to account for the MAP_ENTRY_IN_TRANSITION state.
Reported and tested by: pho Reviewed by: alc
|
218670 |
13-Feb-2011 |
kib |
Lock the vnode around clearing of VV_TEXT flag. Remove mp_fixme() note mentioning that vnode lock is needed.
Reviewed by: alc Tested by: pho MFC after: 1 week
|
218592 |
12-Feb-2011 |
jmallett |
Use CPU_FOREACH rather than expecting CPUs 0 through mp_ncpus-1 to be present. Don't micro-optimize the uniprocessor case; use the same loop there.
Submitted by: Bhanu Prakash Reviewed by: kib, jhb
|
218589 |
12-Feb-2011 |
alc |
Retire VFS_BIO_DEBUG. Convert those checks that were still valid into KASSERT()s and eliminate the rest.
Replace excessive printf()s and a panic() in bufdone_finish() with a KASSERT() in vm_page_io_finish().
Reviewed by: kib
|
218345 |
05-Feb-2011 |
alc |
Unless "cnt" exceeds MAX_COMMIT_COUNT, nfsrv_commit() and nfsvno_fsync() are incorrectly calling vm_object_page_clean(). They are passing the length of the range rather than the ending offset of the range.
Perform the OFF_TO_IDX() conversion in vm_object_page_clean() rather than the callers.
Reviewed by: kib MFC after: 3 weeks
|
218304 |
04-Feb-2011 |
alc |
Since the last parameter to vm_object_shadow() is a vm_size_t and not a vm_pindex_t, it makes no sense for its callers to perform atop(). Let vm_object_shadow() do that instead.
|
218113 |
31-Jan-2011 |
alc |
Release the free page queues lock earlier in vm_page_alloc().
Discussed with: kib@
|
218070 |
29-Jan-2011 |
alc |
Reenable the call to vm_map_simplify_entry() from vm_map_insert() for non- MAP_STACK_* entries. (See r71983 and r74235.)
In some cases, performing this call to vm_map_simplify_entry() halves the number of vm map entries used by the Sun JDK.
|
217916 |
27-Jan-2011 |
mdf |
Explicitly wire the user buffer rather than doing it implicitly in sbuf_new_for_sysctl(9). This allows using an sbuf with a SYSCTL_OUT drain for extremely large amounts of data where the caller knows that appropriate references are held, and sleeping is not an issue.
Inspired by: rwatson
|
217688 |
21-Jan-2011 |
pluknet |
Make the MSGBUF_SIZE kernel option a loader tunable, kern.msgbufsize.
Submitted by: perryh pluto.rain.com (previous version) Reviewed by: jhb Approved by: kib (mentor) Tested by: universe
|
217529 |
18-Jan-2011 |
alc |
Move the definition of M_VMPGDATA to the swap pager, where the only remaining uses are.
|
217508 |
17-Jan-2011 |
alc |
Explicitly initialize the page's queue field to PQ_NONE instead of relying on PQ_NONE being zero.
Redefine PQ_NONE and PQ_COUNT so that a page queue isn't allocated for PQ_NONE.
Reviewed by: kib@
|
217482 |
16-Jan-2011 |
alc |
Sort function prototypes.
|
217479 |
16-Jan-2011 |
alc |
Update a lock annotation on the page structure.
|
217478 |
16-Jan-2011 |
alc |
Shift responsibility for synchronizing access to the page's act_count field to the object's lock.
Reviewed by: kib@
|
217477 |
16-Jan-2011 |
alc |
Clean up the start of vm_page_alloc(). In particular, eliminate an assertion that is no longer required. Long ago, calls to vm_page_alloc() from an interrupt handler had to specify VM_ALLOC_INTERRUPT so that vm_page_alloc() would not attempt to reclaim a PQ_CACHE page from another vm object. Today, with the synchronization on a vm object's collection of PQ_CACHE pages, this is no longer an issue. In fact, VM_ALLOC_INTERRUPT now reclaims PQ_CACHE pages just like VM_ALLOC_{NORMAL,SYSTEM}.
MFC after: 3 weeks
|
217463 |
15-Jan-2011 |
kib |
For consistency, use kernel_object instead of &kernel_object_store when initializing the object mutex. Do the same for kmem_object.
Discussed with: alc MFC after: 1 week
|
217453 |
15-Jan-2011 |
alc |
For some time now, the kernel and kmem objects have been ordinary OBJT_PHYS objects. Thus, there is no need for handling them specially in vm_fault(). In fact, this special case handling would have led to an assertion failure just before the call to pmap_enter().
Reviewed by: kib@ MFC after: 6 weeks
|
217265 |
11-Jan-2011 |
jhb |
Remove unneeded includes of <sys/linker_set.h>. Other headers that use it internally contain nested includes.
Reviewed by: bde
|
217192 |
09-Jan-2011 |
kib |
Move repeated MAXSLP definition from machine/vmparam.h to sys/vmmeter.h. Update the outdated comments describing MAXSLP and the process selection algorithm for swap out.
Comments wording and reviewed by: alc
|
217177 |
09-Jan-2011 |
alc |
Eliminate a redundant alignment directive on the page locks array.
|
217171 |
08-Jan-2011 |
alc |
Eliminate the counting of vm_page_pa_tryrelock calls. We really don't need it anymore. Moreover, its implementation had a type mismatch, a long is not necessarily an uint64_t. (This mismatch was hidden by casting.) Move the remaining two counters up a level in the sysctl hierarchy. There is no reason for them to be under the vm.pmap node.
Reviewed by: kib
|
216899 |
03-Jan-2011 |
alc |
Release the page lock early in vm_pageout_clean(). There is no reason to hold this lock until the end of the function.
With the aforementioned change to vm_pageout_clean(), page locks don't need to support recursive (MTX_RECURSE) or duplicate (MTX_DUPOK) acquisitions.
Reviewed by: kib
|
216874 |
01-Jan-2011 |
alc |
Make a couple refinements to r216799 and r216810. In particular, revise a comment and move it to its proper place.
Reviewed by: kib
|
216873 |
01-Jan-2011 |
brucec |
There can be more than 0x20000000 swap meta blocks allocated if a swap-backed md(4) device is used. Don't panic when deallocating such a device if swap has been used.
PR: kern/133170 Discussed with: kib MFC after: 3 days
|
216810 |
29-Dec-2010 |
kib |
Remove the OBJ_CLEANING flag. vfs_setdirty_locked_object() is the only consumer of the flag, and it used the flag because OBJ_MIGHTBEDIRTY was cleared early in vm_object_page_clean(), before the cleaning pass was done. This is no longer true after r216799.
Moreover, since OBJ_CLEANING is a flag and not a counter, it could be reset prematurely when parallel vm_object_page_clean() calls are performed.
Reviewed by: alc (as a part of the bigger patch) MFC after: 1 month (after r216799 is merged)
|
216807 |
29-Dec-2010 |
alc |
There is no point in vm_contig_launder{,_page}() flushing held pages, instead skip over them. As long as a page is held, it can't be reclaimed by contigmalloc(M_WAITOK). Moreover, a held page may be undergoing modification, e.g., vmapbuf(), so even if the hold were released before the completion of contigmalloc(), the page might have to be flushed again.
MFC after: 3 weeks
|
216799 |
29-Dec-2010 |
kib |
Move the increment of vm object generation count into vm_object_set_writeable_dirty().
Fix an issue where a restart of the scan in vm_object_page_clean() did not remove write permissions for newly added pages, or for an already scanned page whose mapping became writeable due to a fault. Merge the two loops in vm_object_page_clean(), performing the removal of write permission and the cleaning in the same loop. The restart of the loop then correctly downgrades writeable mappings.
Fix an issue where a second caller to msync() might return before the first caller had completed flushing the pages. Clear the OBJ_MIGHTBEDIRTY flag after the cleaning loop, not before.
Calls to pmap_is_modified() are not needed after pmap_remove_write() there.
Proposed, reviewed and tested by: alc MFC after: 1 week
|
216772 |
28-Dec-2010 |
alc |
Correct a typo in vm_fault_quick_hold_pages().
Reported by: Bartosz Stec
|
216731 |
27-Dec-2010 |
alc |
Move vm_object_print()'s prototype to the expected place.
|
216701 |
26-Dec-2010 |
alc |
Retire vm_fault_quick(). It's no longer used.
Reviewed by: kib@
|
216699 |
25-Dec-2010 |
alc |
Introduce and use a new VM interface for temporarily pinning pages. This new interface replaces the combined use of vm_fault_quick() and pmap_extract_and_hold() throughout the kernel.
In collaboration with: kib@
|
216604 |
20-Dec-2010 |
alc |
Introduce vm_fault_hold() and use it to (1) eliminate a long-standing race condition in proc_rwmem() and to (2) simplify the implementation of the cxgb driver's vm_fault_hold_user_pages(). Specifically, in proc_rwmem() the requested read or write could fail because the targeted page could be reclaimed between the calls to vm_fault() and vm_page_hold().
In collaboration with: kib@ MFC after: 6 weeks
|
216511 |
17-Dec-2010 |
alc |
Implement and use a single optimized function for unholding a set of pages.
Reviewed by: kib@
|
216425 |
14-Dec-2010 |
alc |
Change memguard_fudge() so that it can handle km_max being zero. Not every platform defines VM_KMEM_SIZE_MAX, and on those platforms km_max will be zero.
Reviewed by: mdf Tested by: marius
|
216335 |
09-Dec-2010 |
mlaier |
Fix a long-standing (from the original 4.4BSD-Lite sources) race between vmspace_fork and vm_map_wire that would lead to "vm_fault_copy_wired: page missing" panics. While faulting in pages for a map entry that is being wired down, mark the containing map as busy. In vmspace_fork, wait until the map is unbusy before trying to copy the entries.
Reviewed by: kib MFC after: 5 days Sponsored by: Isilon Systems, Inc.
|
216319 |
09-Dec-2010 |
jchandra |
Revert the vm/vm_page.c change in r216317.
This adds back the changes from r216141, which were reverted by the above check-in.
|
216317 |
09-Dec-2010 |
jchandra |
swi_vm() for mips.
|
216186 |
04-Dec-2010 |
trasz |
Fix comment indentation.
|
216141 |
03-Dec-2010 |
imp |
To make minidumps work properly on mips for memory that's direct mapped and entered via vm_page_setup, keep track of it like we do for amd64.
# A separate commit will be made to move this to a capability-based ifdef # rather than arch-based ifdef.
Submitted by: alc@ MFC after: 1 week
|
216128 |
02-Dec-2010 |
trasz |
Replace pointer to "struct uidinfo" with pointer to "struct ucred" in "struct vm_object". This is required to make it possible to account for per-jail swap usage.
Reviewed by: kib@ Tested by: pho@ Sponsored by: FreeBSD Foundation
|
216090 |
01-Dec-2010 |
alc |
Correct an error in the allocation of the vm_page_dump array in vm_page_startup(). Specifically, the dump_avail array should be used instead of the phys_avail array to calculate the size of vm_page_dump. For example, the pages for the message buffer are allocated prior to vm_page_startup() by subtracting them from the last entry in the phys_avail array, but the first thing that vm_page_startup() does after creating the vm_page_dump array is to set the bits corresponding to the message buffer pages in that array. However, these bits might not actually exist in the array, because the size of the array is determined by the current value in the last entry of the phys_avail array. In general, the only reason why this doesn't always result in an out-of-bounds array access is that the size of the vm_page_dump array is rounded up to the next page boundary. This change eliminates that dependence on rounding (and luck).
MFC after: 6 weeks
|
215973 |
28-Nov-2010 |
jchandra |
Fix an issue noted by alc while reviewing r215938: the current implementation of vm_page_alloc_freelist() does not handle order > 0 correctly. Remove the order parameter from the function and use it only for order 0 pages.
Submitted by: alc
|
215796 |
24-Nov-2010 |
kib |
After the sleep caused by encountering a busy page, relookup the page.
Submitted and reviewed by: alc Reported and tested by: pho MFC after: 5 days
|
215610 |
21-Nov-2010 |
kib |
Eliminate the mab, maf arrays and related variables.
The change also fixes an off-by-one error in the calculation of mreq.
Suggested and reviewed by: alc Tested by: pho MFC after: 5 days
|
215597 |
20-Nov-2010 |
alc |
Optimize vm_object_terminate().
Reviewed by: kib MFC after: 1 week
|
215574 |
20-Nov-2010 |
kib |
The runlen returned from vm_pageout_flush() might legitimately be zero, when the mreq page has status VM_PAGER_AGAIN.
MFC after: 5 days
|
215538 |
19-Nov-2010 |
alc |
Reduce the amount of detail printed by vm_page_free_toq() when it panics.
Reviewed by: kib
|
215508 |
19-Nov-2010 |
mlaier |
Off by one page in vm_reserv_reclaim_contig(): Also reclaim reservations with only a single free page if that satisfies the requested size.
MFC after: 3 days Reviewed by: alc
|
215471 |
18-Nov-2010 |
kib |
vm_pageout_flush() might cache pages that have finished being written to the backing storage. Such pages might then be reused, racing with the assert in vm_object_page_collect_flush() that verified that dirty pages from the run (most likely, pages with VM_PAGER_AGAIN status) are still write-protected. In fact, the page indexes for the pages that were removed from the object page list should be ignored by vm_object_page_clean().
Return the length of the successfully written run from vm_pageout_flush(), that is, the count of pages between the requested page and the first page after it with status VM_PAGER_AGAIN. Supply the requested page index in the array to vm_pageout_flush(). Use the returned run length to advance the index of the next page to clean in vm_object_page_clean().
Reported by: avg Reviewed by: alc MFC after: 1 week
|
215469 |
18-Nov-2010 |
kib |
Only increment the object generation count when inserting a page into the object page list. The only use of the object generation count now is to restart the scan in vm_object_page_clean(), which makes sense to do on page addition. Neither page removals nor manipulations of the shadow chain affect the dirtiness of the object.
Suggested and reviewed by: alc MFC after: 1 week
|
215321 |
14-Nov-2010 |
kib |
Do not use __FreeBSD_version prefix for the special osrel version. The ports/Mk/bsd.port.mk uses sys/param.h to fetch osrel, and cannot grok several constants with the prefix.
Reported and tested by: swell.k gmail com MFC after: 1 week
|
215309 |
14-Nov-2010 |
kib |
Use symbolic names instead of hardcoding values for magic p_osrel constants.
MFC after: 1 week
|
215307 |
14-Nov-2010 |
kib |
Implement a (soft) stack guard page for auto-growing stack mappings. The unmapped page separates the tip of the stack and a possible adjacent segment, making some exploits of stack overflows harder. The stack-growing code refuses to expand the segment to the last page of the reserved region when the sysctl security.bsd.stack_guard_page is set to 1. The default value for the sysctl and the accompanying tunable is 0.
Please note that mmap(MAP_FIXED) can still place a mapping right up to the stack, making a continuous region.
Reviewed by: alc MFC after: 1 week
|
215093 |
10-Nov-2010 |
alc |
Enable reservation-based physical memory allocation. Even without the creation of large page mappings in the pmap, it can provide modest performance benefits. In particular, for a "buildworld" on a 2x 1GHz Ultrasparc IIIi it reduced the wall clock time by 2.2% and the system time by 12.6%.
Tested by: marius@
|
214953 |
07-Nov-2010 |
alc |
In case the stack size reaches its limit and its growth must be restricted, ensure that grow_amount is a multiple of the page size. Otherwise, the kernel may crash in swap_reserve_by_uid() on HEAD and FreeBSD 8.x, and produce a core file with a missing stack on FreeBSD 7.x.
Diagnosed and reported by: jilles Reviewed by: kib MFC after: 1 week
|
214903 |
07-Nov-2010 |
gonzo |
- Add minidump support for FreeBSD/mips
|
214782 |
04-Nov-2010 |
jhb |
Update startup_alloc() to support multi-page allocations and allow internal zones whose objects are larger than a page to use startup_alloc(). This allows allocation of zone objects during early boot on machines with a large number of CPUs since the resulting zone objects are larger than a page.
Submitted by: trema Reviewed by: attilio MFC after: 1 week
|
214564 |
30-Oct-2010 |
alc |
Correct some format strings used by sysctls.
MFC after: 1 week
|
214144 |
21-Oct-2010 |
jhb |
- Make 'vm_refcnt' volatile so that compilers won't be tempted to treat its value as a loop invariant. Currently this is a no-op because 'atomic_cmpset_int()' clobbers all memory on current architectures. - Use atomic_fetchadd_int() instead of an atomic_cmpset_int() loop to drop a reference in vmspace_free().
Reviewed by: alc MFC after: 1 month
|
214095 |
20-Oct-2010 |
avg |
PG_BUSY -> VPO_BUSY, PG_WANTED -> VPO_WANTED in manual pages and comments
Reviewed by: alc MFC after: 4 days
|
214062 |
19-Oct-2010 |
mdf |
uma_zfree(zone, NULL) should do nothing, to match free(9).
Noticed by: Ron Steinke <rsteinke at isilon dot com> MFC after: 3 days
|
213911 |
16-Oct-2010 |
lstewart |
Change uma_zone_set_max to return the effective value of "nitems" after rounding. The same value can also be obtained with uma_zone_get_max, but this change avoids a caller having to make two back-to-back calls.
Sponsored by: FreeBSD Foundation Reviewed by: gnn, jhb
|
213910 |
16-Oct-2010 |
lstewart |
- Simplify implementation of uma_zone_get_max. - Add uma_zone_get_cur which returns the current approximate occupancy of a zone. This is useful for providing stats via sysctl amongst other things.
Sponsored by: FreeBSD Foundation Reviewed by: gnn, jhb MFC after: 2 weeks
|
213408 |
04-Oct-2010 |
alc |
If vm_map_find() is asked to allocate a superpage-aligned region of virtual addresses that is greater than a superpage in size but not a multiple of the superpage size, then vm_map_find() does not always expand the kernel pmap to support the last few small pages being allocated. These failures are not commonplace, so this was first noticed by someone porting FreeBSD to a new architecture. Previously, we grew the kernel page table in vm_map_findspace() when we found the first available virtual address. This works most of the time because we always grow the kernel pmap or page table by an amount that is a multiple of the superpage size. Now, instead, we defer the call to pmap_growkernel() until we are committed to a range of virtual addresses in vm_map_insert(). In general, there is another reason to prefer calling pmap_growkernel() in vm_map_insert(). It makes it possible for someone to do the equivalent of an mmap(MAP_FIXED) on the kernel map.
Reported by: Svatopluk Kraus Reviewed by: kib@ MFC after: 3 weeks
|
212931 |
20-Sep-2010 |
mdf |
Replace an XXX comment with the appropriate code.
Submitted by: alc
|
212873 |
19-Sep-2010 |
alc |
Allow a POSIX shared memory object that is opened for read but not for write to nonetheless be mapped PROT_WRITE and MAP_PRIVATE, i.e., copy-on-write.
(This is a regression in the new implementation of POSIX shared memory objects that is used by HEAD and RELENG_8. This bug does not exist in RELENG_7's user-level, file-based implementation.)
PR: 150260 MFC after: 3 weeks
|
212868 |
19-Sep-2010 |
alc |
Make refinements to r212824. In particular, don't make vm_map_unlock_nodefer() part of the synchronization interface for maps.
Add comments to vm_map_unlock_and_wait() and vm_map_wakeup() describing how they should be used. In particular, describe the deferred deallocations issue with vm_map_unlock_and_wait().
Redo the implementation of vm_map_unlock_and_wait() so that it passes along the caller's file and line information, just like the other map locking primitives.
Reviewed by: kib X-MFC after: r212824
|
212824 |
18-Sep-2010 |
kib |
Adopt the deferring of object deallocation for the deleted map entries on map unlock to the lock downgrade and later read unlock operation.
System map entries cannot be backed by OBJT_VNODE objects, no need to defer deallocation for them. Map entries from user maps do not require the owner map for deallocation, and can be accumulated in the thread-local list for freeing when a user map is unlocked.
Move the collection of entries for deferred reclamation into vm_map_delete(). Create helper vm_map_process_deferred(), that is called from locations where processing is feasible. Do not process deferred entries in vm_map_unlock_and_wait() since map_sleep_mtx is held.
Reviewed by: alc, rstone (previous versions) Tested by: pho MFC after: 2 weeks
|
212750 |
16-Sep-2010 |
mdf |
Re-add r212370 now that the LOR in powerpc64 has been resolved:
Add a drain function for struct sysctl_req, and use it for a variety of handlers, some of which had to do awkward things to get a large enough SBUF_FIXEDLEN buffer.
Note that some sysctl handlers were explicitly outputting a trailing NUL byte. This behaviour was preserved, though it should not be necessary.
Reviewed by: phk (original patch)
|
212572 |
13-Sep-2010 |
mdf |
Revert r212370, as it causes a LOR on powerpc. powerpc does a few unexpected things in copyout(9) and so wiring the user buffer is not sufficient to perform a copyout(9) while holding a random mutex.
Requested by: nwhitehorn
|
212370 |
09-Sep-2010 |
mdf |
Add a drain function for struct sysctl_req, and use it for a variety of handlers, some of which had to do awkward things to get a large enough FIXEDLEN buffer.
Note that some sysctl handlers were explicitly outputting a trailing NUL byte. This behaviour was preserved, though it should not be necessary.
Reviewed by: phk
|
212360 |
09-Sep-2010 |
nwhitehorn |
On architectures with non-tree-based page tables like PowerPC, every page in a range must be checked when calling pmap_remove(). Calling pmap_remove() from vm_pageout_map_deactivate_pages() with the entire range of the map could result in attempting to demap an extraordinary number of pages (> 10^15), so iterate through each map entry and unmap each of them individually.
MFC after: 6 weeks
|
212282 |
07-Sep-2010 |
rstone |
Fix a typo in r212281. uintptr -> uintptr_t
Pointy hat to: rstone
Approved by: emaste (mentor) MFC after: 2 weeks
|
212281 |
07-Sep-2010 |
rstone |
In munmap() downgrade the vm_map_lock to a read lock before taking a read lock on the pmc-sx lock. This prevents a deadlock with pmc_log_process_mappings, which has an exclusive lock on pmc-sx and tries to get a read lock on a vm_map. Downgrading the vm_map_lock in munmap allows pmc_log_process_mappings to continue, preventing the deadlock.
Without this change I could cause a deadlock on a multicore 8.1-RELEASE system by having one thread constantly mmap'ing and then munmap'ing a PROT_EXEC mapping in a loop while I repeatedly invoked and stopped pmcstat in system-wide sampling mode.
Reviewed by: fabient Approved by: emaste (mentor) MFC after: 2 weeks
|
212174 |
03-Sep-2010 |
avg |
vm_page.c: include opt_msgbuf.h for MSGBUF_SIZE use in vm_page_startup
vm_page_startup uses the MSGBUF_SIZE value for adding msgbuf pages to the minidump. If opt_msgbuf.h is not included and MSGBUF_SIZE is overridden in the kernel config, then not all msgbuf pages will be dumped. Most importantly, struct msgbuf itself will not be included. Thus the dump would look corrupted or incomplete to tools like kgdb, dmesg, etc. that try to access struct msgbuf as one of the first things they do when working on a crash dump.
MFC after: 5 days
|
212063 |
31-Aug-2010 |
mdf |
Have memguard(9) crash with an easier-to-debug message on double-free.
Reviewed by: zml MFC after: 3 weeks
|
212058 |
31-Aug-2010 |
mdf |
The realloc case for memguard(9) will copy too many bytes when reallocating to a smaller-sized allocation. Fix this issue.
Noticed by: alc Reviewed by: alc Approved by: zml (mentor) MFC after: 3 weeks
|
211937 |
28-Aug-2010 |
alc |
Add the MAP_PREFAULT_READ option to mmap(2).
Reviewed by: jhb, kib
|
211396 |
16-Aug-2010 |
andre |
Add uma_zone_get_max() to obtain the effective limit after a call to uma_zone_set_max().
The UMA zone limit is not set exactly to the supplied value but rounded up to completely fill the backing store increment (normally a page). This can lead to surprising situations where the number of elements allocated from UMA is higher than the supplied limit value. The new get function reads back the effective value so that the supplied limit can be adjusted to the real limit.
Reviewed by: jeffr MFC after: 1 week
|
211229 |
12-Aug-2010 |
mdf |
Fix compile. It seemed better to have memguard.c include opt_vm.h in case future compile-time knobs were added that it wants to use. Also add include guards and forward declarations to vm/memguard.h.
Approved by: zml (mentor) MFC after: 1 month
|
211194 |
11-Aug-2010 |
mdf |
Rework memguard(9) to reserve significantly more KVA to detect use-after-free over a longer time. Also release the backing pages of a guarded allocation at free(9) time to reduce the overhead of using memguard(9). Allow setting and varying the malloc type at run-time. Add knobs to allow:
- randomly guarding memory - adding un-backed KVA guard pages to detect underflow and overflow - a lower limit on the size of allocations that are guarded
Reviewed by: alc Reviewed by: brueffer, Ulrich Spörlein <uqs spoerlein net> (man page) Silence from: -arch Approved by: zml (mentor) MFC after: 1 month
|
210923 |
06-Aug-2010 |
kib |
Add a new make_dev_p(9) flag, MAKEDEV_ETERNAL, to inform devfs that the created cdev will never be destroyed. Propagate the flag to devfs vnodes as VV_ETERNALDEV. Use the flags to avoid acquiring devmtx and taking a thread reference on such nodes.
In collaboration with: pho MFC after: 1 month
|
210550 |
27-Jul-2010 |
jhb |
Very rough first cut at NUMA support for the physical page allocator. For now it uses a very dumb first-touch allocation policy. This will change in the future. - Each architecture indicates the maximum number of supported memory domains via a new VM_NDOMAIN parameter in <machine/vmparam.h>. - Each cpu now has a PCPU_GET(domain) member to indicate the memory domain a CPU belongs to. Domain values are dense and numbered from 0. - When a platform supports multiple domains, the default freelist (VM_FREELIST_DEFAULT) is split up into N freelists, one for each domain. The MD code is required to populate an array of mem_affinity structures. Each entry in the array defines a range of memory (start and end) and a domain for the range. Multiple entries may be present for a single domain. The list is terminated by an entry where all fields are zero. This array of structures is used to split up phys_avail[] regions that fall in VM_FREELIST_DEFAULT into per-domain freelists. - Each memory domain has a separate lookup-array of freelists that is used when fulfilling a physical memory allocation. Right now the per-domain freelists are listed in a round-robin order for each domain. In the future a table such as the ACPI SLIT table may be used to order the per-domain lookup lists based on the penalty for each memory domain relative to a specific domain. The lookup lists may be examined via a new vm.phys.lookup_lists sysctl. - The first-touch policy is implemented by using PCPU_GET(domain) to pick a lookup list when allocating memory.
Reviewed by: alc
|
210548 |
27-Jul-2010 |
trasz |
Fix commented out resource limit check in mlockall(2). It's still racy, but at least less misleading.
|
210545 |
27-Jul-2010 |
alc |
Introduce exec_alloc_args(). The objective being to encapsulate the details of the string buffer allocation in one place.
Eliminate the portion of the string buffer that was dedicated to storing the interpreter name. The pointer to the interpreter name can simply be made to point to the appropriate argument string.
Reviewed by: kib
|
210475 |
25-Jul-2010 |
alc |
Change the order in which the file name, arguments, environment, and shell command are stored in exec*()'s demand-paged string buffer. For a "buildworld" on an 8GB amd64 multiprocessor, the new order reduces the number of global TLB shootdowns by 31%. It also eliminates about 330k page faults on the kernel address space.
Change exec_shell_imgact() to use "args->begin_argv" consistently as the start of the argument and environment strings. Previously, it would sometimes use "args->buf", which is the start of the overall buffer, but no longer the start of the argument and environment strings. While I'm here, eliminate unnecessary passing of "&length" to copystr(), where we don't actually care about the length of the copied string.
Clean up the initialization of the exec map. In particular, use the correct size for an entry, and express that size in the same way that is used when an entry is allocated. The old size was one page too large. (This discrepancy originated in 2004 when I rewrote exec_map_first_page() to use sf_buf_alloc() instead of the exec map for mapping the first page of the executable.)
Reviewed by: kib
|
210327 |
21-Jul-2010 |
jchandra |
Redo the page table page allocation on MIPS, as suggested by alc@.
The UMA zone based allocation is replaced by a scheme that creates a new free page list for the KSEG0 region, and a new function in sys/vm that allocates pages from a specific free page list.
This also fixes a race condition introduced by the UMA-based page table page allocation code: dropping the page queue and pmap locks before the call to uma_zfree, and re-acquiring them afterwards, introduced a race condition (noted by alc@).
The changes are : - Revert the earlier changes in MIPS pmap.c that added UMA zone for page table pages. - Add a new freelist VM_FREELIST_HIGHMEM to MIPS vmparam.h for memory that is not directly mapped (in 32bit kernel). Normal page allocations will first try the HIGHMEM freelist and then the default(direct mapped) freelist. - Add a new function 'vm_page_t vm_page_alloc_freelist(int flind, int order, int req)' to vm/vm_page.c to allocate a page from a specified freelist. The MIPS page table pages will be allocated using this function from the freelist containing direct mapped pages. - Move the page initialization code from vm_phys_alloc_contig() to a new function vm_page_alloc_init(), and use this function to initialize pages in vm_page_alloc_freelist() too. - Split the function vm_phys_alloc_pages(int pool, int order) to create vm_phys_alloc_freelist_pages(int flind, int pool, int order), and use this function from both vm_page_alloc_freelist() and vm_phys_alloc_pages().
Reviewed by: alc
|
209861 |
09-Jul-2010 |
alc |
Add support for the VM_ALLOC_COUNT() hint to vm_page_alloc(). Consequently, the maintenance of vm_pageout_deficit can be localized to just two places: vm_page_alloc() and vm_pageout_scan().
This change also corrects an off-by-one error in the maintenance of vm_pageout_deficit. Historically, the buffer cache functions, allocbuf() and vm_hold_load_pages(), have not taken into account that vm_page_alloc() already increments vm_pageout_deficit by one.
Reviewed by: kib
|
209792 |
08-Jul-2010 |
kib |
Make VM_ALLOC_RETRY flag mandatory for vm_page_grab(). Assert that the flag is always provided, and unconditionally retry after sleep for the busy page or failed allocation.
The intent is to remove VM_ALLOC_RETRY eventually.
Proposed and reviewed by: alc
|
209713 |
05-Jul-2010 |
kib |
Add the ability for the allocflag argument of the vm_page_grab() to specify the increment of vm_pageout_deficit when sleeping due to page shortage. Then, in allocbuf(), the code to allocate pages when extending vmio buffer can be replaced by a call to vm_page_grab().
Suggested and reviewed by: alc MFC after: 2 weeks
|
209702 |
04-Jul-2010 |
kib |
Several cleanups for r209686: - remove unused defines; - remove the unused curgeneration argument for vm_object_page_collect_flush(); - always assert that vm_object_page_clean() is called for OBJT_VNODE; - move vm_page_find_least() into the for() statement's initial clause.
Submitted by: alc
|
209686 |
04-Jul-2010 |
kib |
Reimplement vm_object_page_clean(), using the fact that vm object memq is ordered by page index. This greatly simplifies the implementation, since we no longer need to mark the pages with VPO_CLEANCHK to denote the progress. It is enough to remember the current position by index before dropping the object lock.
Remove VPO_CLEANCHK and VM_PAGER_IGNORE_CLEANCHK as unused. Garbage-collect vm.msync_flush_flags sysctl.
Suggested and reviewed by: alc Tested by: pho
|
209685 |
04-Jul-2010 |
kib |
Introduce a helper function, vm_page_find_least(). Use it in several places that previously open-coded the same logic.
Reviewed by: alc Tested by: pho MFC after: 1 week
|
209669 |
03-Jul-2010 |
alc |
Improve the comment and man page for vm_page_alloc(). Specifically, document one of the optional flags; clarify which of the flags are optional (and which are not), and remove mention of a restriction on the reclamation of cached pages that no longer holds since version 7.
MFC after: 1 week
|
209651 |
02-Jul-2010 |
alc |
Push down the acquisition of the page queues lock into vm_pageout_page_stats(). In particular, avoid acquiring the page queues lock unless iterating over the active queue.
|
209650 |
02-Jul-2010 |
alc |
Use vm_page_prev() instead of vm_page_lookup() in the implementation of vm_fault()'s automatic delete-behind heuristic. vm_page_prev() is typically faster.
|
209647 |
02-Jul-2010 |
alc |
With the demise of page coloring, the page queue macros no longer serve any useful purpose. Eliminate them.
Reviewed by: kib
|
209610 |
30-Jun-2010 |
alc |
Simplify entry to vm_pageout_clean(). Expect the page to be locked. Previously, the caller unlocked the page, and vm_pageout_clean() immediately reacquired the page lock. Also, assert rather than test that the page is neither busy nor held. Since vm_pageout_clean() is called with the object and page locked, the page can't have changed state since the caller verified that the page is neither busy nor held.
|
209407 |
21-Jun-2010 |
alc |
Introduce vm_page_next() and vm_page_prev(), and use them in vm_pageout_clean(). When iterating over a range of pages, these functions can be cheaper than vm_page_lookup() because their implementation takes advantage of the vm_object's memq being ordered.
Reviewed by: kib@ MFC after: 3 weeks
|
209215 |
15-Jun-2010 |
sbruno |
Add a new column to the output of vmstat -z to indicate the number of times the system was forced to sleep when requesting a new allocation.
Expand the debugger hook, db_show_uma, to display these results as well.
This has proven to be very useful in out of memory situations when it is not known why systems have become sluggish or fail in odd ways.
Reviewed by: rwatson alc Approved by: scottl (mentor) peter Obtained from: Yahoo Inc.
|
209173 |
14-Jun-2010 |
alc |
Eliminate checks for a page having a NULL object in vm_pageout_scan() and vm_pageout_page_stats(). These checks were recently introduced by the first page locking commit, r207410, but they are not needed. At the same time, eliminate some redundant accesses to the page's object field. (These accesses should have been eliminated by r207410.)
Make the assertion in vm_page_flag_set() stricter. Specifically, only managed pages should have PG_WRITEABLE set.
Add a comment documenting an assertion to vm_page_flag_clear().
It has long been the case that fictitious pages have their wire count permanently set to one. Add comments to vm_page_wire() and vm_page_unwire() documenting this. Add assertions to these functions as well.
Update the comment describing vm_page_unwire(). Much of the old comment had little to do with vm_page_unwire(), but a lot to do with _vm_page_deactivate(). Move relevant parts of the old comment to _vm_page_deactivate().
Only pages that belong to an object can be paged out. Therefore, it is pointless for vm_page_unwire() to acquire the page queues lock and enqueue such pages in one of the paging queues. Generally speaking, such pages are immediately freed after the call to vm_page_unwire(). Previously, it was the call to vm_page_free() that reacquired the page queues lock and removed these pages from the paging queues. Now, we will never acquire the page queues lock for this case. (It is also worth noting that since both vm_page_unwire() and vm_page_free() occurred with the page locked, the page daemon never saw the page with its object field set to NULL.)
Change the panic with vm_page_unwire() to provide a more precise message.
Reviewed by: kib@
|
209059 |
11-Jun-2010 |
jhb |
Update several places that iterate over CPUs to use CPU_FOREACH().
|
208990 |
10-Jun-2010 |
alc |
Reduce the scope of the page queues lock and the number of PG_REFERENCED changes in vm_pageout_object_deactivate_pages(). Simplify this function's inner loop using TAILQ_FOREACH(), and shorten some of its overly long lines. Update a stale comment.
Assert that PG_REFERENCED may be cleared only if the object containing the page is locked. Add a comment documenting this.
Assert that a caller to vm_page_requeue() holds the page queues lock, and assert that the page is on a page queue.
Push down the page queues lock into pmap_ts_referenced() and pmap_page_exists_quick(). (As of now, there are no longer any pmap functions that expect to be called with the page queues lock held.)
Neither pmap_ts_referenced() nor pmap_page_exists_quick() should ever be passed an unmanaged page. Assert this rather than returning "0" and "FALSE" respectively.
ARM:
Simplify pmap_page_exists_quick() by switching to TAILQ_FOREACH().
Push down the page queues lock inside of pmap_clearbit(), simplifying pmap_clear_modify(), pmap_clear_reference(), and pmap_remove_write(). Additionally, this allows for avoiding the acquisition of the page queues lock in some cases.
PowerPC/AIM:
moea*_page_exists_quick() and moea*_page_wired_mappings() will never be called before pmap initialization is complete. Therefore, the check for moea_initialized can be eliminated.
Push down the page queues lock inside of moea*_clear_bit(), simplifying moea*_clear_modify() and moea*_clear_reference().
The last parameter to moea*_clear_bit() is never used. Eliminate it.
PowerPC/BookE:
Simplify mmu_booke_page_exists_quick()'s control flow.
Reviewed by: kib@
|
208794 |
04-Jun-2010 |
jchandra |
Make vm_contig_grow_cache() extern, and use it when vm_phys_alloc_contig() fails to allocate MIPS page table pages. The current usage of VM_WAIT in case of vm_phys_alloc_contig() failure is not correct, because:
"There is no guarantee that any of the available free (or cached) pages after the VM_WAIT will fall within the range of suitable physical addresses. Every time this function sleeps and a single page is freed (or cached) by someone else, this function will be reawakened. With a little bad luck, you could spin indefinitely."
We also add low and high parameters to vm_contig_grow_cache() and vm_contig_launder() so that we restrict vm_contig_launder() to the range of pages we are interested in.
Reported by: alc
Reviewed by: alc Approved by: rrs (mentor)
|
208791 |
03-Jun-2010 |
kib |
Do not leak the vm page lock in vm_contig_launder(); vm_pageout_page_lock() always returns with the page locked.
Submitted by: alc Pointy hat to: kib
|
208772 |
03-Jun-2010 |
kib |
Add assertion and comment in vm_page_flag_set() describing the expectations when the PG_WRITEABLE flag is set.
Reviewed by: alc
|
208764 |
03-Jun-2010 |
alc |
Maintain the pretense that we support 32KB pages for the sake of the ia64 LINT build.
|
208745 |
02-Jun-2010 |
alc |
Minimize the use of the page queues lock for synchronizing access to the page's dirty field. With the exception of one case, access to this field is now synchronized by the object lock.
|
208645 |
29-May-2010 |
alc |
When I pushed down the page queues lock into pmap_is_modified(), I created an ordering dependence: A pmap operation that clears PG_WRITEABLE and calls vm_page_dirty() must perform the call first. Otherwise, pmap_is_modified() could return FALSE without acquiring the page queues lock because the page is not (currently) writeable, and the caller to pmap_is_modified() might believe that the page's dirty field is clear because it has not seen the effect of the vm_page_dirty() call.
When I pushed down the page queues lock into pmap_is_modified(), I overlooked one place where this ordering dependence is violated: pmap_enter(). In a rare situation pmap_enter() can be called to replace a dirty mapping to one page with a mapping to another page. (I say rare because replacements generally occur as a result of a copy-on-write fault, and so the old page is not dirty.) This change delays clearing PG_WRITEABLE until after vm_page_dirty() has been called.
Fixing the ordering dependency also makes it easy to introduce a small optimization: When pmap_enter() used to replace a mapping to one page with a mapping to another page, it freed the pv entry for the first mapping and later called the pv entry allocator for the new mapping. Now, pmap_enter() attempts to recycle the old pv entry, saving two calls to the pv entry allocator.
There is no point in setting PG_WRITEABLE on unmanaged pages, so don't. Update a comment to reflect this.
Tidy up the variable declarations at the start of pmap_enter().
|
208574 |
26-May-2010 |
alc |
Push down page queues lock acquisition in pmap_enter_object() and pmap_is_referenced(). Eliminate the corresponding page queues lock acquisitions from vm_map_pmap_enter() and mincore(), respectively. In mincore(), this allows some additional cases to complete without ever acquiring the page queues lock.
Assert that the page is managed in pmap_is_referenced().
On powerpc/aim, push down the page queues lock acquisition from moea*_is_modified() and moea*_is_referenced() into moea*_query_bit(). Again, this will allow some additional cases to complete without ever acquiring the page queues lock.
Reorder a few statements in vm_page_dontneed() so that a race can't lead to an old reference persisting. This scenario is described in detail by a comment.
Correct a spelling error in vm_page_dontneed().
Assert that the object is locked in vm_page_clear_dirty(), and restrict the page queues lock assertion to just those cases in which the page is currently writeable.
Add object locking to vnode_pager_generic_putpages(). This was the one and only place where vm_page_clear_dirty() was being called without the object being locked.
Eliminate an unnecessary vm_page_lock() around vnode_pager_setsize()'s call to vm_page_clear_dirty().
Change vnode_pager_generic_putpages() to the modern-style of function definition. Also, change the name of one of the parameters to follow virtual memory system naming conventions.
Reviewed by: kib
|
208524 |
25-May-2010 |
alc |
Eliminate the acquisition and release of the page queues lock from vfs_busy_pages(). It is no longer needed.
Submitted by: kib
|
208504 |
24-May-2010 |
alc |
Roughly half of a typical pmap_mincore() implementation is machine- independent code. Move this code into mincore(), and eliminate the page queues lock from pmap_mincore().
Push down the page queues lock into pmap_clear_modify(), pmap_clear_reference(), and pmap_is_modified(). Assert that these functions are never passed an unmanaged page.
Eliminate an inaccurate comment from powerpc/powerpc/mmu_if.m: Contrary to what the comment says, pmap_mincore() is not simply an optimization. Without a complete pmap_mincore() implementation, mincore() cannot return either MINCORE_MODIFIED or MINCORE_REFERENCED because only the pmap can provide this information.
Eliminate the page queues lock from vfs_setdirty_locked_object(), vm_pageout_clean(), vm_object_page_collect_flush(), and vm_object_page_clean(). Generally speaking, these are all accesses to the page's dirty field, which are synchronized by the containing vm object's lock.
Reduce the scope of the page queues lock in vm_object_madvise() and vm_page_dontneed().
Reviewed by: kib (an earlier version)
|
208340 |
20-May-2010 |
kib |
When waiting for the busy page, do not unlock the object unless unlock cannot be avoided.
Reviewed by: alc MFC after: 1 week
|
208264 |
18-May-2010 |
alc |
The page queues lock is no longer required by vm_page_set_invalid(), so eliminate it.
Assert that the object containing the page is locked in vm_page_test_dirty(). Perform some style clean up while I'm here.
Reviewed by: kib
|
208175 |
16-May-2010 |
alc |
On entry to pmap_enter(), assert that the page is busy. While I'm here, make the style of assertion used by pmap_enter() consistent across all architectures.
On entry to pmap_remove_write(), assert that the page is neither unmanaged nor fictitious, since we cannot remove write access to either kind of page.
With the push down of the page queues lock, pmap_remove_write() cannot condition its behavior on the state of the PG_WRITEABLE flag if the page is busy. Assert that the object containing the page is locked. This allows us to know that the page will neither become busy nor will PG_WRITEABLE be set on it while pmap_remove_write() is running.
Correct a long-standing bug in vm_page_cowsetup(). We cannot possibly do copy-on-write-based zero-copy transmit on unmanaged or fictitious pages, so don't even try. Previously, the call to pmap_remove_write() would have failed silently.
|
208164 |
16-May-2010 |
alc |
Correct an error of omission in r202897: Now that amd64 uses the direct map to access the message buffer, we must explicitly request that the underlying physical pages are included in a crash dump.
Reported by: Benjamin Kaduk
|
208159 |
16-May-2010 |
alc |
Add a comment about the proper use of vm_object_page_remove().
MFC after: 1 week
|
207905 |
11-May-2010 |
alc |
Update synchronization annotations for struct vm_page. Add a comment explaining how the setting of PG_WRITEABLE is synchronized.
|
207846 |
10-May-2010 |
kib |
Continue cleaning the queue instead of moving to the next queue or bailing out if acquiring the page lock caused the page's position in the queue to change.
Pointed out by: alc
|
207823 |
09-May-2010 |
alc |
Push down the acquisition of the page queues lock into vm_pageq_remove(). (This eliminates a surprising number of page queues lock acquisitions by vm_fault() because the page's queue is PQ_NONE and thus the page queues lock is not needed to remove the page from a queue.)
|
207822 |
09-May-2010 |
alc |
Call vm_page_deactivate() rather than vm_page_dontneed() in swp_pager_force_pagein(). By dirtying the page, swp_pager_force_pagein() forces vm_page_dontneed() to insert the page at the head of the inactive queue, just like vm_page_deactivate() does. Moreover, because the page was invalid, it can't have been mapped, and thus the other effect of vm_page_dontneed(), clearing the page's reference bits, has no effect. In summary, there is no reason to call vm_page_dontneed() since its effect will be identical to calling the simpler vm_page_deactivate().
|
207806 |
09-May-2010 |
alc |
Remove the page queues lock around a call to vm_page_activate(). Make the page dirty before adding it to the active queue.
|
207798 |
08-May-2010 |
alc |
Minimize the scope of the page queues lock in vm_fault().
|
207796 |
08-May-2010 |
alc |
Push down the page queues lock into vm_page_cache(), vm_page_try_to_cache(), and vm_page_try_to_free(). Consequently, push down the page queues lock into pmap_enter_quick(), pmap_page_wired_mappings(), pmap_remove_all(), and pmap_remove_write().
Push down the page queues lock into Xen's pmap_page_is_mapped(). (I overlooked the Xen pmap in r207702.)
Switch to a per-processor counter for the total number of pages cached.
|
207759 |
07-May-2010 |
jkim |
Fix a typo in the previous commit.
|
207752 |
07-May-2010 |
kib |
One more use for vm_pageout_init_marker().
Reviewed by: alc
|
207747 |
07-May-2010 |
alc |
Eliminate unnecessary page queues locking.
|
207746 |
07-May-2010 |
alc |
Push down the page queues lock into vm_page_activate().
|
207740 |
07-May-2010 |
alc |
Update the synchronization requirements for the page usage count.
|
207739 |
07-May-2010 |
alc |
Eliminate acquisitions of the page queues lock that are no longer needed.
Switch to a per-processor counter for the number of pages freed during process termination.
|
207738 |
07-May-2010 |
alc |
Push down the page queues lock into vm_page_deactivate(). Eliminate an incorrect comment.
|
207728 |
06-May-2010 |
alc |
Eliminate page queues locking around most calls to vm_page_free().
|
207706 |
06-May-2010 |
alc |
Update a comment to say that access to a page's wire count is now synchronized by the page lock.
|
207702 |
06-May-2010 |
alc |
Push down the page queues lock inside of vm_page_free_toq() and pmap_page_is_mapped() in preparation for removing page queues locking around calls to vm_page_free(). Setting aside the assertion that calls pmap_page_is_mapped(), vm_page_free_toq() now acquires and holds the page queues lock just long enough to actually add or remove the page from the paging queues.
Update vm_page_unhold() to reflect the above change.
|
207694 |
06-May-2010 |
kib |
Add a helper function vm_pageout_page_lock(), similar to tegge's vm_pageout_fallback_object_lock(), to obtain the page lock while holding the page queue lock, and still maintain the page's position in a queue.
Use the helper to lock the page in the pageout daemon and contig launder iterators instead of skipping the page if its lock is contested. Skipping locked pages easily causes the pagedaemon or launder to make no progress with page cleaning.
Proposed and reviewed by: alc
|
207669 |
05-May-2010 |
alc |
Acquire the page lock around all remaining calls to vm_page_free() on managed pages that didn't already have that lock held. (Freeing an unmanaged page, such as the various pmaps use, doesn't require the page lock.)
This allows a change in vm_page_remove()'s locking requirements. It now expects the page lock to be held instead of the page queues lock. Consequently, the page queues lock is no longer required at all by callers to vm_page_rename().
Discussed with: kib
|
207644 |
05-May-2010 |
alc |
Push down the acquisition of the page queues lock into vm_page_unwire().
Update the comment describing which lock should be held on entry to vm_page_wire().
Reviewed by: kib
|
207617 |
04-May-2010 |
alc |
Add page locking to the vm_page_cow* functions.
Push down the acquisition and release of the page queues lock into vm_page_wire().
Reviewed by: kib
|
207601 |
04-May-2010 |
alc |
Add lock assertions.
|
207580 |
03-May-2010 |
kib |
Handle the busy status of the page in the way expected for pager_getpage(): flush the requested page, unbusy the other pages, and do not clear m->busy.
Reviewed by: alc MFC after: 1 week
|
207577 |
03-May-2010 |
alc |
Acquire the page lock around vm_page_wire() in vm_page_grab().
Assert that the page lock is held in vm_page_wire().
|
207576 |
03-May-2010 |
alc |
It makes more sense for the object-based backend allocator to use OBJT_PHYS objects instead of OBJT_DEFAULT objects because we never reclaim or pageout the allocated pages. Moreover, they are mapped with pmap_qenter(), which creates unmanaged mappings.
Reviewed by: kib
|
207552 |
03-May-2010 |
alc |
The pages allocated by kmem_alloc_attr() and kmem_malloc() are unmanaged. Consequently, neither the page lock nor the page queues lock is needed to unwire and free them.
|
207551 |
03-May-2010 |
alc |
Assert that the page queues lock is held in vm_page_remove() and vm_page_unwire() only if the page is managed, i.e., pageable.
|
207544 |
02-May-2010 |
alc |
Add page lock assertions where we access the page's hold_count.
|
207541 |
02-May-2010 |
alc |
Eliminate an assignment that was made redundant by r207410.
|
207540 |
02-May-2010 |
alc |
Defer the acquisition of the page and page queues locks in vm_pageout_object_deactivate_pages().
|
207539 |
02-May-2010 |
alc |
Simplify vm_fault(). The introduction of the new page lock renders a bit of cleverness by vm_fault() to avoid repeatedly releasing and reacquiring the page queues lock pointless.
Reviewed by: kib, kmacy
|
207531 |
02-May-2010 |
alc |
Correct an error in r207410: Remove an unlock of a lock that is no longer held.
|
207530 |
02-May-2010 |
alc |
It makes no sense for vm_page_sleep_if_busy()'s helper, vm_page_sleep(), to unconditionally set PG_REFERENCED on a page before sleeping. In many cases, it's perfectly ok for the page to disappear, i.e., be reclaimed by the page daemon, before the caller to vm_page_sleep() is reawakened. Instead, we now explicitly set PG_REFERENCED in those cases where having the page persist until the caller is awakened is clearly desirable. Note, however, that setting PG_REFERENCED on the page is still only a hint, and not a guarantee that the page should persist.
|
207519 |
02-May-2010 |
alc |
This change addresses the race condition that was introduced by the previous revision, r207450, to this file. Specifically, between dropping the page queues lock in vm_contig_launder() and reacquiring it in vm_contig_launder_page(), the page may be removed from the active or inactive queue. It could be wired, freed, cached, etc., none of which vm_contig_launder_page() is prepared for.
Reviewed by: kib, kmacy
|
207487 |
02-May-2010 |
alc |
Correct an error of omission in r206819. If VMFS_TLB_ALIGNED_SPACE is specified to vm_map_find(), then retry the vm_map_findspace() if vm_map_insert() fails because the aligned space is already partly used.
Reported by: Neel Natu
|
207460 |
01-May-2010 |
kmacy |
Update locking comment above vm_page:
- re-assign page queue lock "Q"
- assign page lock "P"
- update several uncommented fields
- observe that hold_count is now protected by the page lock "P"
|
207452 |
30-Apr-2010 |
kmacy |
push up dropping of the page queue lock to avoid holding it in vm_pageout_flush
|
207451 |
30-Apr-2010 |
kmacy |
don't call vm_pageout_flush with the page queue mutex held
Reported by: Michael Butler
|
207450 |
30-Apr-2010 |
kmacy |
- acquire the page lock in vm_contig_launder_page before checking page fields
- release page queue lock before calling vm_pageout_flush
|
207448 |
30-Apr-2010 |
kmacy |
- don't check hold_count without the page lock held
- don't leak the page lock if m->object is NULL (assuming that that check will in fact even be valid when m->object is protected by the page lock)
|
207438 |
30-Apr-2010 |
kib |
Unlock page lock instead of recursively locking it.
|
207412 |
30-Apr-2010 |
kmacy |
don't allow unsynchronized free in vm_page_unhold
|
207410 |
30-Apr-2010 |
kmacy |
On Alan's advice, rather than do a wholesale conversion on a single architecture from page queue lock to a hashed array of page locks (based on a patch by Jeff Roberson), I've implemented page lock support in the MI code and have only moved vm_page's hold_count out from under page queue mutex to page lock. This changes pmap_extract_and_hold on all pmaps.
Supported by: Bitgravity Inc.
Discussed with: alc, jeffr, and kib
|
207374 |
29-Apr-2010 |
alc |
Simplify the inner loop of vm_pageout_object_deactivate_pages(). Rather than checking each page for PG_UNMANAGED, check the vm object's type. Only OBJT_PHYS can have unmanaged pages. Eliminate a pointless counter. The vm object is locked, that lock is never released by the inner loop, and the set of pages contained by the vm object is not changed by the inner loop. Therefore, the counter serves no purpose.
|
207365 |
29-Apr-2010 |
kib |
When doing kstack swapin, read as much pages in one run as possible.
Suggested and reviewed by: alc (previous version) Tested by: pho MFC after: 2 weeks
|
207364 |
29-Apr-2010 |
kib |
In swap pager, do not free the non-requested pages from the run if they are wired. Kstack pages are wired, this change prepares swap pager for handling of long runs of kstack pages.
Noted and reviewed by: alc Tested by: pho MFC after: 2 weeks
|
207308 |
28-Apr-2010 |
alc |
Setting PG_REFERENCED on a page at the end of vm_fault() is redundant since the page table entry's accessed bit is either preset by the immediately preceding call to pmap_enter() or by hardware (or software) upon return from vm_fault() when the faulting access is restarted.
|
207306 |
28-Apr-2010 |
alc |
Change vm_object_madvise() so that it checks whether the page is invalid or unmanaged before acquiring the page queues lock. Neither of these tests require that lock. Moreover, a better way of testing if the page is unmanaged is to test the type of vm object. This avoids a pointless vm_page_lookup().
MFC after: 3 weeks
|
207155 |
24-Apr-2010 |
alc |
Resurrect pmap_is_referenced() and use it in mincore(). Essentially, pmap_ts_referenced() is not always appropriate for checking whether or not pages have been referenced because it clears any reference bits that it encounters. For example, in mincore(), clearing the reference bits has two negative consequences. First, it throws off the activity count calculations performed by the page daemon. Specifically, a page on which mincore() has called pmap_ts_referenced() looks less active to the page daemon than it should. Consequently, the page could be deactivated prematurely by the page daemon. Arguably, this problem could be fixed by having mincore() duplicate the activity count calculation on the page. However, there is a second problem for which that is not a solution. In order to clear a reference on a 4KB page, it may be necessary to demote a 2/4MB page mapping. Thus, a mincore() by one process can have the side effect of demoting a superpage mapping within another process!
|
206885 |
20-Apr-2010 |
alc |
Eliminate an unnecessary call to pmap_remove_all(). If a page belongs to an object whose reference count is zero, then that page cannot possibly be mapped.
|
206823 |
19-Apr-2010 |
alc |
vm_thread_swapout() can safely dirty the page before rather than after acquiring the page queues lock.
|
206819 |
18-Apr-2010 |
jmallett |
o) Add a VM find-space option, VMFS_TLB_ALIGNED_SPACE, which searches the address space for an address as aligned by the new pmap_align_tlb() function, which is for constraints imposed by the TLB. [1]
o) Add a kmem_alloc_nofault_space() function, which acts like kmem_alloc_nofault() but allows the caller to specify which find-space option to use. [1]
o) Use kmem_alloc_nofault_space() with VMFS_TLB_ALIGNED_SPACE to allocate the kernel stack address on MIPS. [1]
o) Make pmap_align_tlb() on MIPS align addresses so that they do not start on an odd boundary within the TLB, so that they are suitable for insertion as wired entries and do not have to share a TLB entry with another mapping, assuming they are appropriately-sized.
o) Eliminate md_realstack now that the kstack will be appropriately-aligned on MIPS.
o) Increase the number of guard pages to 2 so that we retain the proper alignment of the kstack address.
Reviewed by: [1] alc X-MFC-after: Making sure alc has not come up with a better interface.
|
206814 |
18-Apr-2010 |
alc |
Remove a nonsensical test from vm_pageout_clean(). A page can't be in the inactive queue and have a non-zero wire count.
Reviewed by: kib MFC after: 3 weeks
|
206801 |
18-Apr-2010 |
alc |
There is no justification for vm_object_split() setting PG_REFERENCED on a page that it is going to sleep on. Eliminate it.
MFC after: 3 weeks
|
206770 |
17-Apr-2010 |
alc |
In vm_object_madvise() setting PG_REFERENCED on a page before sleeping on that page only makes sense if the advice is MADV_WILLNEED. In that case, the intention is to activate the page, so discouraging the page daemon from reclaiming the page makes sense. In contrast, in the other cases, MADV_DONTNEED and MADV_FREE, it makes no sense whatsoever to discourage the page daemon from reclaiming the page by setting PG_REFERENCED.
Wrap a nearby line.
Discussed with: kib MFC after: 3 weeks
|
206768 |
17-Apr-2010 |
alc |
In vm_object_backing_scan(), setting PG_REFERENCED on a page before sleeping on that page is nonsensical. Doing so reduces the likelihood that the page daemon will reclaim the page before the thread waiting in vm_object_backing_scan() is reawakened. However, it does not guarantee that the page is not reclaimed, so vm_object_backing_scan() restarts after reawakening. More importantly, this muddles the meaning of PG_REFERENCED. There is no reason to believe that the caller of vm_object_backing_scan() is going to use (i.e., access) the contents of the page. There is especially no reason to believe that an access is more likely because vm_object_backing_scan() had to sleep on the page.
Discussed with: kib MFC after: 3 weeks
|
206761 |
17-Apr-2010 |
alc |
Setting PG_REFERENCED on the requested page in swap_pager_getpages() is either redundant or harmful, depending on the caller. For example, when called by vm_fault(), it is redundant. However, when called by vm_thread_swapin(), it is harmful. Specifically, if the thread is later swapped out, having PG_REFERENCED set on its stack pages leads the page daemon to reactivate these stack pages and delay their reclamation.
Reviewed by: kib MFC after: 3 weeks
|
206545 |
13-Apr-2010 |
alc |
Simplify vm_thread_swapin().
|
206483 |
11-Apr-2010 |
alc |
Initialize the virtual memory-related resource limits in a single place. Previously, one of these limits was initialized in two places to a different value in each place. Moreover, because an unsigned int was used to represent the amount of pageable physical memory, some of these limits were incorrectly initialized on 64-bit architectures. (Currently, this error is masked by login.conf's default settings.)
Make vm_thread_swapin() and vm_thread_swapout() static.
Submitted by: bde (an earlier version) Reviewed by: kib
|
206409 |
09-Apr-2010 |
alc |
Introduce the function kmem_alloc_attr(), which allocates kernel virtual memory with the specified physical attributes. In particular, like kmem_alloc_contig(), the caller can specify the physical address range from which the physical pages are allocated and the memory attributes (i.e., cache behavior) for these physical pages. However, in contrast to kmem_alloc_contig() or contigmalloc(), the physical pages that are allocated by kmem_alloc_attr() are not necessarily physically contiguous. This function is needed by DRM and VirtualBox.
Correct an error in the prototype for kmem_malloc(). The third argument had the wrong type.
Tested by: rnoland MFC after: 3 days
|
206360 |
07-Apr-2010 |
joel |
Start copyright notice with /*-
|
206264 |
06-Apr-2010 |
kib |
When OOM searches for a process to kill, ignore the processes already killed by OOM. When a killed process waits for a page allocation, try to satisfy the request as fast as possible.
This removes the often-encountered deadlock, where OOM continuously selects the same victim process, which sleeps uninterruptibly waiting for a page. The killed process may still sleep if the page cannot be obtained immediately, but testing has shown that the system has a much higher chance to survive an OOM situation with the patch.
In collaboration with: pho Reviewed by: alc MFC after: 4 weeks
|
206174 |
05-Apr-2010 |
alc |
vm_reserv_alloc_page() should never be called on an OBJT_SG object, just as it is never called on an OBJT_DEVICE object. (This change should have been included in r195840.)
Reported by: dougb@, avg@ MFC after: 3 days
|
206142 |
03-Apr-2010 |
alc |
Make _vm_map_init() the one place where the vm map's pmap field is initialized.
Reviewed by: kib
|
206140 |
03-Apr-2010 |
alc |
Re-enable the call to pmap_release() by vmspace_dofree(). The accounting problem that is described in the comment has been addressed.
Submitted by: kib Tested by: pho (a few months ago) MFC after: 6 weeks
|
205536 |
23-Mar-2010 |
jhb |
Reject attempts to create a MAP_ANON mapping with a non-zero offset.
PR: kern/71258 Submitted by: Alexander Best MFC after: 2 weeks
|
205487 |
22-Mar-2010 |
kmacy |
- enable alignment on amd64 only
- only align pcpu caches and the volatile portion of uma_zone
|
205298 |
18-Mar-2010 |
kmacy |
turn r205266 into a no-op until the problem can be properly diagnosed
|
205266 |
17-Mar-2010 |
kmacy |
Cache line align various structures and move volatile counters to not share a cache line with (mostly) immutable state
Reviewed by: jeff@ MFC after: 7 days
|
204415 |
27-Feb-2010 |
kib |
Update the comment for vm_page_alloc(9), listing all acceptable flags [1]. Note that the function does not sleep, but it can block.
Submitted by: Giovanni Trematerra <giovanni.trematerra gmail com> [1] MFC after: 3 days
|
204205 |
22-Feb-2010 |
kib |
Remove write-only variable.
MFC after: 3 days
|
204181 |
21-Feb-2010 |
alc |
Align the start of the clean submap to a superpage boundary. Although no superpage mappings are created within the clean submap, aligning the start of the clean submap helps to prevent interference with kmem_alloc()'s use of superpages.
|
203175 |
29-Jan-2010 |
kib |
The MAP_ENTRY_NEEDS_COPY flag belongs to protoeflags; the cow variable uses a different namespace.
Reported by: Jonathan Anderson <jonathan.anderson cl cam ac uk> MFC after: 3 days
|
202529 |
17-Jan-2010 |
kib |
When a vnode-backed vm object is referenced, it increments the vnode reference count, and decrements it on dereference. If the referenced object is deallocated, the object type is reset to OBJT_DEAD; consequently, all vnode references owned by object references are never released. vunref() the vnode an appropriate number of times in the vm object deallocation code for OBJT_VNODE to prevent the leak.
Add an assertion to vm_pageout() to make sure that we never take a reference on the vnode and then fail to execute the code that releases it.
In collaboration with: pho Reviewed by: alc MFC after: 3 weeks
|
201223 |
29-Dec-2009 |
rnoland |
Update d_mmap() to accept vm_ooffset_t and vm_memattr_t.
This replaces d_mmap() with the d_mmap2() implementation and also changes the type of offset to vm_ooffset_t.
Purge d_mmap2().
All driver modules will need to be rebuilt since D_VERSION is also bumped.
Reviewed by: jhb@ MFC after: Not in this lifetime...
|
201145 |
28-Dec-2009 |
antoine |
(S)LIST_HEAD_INITIALIZER takes a (S)LIST_HEAD as an argument. Fix some wrong usages. Note: this does not affect generated binaries as this argument is not used.
PR: 137213 Submitted by: Eygene Ryabinkin (initial version) MFC after: 1 month
|
200770 |
21-Dec-2009 |
kib |
The VI_OBJDIRTY vnode flag mirrors the state of the OBJ_MIGHTBEDIRTY vm object flag. Besides providing redundant information, the need to update both vnode and object flags causes additional acquisitions of the vnode interlock. OBJ_MIGHTBEDIRTY is only checked for vnode-backed vm objects.
Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for vnode-backed vm objects.
Suggested and reviewed by: alc Tested by: pho MFC after: 3 weeks
|
200129 |
05-Dec-2009 |
antoine |
Remove trailing ";" in UMA_HASH_INSERT and UMA_HASH_REMOVE macros.
MFC after: 1 month
|
199870 |
28-Nov-2009 |
alc |
Properly synchronize the previous change.
|
199869 |
27-Nov-2009 |
alc |
Support the new VM_PROT_COPY option on wired pages. The effect of which is that a debugger can now set a breakpoint in a program that uses mlock(2) on its text segment or mlockall(2) on its entire address space.
|
199868 |
27-Nov-2009 |
alc |
Simplify the invocation of vm_fault(). Specifically, eliminate the flag VM_FAULT_DIRTY. The information provided by this flag can be trivially inferred by vm_fault().
Discussed with: kib
|
199819 |
26-Nov-2009 |
alc |
Replace VM_PROT_OVERRIDE_WRITE by VM_PROT_COPY. VM_PROT_OVERRIDE_WRITE has represented a write access that is allowed to override write protection. Until now, VM_PROT_OVERRIDE_WRITE has been used to write breakpoints into text pages. Text pages are not just write protected but they are also copy-on-write. VM_PROT_OVERRIDE_WRITE overrides the write protection on the text page and triggers the replication of the page so that the breakpoint will be written to a private copy. However, here is where things become confused. It is the debugger, not the process being debugged that requires write access to the copied page. Nonetheless, the copied page is being mapped into the process with write access enabled. In other words, once the debugger sets a breakpoint within a text page, the program can write to its private copy of that text page. Whereas prior to setting the breakpoint, a SIGSEGV would have occurred upon a write access. VM_PROT_COPY addresses this problem. The combination of VM_PROT_READ and VM_PROT_COPY forces the replication of a copy-on-write page even though the access is only for read. Moreover, the replicated page is only mapped into the process with read access, and not write access.
Reviewed by: kib MFC after: 4 weeks
|
199490 |
18-Nov-2009 |
alc |
Simplify both the invocation and the implementation of vm_fault() for wiring pages.
(Note: Claims made in the comments about the handling of breakpoints in wired pages have been false for roughly a decade. This and another bug involving breakpoints will be fixed in coming changes.)
Reviewed by: kib
|
198870 |
04-Nov-2009 |
alc |
Eliminate an unnecessary #include. (This #include should have been removed in r188331 when vnode_pager_lock() was eliminated.)
|
198855 |
03-Nov-2009 |
alc |
Eliminate a bit of hackery from vm_fault(). The operations that this hackery sought to prevent are now properly supported by vm_map_protect(). (See r198505.)
Reviewed by: kib
|
198854 |
03-Nov-2009 |
attilio |
Split P_NOLOAD into a per-thread flag (TDF_NOLOAD). This improvement aims to avoid further cache misses in scheduler-specific functions that need to keep track of average thread running time, and further locking in places that set this flag.
Reported by: jeff (originally), kris (currently) Reviewed by: jhb Tested by: Giuseppe Cocomazzi <sbudella at email dot it>
|
198812 |
02-Nov-2009 |
alc |
Avoid pointless calls to pmap_protect().
Reviewed by: kib
|
198811 |
02-Nov-2009 |
ivoras |
Add sysctl documentation strings. The descriptions are derived from tuning(7). One of the descriptions references tuning(7) because it is too complex to adequately describe here (it is not a simple boolean sysctl), and users should be pointed to it.
Reviewed by: alc, kib Approved by: gnn (mentor)
|
198721 |
31-Oct-2009 |
alc |
Correct an error in vm_fault_copy_entry() that has existed since the first version of this file. When a process forks, any wired pages are immediately copied because copy-on-write is not supported for wired pages. In other words, the child process is given its own private copy of each wired page from its parent's address space. Unfortunately, to date, these copied pages have been mapped into the child's address space with the wrong permissions, typically VM_PROT_ALL. This change corrects the permissions.
Reviewed by: kib
|
198505 |
27-Oct-2009 |
kib |
When the protection of a wired read-only mapping is changed to read-write, install a new shadow object behind the map entry and copy the pages from the underlying objects into it. This makes the mprotect(2) call actually perform the requested operation instead of silently doing nothing and returning success, which caused SIGSEGV on later write access to the mapping.
Reuse vm_fault_copy_entry() to do the copying, modifying it to behave correctly when src_entry == dst_entry.
Reviewed by: alc MFC after: 3 weeks
|
198476 |
26-Oct-2009 |
alc |
Simplify the inner loop of vm_fault_copy_entry().
Reviewed by: kib
|
198472 |
25-Oct-2009 |
alc |
Eliminate an unnecessary check from vm_fault_prefault().
|
198341 |
21-Oct-2009 |
marcel |
o Introduce vm_sync_icache() for making the I-cache coherent with the memory or D-cache, depending on the semantics of the platform. vm_sync_icache() is basically a wrapper around pmap_sync_icache() that translates the vm_map_t argument to pmap_t. o Introduce pmap_sync_icache() in all PMAP implementations. For powerpc it replaces the pmap_page_executable() function, added to solve the I-cache problem in uiomove_fromphys(). o In proc_rwmem() call vm_sync_icache() when writing to a page that has execute permissions. This ensures that when breakpoints are written, the I-cache will be coherent and the process will actually hit the breakpoint. o This also fixes the Book-E PMAP implementation, which was missing necessary locking while trying to deal with I-cache coherency in pmap_enter() (read: mmu_booke_enter_locked).
The key property of this change is that the I-cache is made coherent *after* writes have been done. Doing it in the PMAP layer when adding or changing a mapping means that the I-cache is made coherent *before* any writes happen. The difference is key when the I-cache prefetches.
|
198201 |
18-Oct-2009 |
kib |
Remove a spurious call to priv_check(PRIV_VM_SWAP_NOQUOTA). Call priv_check(PRIV_VM_SWAP_NORLIMIT) only when the per-uid limit is actually exceeded.
Both changes aim at calling priv_check(9) only for the cases when privilege is actually exercised by the process.
Reported and tested by: rwatson Reviewed by: alc MFC after: 3 days
|
197750 |
04-Oct-2009 |
alc |
Align and pad the page queue and free page queue locks so that the linker can't possibly place them together within the same cache line.
MFC after: 3 weeks
|
197712 |
02-Oct-2009 |
bz |
Back out the functional parts from r197537. After r197711, affecting all user mappings, mmap no longer needs special treatment.
|
197661 |
01-Oct-2009 |
kib |
Move the annotation for vm_map_startup() immediately before the function.
MFC after: 3 days
|
197537 |
27-Sep-2009 |
simon |
Do not allow mmap with the MAP_FIXED argument to map at address zero. This is done to make it harder to exploit kernel NULL pointer security vulnerabilities. While this of course does not fix vulnerabilities, it does mitigate their impact.
Note that this may break some applications, most likely emulators or similar, which for one reason or another require mapping memory at zero.
This restriction can be disabled with the security.bsd.mmap_zero sysctl variable.
Discussed with: rwatson, bz Tested by: bz (Wine), simon (VirtualBox) Submitted by: jhb
|
197348 |
20-Sep-2009 |
kib |
Old (a.out) rtld attempts to mmap a zero-length region, e.g. when the bss of the linked object is zero-length. More old code assumes that an mmap of zero length returns success.
For a.out and pre-8 ELF binaries, allow the mmap of zero length.
Reported by: tegge Reviewed by: tegge, alc, jhb MFC after: 3 days
|
196730 |
01-Sep-2009 |
kib |
Reintroduce the r196640, after fixing the problem with my testing.
Remove the altkstacks; instead, instantiate threads with a kernel stack allocated with the right size from the start. For a thread that has its kernel stack cached, verify that the requested stack size is equal to the actual one, and reallocate the stack if the sizes differ [1].
This fixes the bug introduced by r173361 that was committed several days after r173004 and consisted of kthread_add(9) ignoring the non-default kernel stack size.
Also, r173361 removed the caching of kernel stacks for a non-first thread in the process. Introduce a separate kernel stack cache that keeps a limited number of preallocated kernel stacks to lower the latency of thread allocation. Add a vm_lowmem handler to prune the cache under low-memory conditions. This way, systems with a reasonable number of threads get lower thread-creation latency, while still not exhausting a significant portion of KVA on unused kstacks.
Submitted by: peter [1] Discussed with: jhb, julian, peter Reviewed by: jhb Tested by: pho (and retested according to new test scenarios) MFC after: 1 week
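The kernel stack cache described above is in essence a small LIFO of preallocated stacks with a low-memory drain hook. A minimal userspace sketch (all names here are illustrative, not the kernel's):

```c
#include <stddef.h>

#define KSTACK_CACHE_MAX	4	/* assumed cache capacity */

static void *kstack_cache[KSTACK_CACHE_MAX];
static int kstack_cache_cnt;

/* Return a stack to the cache; 0 means the cache is full and the
 * caller must really free the stack. */
int
kstack_cache_free(void *ks)
{
	if (kstack_cache_cnt == KSTACK_CACHE_MAX)
		return (0);
	kstack_cache[kstack_cache_cnt++] = ks;
	return (1);
}

/* Try the cache first on allocation; NULL means a cache miss and the
 * caller allocates a fresh stack of the right size. */
void *
kstack_cache_alloc(void)
{
	if (kstack_cache_cnt == 0)
		return (NULL);
	return (kstack_cache[--kstack_cache_cnt]);
}

/* Analogue of the vm_lowmem handler: drop all cached stacks and
 * report how many were pruned. */
int
kstack_cache_drain(void)
{
	int n = kstack_cache_cnt;

	kstack_cache_cnt = 0;
	return (n);
}
```

The LIFO order means the most recently freed (and thus most likely still cache-warm) stack is reused first.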
|
196648 |
29-Aug-2009 |
kib |
Revert r196640 and r196644 for now.
|
196640 |
29-Aug-2009 |
kib |
Remove the altkstacks; instead, instantiate threads with a kernel stack allocated with the right size from the start. For a thread that has its kernel stack cached, verify that the requested stack size is equal to the actual one, and reallocate the stack if the sizes differ [1].
This fixes the bug introduced by r173361 that was committed several days after r173004 and consisted of kthread_add(9) ignoring the non-default kernel stack size.
Also, r173361 removed the caching of kernel stacks for a non-first thread in the process. Introduce a separate kernel stack cache that keeps a limited number of preallocated kernel stacks to lower the latency of thread allocation. Add a vm_lowmem handler to prune the cache under low-memory conditions. This way, systems with a reasonable number of threads get lower thread-creation latency, while still not exhausting a significant portion of KVA on unused kstacks.
Submitted by: peter [1] Discussed with: jhb, julian, peter Reviewed by: jhb Tested by: pho MFC after: 1 week
|
196637 |
29-Aug-2009 |
jhb |
Mark the fake pages constructed by the OBJT_SG pager valid. This was accidentally lost at one point during the PAT development. Without this fix vm_pager_get_pages() was zeroing each of the pages.
Submitted by: czander @ NVidia MFC after: 3 days
|
196615 |
28-Aug-2009 |
jhb |
Extend the device pager to support different memory attributes on different pages in an object. - Add a new variant of d_mmap() currently called d_mmap2() which accepts an additional in/out parameter that is the memory attribute to use for the requested page. - A driver either uses d_mmap() or d_mmap2() for all requests but not both. The current implementation uses a flag in the cdevsw (D_MMAP2) to indicate that the driver provides a d_mmap2() handler instead of d_mmap(). This is done to make the change ABI compatible with existing drivers and MFC'able to 7 and 8.
Submitted by: alc MFC after: 1 month
|
195844 |
24-Jul-2009 |
jhb |
Remove debugging that crept in with previous commit.
Reported by: nwhitehorn Approved by: re (kib)
|
195840 |
24-Jul-2009 |
jhb |
Add a new type of VM object: OBJT_SG. An OBJT_SG object is very similar to a device pager (OBJT_DEVICE) object in that it uses fictitious pages to provide aliases to other memory addresses. The primary difference is that it uses an sglist(9) to determine the physical addresses for a given offset into the object instead of invoking the d_mmap() method in a device driver.
Reviewed by: alc Approved by: re (kensmith) MFC after: 2 weeks
|
195774 |
19-Jul-2009 |
alc |
Change the handling of fictitious pages by pmap_page_set_memattr() on amd64 and i386. Essentially, fictitious pages provide a mechanism for creating aliases for either normal or device-backed pages. Therefore, pmap_page_set_memattr() on a fictitious page needn't update the direct map or flush the cache. Such actions are the responsibility of the "primary" instance of the page or the device driver that "owns" the physical address. For example, these actions are already performed by pmap_mapdev().
The device pager needn't restore the memory attributes on a fictitious page before releasing it. It's now pointless.
Add pmap_page_set_memattr() to the Xen pmap.
Approved by: re (kib)
|
195749 |
18-Jul-2009 |
alc |
An addendum to r195649, "Add support to the virtual memory system for configuring machine-dependent memory attributes...":
Don't set the memory attribute for a "real" page that is allocated to a device object in vm_page_alloc(). It is a pointless act, because the device pager replaces this "real" page with a "fake" page and sets the memory attribute on that "fake" page.
Eliminate pointless code from pmap_cache_bits() on amd64.
Employ the "Self Snoop" feature supported by some x86 processors to avoid cache flushes in the pmap.
Approved by: re (kib)
|
195693 |
14-Jul-2009 |
jhb |
- Change mmap() to fail requests with EINVAL that pass a length of 0. This behavior is mandated by POSIX. - Do not fail requests that pass a length greater than SSIZE_MAX (such as > 2GB on 32-bit platforms). The 'len' parameter is actually an unsigned 'size_t' so negative values don't really make sense.
Submitted by: Alexander Best alexbestms at math.uni-muenster.de Reviewed by: alc Approved by: re (kib) MFC after: 1 week
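The two length rules above can be sketched as a pure check (a hypothetical helper, not the actual kernel code): zero length is rejected per POSIX, while lengths above SSIZE_MAX pass because the parameter is an unsigned size_t and "negative" values do not exist.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Sketch of the mmap length validation: only a zero length is
 * rejected; very large unsigned lengths are left for later checks.
 */
int
check_mmap_len(size_t len)
{
	if (len == 0)
		return (EINVAL);
	return (0);
}
```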
|
195649 |
12-Jul-2009 |
alc |
Add support to the virtual memory system for configuring machine- dependent memory attributes:
Rename vm_cache_mode_t to vm_memattr_t. The new name reflects the fact that there are machine-dependent memory attributes that have nothing to do with controlling the cache's behavior.
Introduce vm_object_set_memattr() for setting the default memory attributes that will be given to an object's pages.
Introduce and use pmap_page_{get,set}_memattr() for getting and setting a page's machine-dependent memory attributes. Add full support for these functions on amd64 and i386 and stubs for them on the other architectures. The function pmap_page_set_memattr() is also responsible for any other machine-dependent aspects of changing a page's memory attributes, such as flushing the cache or updating the direct map. The uses include kmem_alloc_contig(), vm_page_alloc(), and the device pager:
kmem_alloc_contig() can now be used to allocate kernel memory with non-default memory attributes on amd64 and i386.
vm_page_alloc() and the device pager will set the memory attributes for the real or fictitious page according to the object's default memory attributes.
Update the various pmap functions on amd64 and i386 that map pages to incorporate each page's memory attributes in the mapping.
Notes: (1) Inherent to this design are safety features that prevent the specification of inconsistent memory attributes by different mappings on amd64 and i386. In addition, the device pager provides a warning when a device driver creates a fictitious page with memory attributes that are inconsistent with the real page that the fictitious page is an alias for. (2) Storing the machine-dependent memory attributes for amd64 and i386 as a dedicated "int" in "struct md_page" represents a compromise between space efficiency and the ease of MFCing these changes to RELENG_7.
In collaboration with: jhb
Approved by: re (kib)
|
195635 |
12-Jul-2009 |
kib |
When VM_MAP_WIRE_HOLESOK is not specified and vm_map_wire(9) encounters a non-readable and non-executable map entry, the entry is skipped from wiring and the loop is aborted. But, since MAP_ENTRY_WIRE_SKIPPED was not set for the map entry, its wired_count is later erroneously decremented, and vm_map_delete(9) for such a map entry gets stuck in "vmmaps".
Properly set MAP_ENTRY_WIRE_SKIPPED when aborting the loop.
Reported by: John Marshall <john.marshall riverwillow com au> Approved by: re (kensmith)
|
195329 |
03-Jul-2009 |
kib |
When forking a vm space that has wired map entries, do not forget to charge the objects created by vm_fault_copy_entry. The object charge was set, but the reserve was not incremented.
Reported by: Greg Rivers <gcr+freebsd-current tharned org> Reviewed by: alc (previous version) Approved by: re (kensmith)
|
195131 |
28-Jun-2009 |
kib |
Eliminate code duplication by calling vm_object_destroy() from vm_object_collapse().
Requested and reviewed by: alc Approved by: re (kensmith)
|
195033 |
26-Jun-2009 |
alc |
This change is the next step in implementing the cache control functionality required by video card drivers. Specifically, this change introduces vm_cache_mode_t with an appropriate VM_CACHE_DEFAULT definition on all architectures. In addition, this change adds a vm_cache_mode_t parameter to kmem_alloc_contig() and vm_phys_alloc_contig(). These will be the interfaces for allocating mapped kernel memory and physical memory, respectively, with non-default cache modes.
In collaboration with: jhb
|
194990 |
25-Jun-2009 |
kib |
Change the type of the uio_resid member of struct uio from int to ssize_t. Note that this does not actually enable full-range i/o requests for 64-bit architectures; it is done now to update the KBI only.
Tested by: pho Reviewed by: jhb, bde (as part of the review of the bigger patch)
|
194814 |
24-Jun-2009 |
kib |
Initialize the uip to silence a gcc warning that appears in some build environments.
Reported by: alc, bf1783 at googlemail com
|
194806 |
24-Jun-2009 |
alc |
The bits set in a page's dirty mask are a subset of the bits set in its valid mask. Consequently, there is no need to perform a bit-wise and of the page's dirty and valid masks in order to determine which parts of a page are dirty and valid.
Eliminate an unnecessary #include.
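The invariant above can be illustrated with the bit-per-block mask style used for a page's valid and dirty state: because dirty is a subset of valid, (dirty & valid) is already equal to dirty, making the AND redundant. A minimal sketch (assumed 16-bit masks for illustration):

```c
#include <stdint.h>

/*
 * Returns 1 when the dirty mask is a subset of the valid mask, in
 * which case (dirty & valid) == dirty and the bit-wise AND adds
 * nothing.
 */
int
dirty_subset_redundant(uint16_t valid, uint16_t dirty)
{
	return ((dirty & ~valid) == 0 && (dirty & valid) == dirty);
}
```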
|
194766 |
23-Jun-2009 |
kib |
Implement global and per-uid accounting of the anonymous memory. Add rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved for the uid.
The accounting information (charge) is associated with either map entry, or vm object backing the entry, assuming the object is the first one in the shadow chain and entry does not require COW. Charge is moved from entry to object on allocation of the object, e.g. during the mmap, assuming the object is allocated, or on the first page fault on the entry. It moves back to the entry on forks due to COW setup.
The per-entry granularity of accounting makes the charge process fair for processes that change uid during lifetime, and decrements charge for proper uid when region is unmapped.
The interface of vm_pager_allocate(9) is extended by adding struct ucred *, which is used to charge the appropriate uid when allocation is performed by the kernel, e.g. md(4).
Several syscalls, among them is fork(2), may now return ENOMEM when global or per-uid limits are enforced.
In collaboration with: pho Reviewed by: alc Approved by: re (kensmith)
|
194642 |
22-Jun-2009 |
alc |
Validate the page in one place, dev_pager_getpages(), rather than doing it in two places, dev_pager_getfake() and dev_pager_updatefake().
Compare a pointer to "NULL" rather than "0".
|
194607 |
21-Jun-2009 |
alc |
Implement a mechanism within vm_phys_alloc_contig() to defer all necessary calls to vdrop() until after the free page queues lock is released. This eliminates repeatedly releasing and reacquiring the free page queues lock each time the last cached page is reclaimed from a vnode-backed object.
|
194562 |
21-Jun-2009 |
alc |
Strive for greater consistency among the places that implement real, fictitious, and contiguous page allocation. Eliminate unnecessary reinitialization of a page's fields.
|
194459 |
18-Jun-2009 |
thompsa |
Track the kernel mapping of a physical page with a new entry in the vm_page structure. When the page is shared, the kernel mapping becomes a special type of managed page to force the cache off the page mappings. This is needed to avoid stale entries on all ARM VIVT caches, and on VIPT caches with the cache color issue.
Submitted by: Mark Tinguely Reviewed by: alc Tested by: Grzegorz Bernacki, thompsa
|
194429 |
18-Jun-2009 |
alc |
Add support for UMA_SLAB_KERNEL to page_free(). (While I'm here remove an unnecessary newline character from the end of two panic messages.)
|
194393 |
17-Jun-2009 |
alc |
Eliminate unnecessary forward declarations.
|
194376 |
17-Jun-2009 |
alc |
Refactor contigmalloc() into two functions: a simple front-end that deals with the malloc tag and calls a new back-end, kmem_alloc_contig(), that allocates the pages and maps them.
The motivations for this change are two-fold: (1) A cache mode parameter will be added to kmem_alloc_contig(). In other words, kmem_alloc_contig() will be extended to support the allocation of memory with caller-specified caching. (2) The UMA allocation function that is used by the two jumbo frames zones can use kmem_alloc_contig() in place of contigmalloc() and thereby avoid having free jumbo frames held by the zone counted as live malloc()ed memory.
|
194337 |
17-Jun-2009 |
alc |
Pass the size of the mapping to contigmapping() as a "vm_size_t" rather than a "vm_pindex_t". A "vm_size_t" is more convenient for it to use.
|
194331 |
17-Jun-2009 |
alc |
Make the maintenance of a page's valid bits by contigmalloc() more like kmem_alloc() and kmem_malloc(). Specifically, defer the setting of the page's valid bits until contigmapping() when the mapping is known to be successful.
|
194209 |
14-Jun-2009 |
alc |
Long, long ago in r27464 special case code for mapping device-backed memory with 4MB pages was added to pmap_object_init_pt(). This code assumes that the pages of an OBJT_DEVICE object are always physically contiguous. Unfortunately, this is not always the case. For example, jhb@ informs me that the recently introduced /dev/ksyms driver creates an OBJT_DEVICE object that violates this assumption. Thus, this revision modifies pmap_object_init_pt() to abort the mapping if the OBJT_DEVICE object's pages are not physically contiguous. This revision also changes some inconsistent, if not buggy, behavior. For example, the i386 version aborts if the first 4MB virtual page that would be mapped is already valid. However, it incorrectly replaces any subsequent 4MB virtual page mappings that it encounters, potentially leaking a page table page. The amd64 version has a bug of my own creation. It potentially busies the wrong page and always busies an insufficient number of pages if it blocks allocating a page table page.
To my knowledge, there have been no reports of these bugs, hence, their persistence. I suspect that the existing restrictions that pmap_object_init_pt() placed on the OBJT_DEVICE objects that it would choose to map, for example, that the first page must be aligned on a 2 or 4MB physical boundary and that the size of the mapping must be a multiple of the large page size, were enough to avoid triggering the bug for drivers like ksyms. However, one side effect of testing the OBJT_DEVICE object's pages for physical contiguity is that a dubious difference between pmap_object_init_pt() and the standard path for mapping device pages, i.e., vm_fault(), has been eliminated. Previously, pmap_object_init_pt() would only instantiate the first PG_FICTITIOUS page being mapped because it never examined the rest. Now, however, pmap_object_init_pt() uses the new function vm_object_populate() to instantiate them all (in order to support testing their physical contiguity). These pages need to be instantiated for the mechanism that I have prototyped for automatically maintaining the consistency of the PAT settings across multiple mappings, particularly, amd64's direct mapping, to work. (Translation: This change is also being made to support jhb@'s work on the Nvidia feature requests.)
Discussed with: jhb@
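The contiguity test described above amounts to checking that the pages' physical addresses form a single run. A minimal sketch (hypothetical helper, operating on an array of physical addresses rather than vm_page structures):

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE	4096	/* assumed base page size */

/*
 * Returns 1 when each page's physical address immediately follows the
 * previous one, i.e. the pages could be covered by one large mapping.
 */
int
pages_physically_contiguous(const uintptr_t *pa, size_t npages)
{
	size_t i;

	for (i = 1; i < npages; i++)
		if (pa[i] != pa[i - 1] + PAGE_SIZE)
			return (0);
	return (1);
}
```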
|
194126 |
13-Jun-2009 |
alc |
Eliminate an unnecessary clearing of a page's dirty bits in phys_pager_getpages().
|
193842 |
09-Jun-2009 |
alc |
Eliminate an unnecessary restriction on the vm object type from vm_map_pmap_enter(). The immediate effect of this change is that automatic prefaulting by mmap() for small mappings is performed on POSIX shared memory objects just the same as it is on ordinary files.
|
193643 |
07-Jun-2009 |
alc |
Eliminate unnecessary obfuscation when testing a page's valid bits.
|
193594 |
06-Jun-2009 |
alc |
Eliminate an unneeded forward declaration. (This should have been removed in revision 1.42.)
|
193593 |
06-Jun-2009 |
alc |
If vm_pager_get_pages() returns VM_PAGER_OK, then there is no need to check the page's valid bits. The page is guaranteed to be fully valid. (For the record, this is documented in vm/vm_pager.h's comments.)
|
193522 |
05-Jun-2009 |
alc |
vm_thread_swapin() needn't validate any pages. The pages are already validated by vm_pager_get_pages().
|
193521 |
05-Jun-2009 |
alc |
Simplify contigfree().
|
193511 |
05-Jun-2009 |
rwatson |
Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include.
Discussed with: pjd
|
193303 |
02-Jun-2009 |
alc |
Correct a boundary case error in the management of a page's dirty bits by shm_dotruncate() and vnode_pager_setsize(). Specifically, if the length of a shared memory object or a file is truncated such that the length modulo the page size is between 1 and 511, then all of the page's dirty bits were cleared. Now, a dirty bit is cleared only if the corresponding block is truncated in its entirety.
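The corrected rule above can be sketched with the bit-per-512-byte-block dirty mask: on truncation, only the bits for blocks that lie entirely beyond the new end are cleared, while a partially truncated block keeps its dirty bit. A minimal model (assumed 4096-byte page, 8-bit mask; 'base' is the new length modulo the page size):

```c
#include <stdint.h>

#define DEV_BSIZE	512
#define PAGE_SIZE	4096

/*
 * Clear the dirty bits only for the blocks wholly past the new end.
 * With 1 <= base <= 511, block 0 is partially truncated and keeps its
 * dirty bit (the old code cleared all of them).
 */
uint8_t
truncate_dirty(uint8_t dirty, int base)
{
	int first_gone;

	/* Index of the first block that lies entirely beyond 'base'. */
	first_gone = (base + DEV_BSIZE - 1) / DEV_BSIZE;
	return (dirty & (uint8_t)((1u << first_gone) - 1));
}
```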
|
193275 |
01-Jun-2009 |
jhb |
Add an extension to the character device interface that allows character device drivers to use arbitrary VM objects to satisfy individual mmap() requests. - A new d_mmap_single(cdev, &foff, objsize, &object, prot) callback is added to cdevsw. This function is called for each mmap() request. If it returns ENODEV, then the mmap() request will fall back to using the device's device pager object and d_mmap(). Otherwise, the method can return a VM object to satisfy this entire mmap() request via *object. It can also modify the starting offset into this object via *foff. This allows device drivers to use the file offset as a cookie to identify specific VM objects. - vm_mmap_vnode() has been changed to call vm_mmap_cdev() directly when mapping V_CHR vnodes. This avoids duplicating all the cdev mmap handling code and simplifies some of vm_mmap_vnode(). - D_VERSION has been bumped to D_VERSION_02. Older device drivers using D_VERSION_01 are still supported.
MFC after: 1 month
|
193126 |
30-May-2009 |
alc |
Eliminate a stale comment and the two remaining uses of the "register" keyword in this file.
|
193124 |
30-May-2009 |
alc |
Add assertions in two places where a page's valid or dirty bits are changed.
|
192968 |
28-May-2009 |
alc |
Change vm_object_page_remove() such that it clears the page's dirty bits when it invalidates the page.
Suggested by: tegge
|
192962 |
28-May-2009 |
alc |
Revise vm_pageout_scan()'s handling of partially dirty pages. Specifically, rather than unconditionally making partially dirty pages fully dirty, only make partially dirty pages fully dirty if the pmap says that the page has been modified.
(This change is also a small optimization. It eliminates an unnecessary call to pmap_is_modified() on pages that are mapped read-only.)
Suggested by: tegge
|
192360 |
19-May-2009 |
kmacy |
- back out direct map hack - it is no longer needed
|
192261 |
17-May-2009 |
alc |
Eliminate a pointless call to pmap_clear_reference() from vm_pageout_scan(). If the page belongs to an object with a reference count of zero, then it can't have any managed mappings on which to clear a reference bit.
|
192207 |
16-May-2009 |
kmacy |
apply band-aid to x86_64 systems with more physical memory than kmem by allocating from the direct map
|
192134 |
15-May-2009 |
alc |
Eliminate unnecessary clearing of the page's dirty mask from various getpages functions.
Eliminate a stale comment.
|
192034 |
13-May-2009 |
alc |
Eliminate page queues locking from bufdone_finish() through the following changes:
Rename vfs_page_set_valid() to vfs_page_set_validclean() to reflect what this function actually does. Suggested by: tegge
Introduce a new version of vfs_page_set_valid() that does no more than what the function's name implies. Specifically, it does not update the page's dirty mask, and thus it does not require the page queues lock to be held.
Update two of the three callers to the old vfs_page_set_valid() to call vfs_page_set_validclean() instead because they actually require the page's dirty mask to be cleared.
Introduce vm_page_set_valid().
Reviewed by: tegge
|
192010 |
12-May-2009 |
alc |
Eliminate gratuitous clearing of the page's dirty mask.
|
191935 |
09-May-2009 |
alc |
Fix a race involving vnode_pager_input_smlfs(). Specifically, in the case that vnode_pager_input_smlfs() zeroes the page, it should not mark the page as valid until after the page is zeroed. Otherwise, the page could be mapped for read access (e.g., by vm_map_pmap_enter()) before the page is zeroed. Reviewed by: tegge
Eliminate gratuitous clearing of the page's dirty mask by vnode_pager_input_smlfs(). Instead, assert that the page is clean. Reviewed by: tegge
Eliminate some blank lines.
Eliminate pointless calls to pmap_clear_modify() and vm_page_undirty() from vnode_pager_input_old(). The page is not mapped. Therefore, it cannot have any page table entries that are modified.
Eliminate an incorrect comment from vnode_pager_generic_getpages().
|
191874 |
07-May-2009 |
alc |
Eliminate an incorrect comment.
|
191778 |
04-May-2009 |
alc |
Eliminate vnode_pager_input_smlfs()'s pointless call to pmap_clear_modify(). The page can't possibly have any modified page table entries because it isn't even mapped.
|
191626 |
28-Apr-2009 |
kib |
Use the acquired reference to the vmspace instead of directly dereferencing p->p_vmspace in a place where it was missed in r191277.
Noted by: pluknet gmail com
|
191625 |
28-Apr-2009 |
kib |
Fix typo.
|
191543 |
26-Apr-2009 |
alc |
Eliminate an errant comment.
Discussed with: tegge
|
191531 |
26-Apr-2009 |
alc |
Eliminate an archaic band-aid. The immediately preceding comment already explains why the band-aid is unnecessary.
Suggested by: tegge
|
191478 |
25-Apr-2009 |
alc |
Eliminate unnecessary calls to pmap_clear_modify(). Specifically, calling pmap_clear_modify() on a page is pointless if that page is not mapped or it is only mapped for read access. Instead, assert that the page is not mapped or not mapped for write access as appropriate.
Eliminate unnecessary clearing of a page's dirty mask. Instead, assert that the page's dirty mask is clear.
|
191439 |
23-Apr-2009 |
kib |
Do not call vm_page_lookup() from the ddb routine, namely from the "show vmopag" implementation. The vm_page_lookup() code modifies the splay tree of the object's pages, and asserts that the object lock is taken. The first issue could cause kernel data corruption, and the second one instantly panics an INVARIANTS-enabled kernel.
Take advantage of the fact that object->memq is ordered by page index, and iterate over memq to calculate the runs.
While there, make the code slightly more style-compliant by moving variables declarations to the right place.
Discussed with: jhb, alc Reviewed by: alc MFC after: 2 weeks
|
191277 |
19-Apr-2009 |
kib |
In both the pageout oom handler and vm_daemon, acquire a reference to the vmspace of the examined process instead of directly accessing its vmspace, which may change. Also, as an optimization, check for the P_INEXEC flag before examining the process.
Reported and tested by: pho (previous version) Reviewed by: alc MFC after: 3 weeks
|
191263 |
19-Apr-2009 |
alc |
Calling pmap_clear_modify() after calling pmap_remove_write() is pointless. The latter function already clears the modified status from each of the page's mappings.
|
191256 |
19-Apr-2009 |
alc |
Allow valid pages to be mapped for read access when they have a non-zero busy count. Only mappings that allow write access should be prevented by a non-zero busy count.
(The prohibition on mapping pages for read access when they have a non-zero busy count originated in revision 1.202 of i386/i386/pmap.c when this code was a part of the pmap.)
Reviewed by: tegge
|
190949 |
11-Apr-2009 |
alc |
Remove execute permission from the memory allocated by sbrk().
Pre-announced on: -arch (3/31/09) Discussed with: rwatson Tested by: marius (sparc64)
|
190912 |
11-Apr-2009 |
alc |
Previously, when vm_page_free_toq() was performed on a page belonging to a reservation, unless all of the reservation's pages were free, the reservation was moved to the head of the partially-populated reservations queue, where it would be the next reservation to be broken in case the free page queues were emptied. Now, instead, I am moving it to the tail. Very likely this reservation is in the process of being freed in its entirety, so placing it at the tail of the queue makes it more likely that the underlying physical memory will be returned to the free page queues as one contiguous chunk. If a reservation must be broken, it will, instead, be the longest unchanged reservation, which is arguably the reservation that is least likely to ever achieve promotion or be freed in its entirety.
MFC after: 6 weeks
|
190886 |
10-Apr-2009 |
kib |
When vm_map_wire(9) is allowed to skip holes in the wired region, also skip mappings that have neither read nor execute rights, in particular the PROT_NONE entries. This makes mlockall(2) work for a process address space that contains such mappings.
Since protection mode of the entry may change between setting MAP_ENTRY_IN_TRANSITION and final pass over the region that records the wire status of the entries, allocate new map entry flag MAP_ENTRY_WIRE_SKIPPED to mark the skipped PROT_NONE entries.
Reported and tested by: Hans Ottevanger <fbsdhackers beasties demon nl> Reviewed by: alc MFC after: 3 weeks
|
190705 |
04-Apr-2009 |
alc |
Retire VM_PROT_READ_IS_EXEC. It was intended to be a micro-optimization, but I see no benefit from it today.
VM_PROT_READ_IS_EXEC was only intended for use on processors that do not distinguish between read and execute permission. On an mmap(2) or mprotect(2), it automatically added execute permission if the caller-specified permissions included read permission. The hope was that this would reduce the number of vm map entries needed to implement an address space because there would be fewer neighboring vm map entries that differed only in the presence or absence of VM_PROT_EXECUTE. (See vm/vm_mmap.c revision 1.56.)
Today, I don't see any real applications that benefit from VM_PROT_READ_IS_EXEC. In any case, vm map entries are now organized as a self-adjusting binary search tree instead of an ordered list. So, the need for coalescing vm map entries is not as great as it once was.
|
190604 |
01-Apr-2009 |
alc |
Eliminate dead code.
Reviewed by: jhb
|
189595 |
09-Mar-2009 |
jhb |
Adjust some variables (mostly related to the buffer cache) that hold address space sizes to be longs instead of ints. Specifically, the following values are now longs: runningbufspace, bufspace, maxbufspace, bufmallocspace, maxbufmallocspace, lobufspace, hibufspace, lorunningspace, hirunningspace, maxswzone, maxbcache, and maxpipekva. Previously, a relatively small number (~ 44000) of buffers set in kern.nbuf would result in integer overflows resulting either in hangs or bogus values of hidirtybuffers and lodirtybuffers. Now one has to overflow a long to see such problems. There was a check for an nbuf setting that would cause overflows in the auto-tuning of nbuf. I've changed it to always check and cap nbuf but warn if a user-supplied tunable would cause overflow.
Note that this changes the ABI of several sysctls that are used by things like top(1), etc., so any MFC would probably require some gross shims to allow for that.
MFC after: 1 month
|
189024 |
25-Feb-2009 |
alc |
Prior to r188331 a map entry's last read offset was only updated by a hard fault. In r188331 this update was relocated because of synchronization changes to a place where it would occur on both hard and soft faults. This change again restricts the update to hard faults.
|
189015 |
24-Feb-2009 |
kib |
Revert the addition of the freelist argument for the vm_map_delete() function, done in r188334. Instead, collect the entries that shall be freed, in the deferred_freelist member of the map. Automatically purge the deferred freelist when map is unlocked.
Tested by: pho Reviewed by: alc
|
189014 |
24-Feb-2009 |
kib |
Add the assertion macros for the map locks. Use them in several map manipulation functions.
Tested by: pho Reviewed by: alc
|
189012 |
24-Feb-2009 |
kib |
Update the comment after the r188334.
Reviewed by: alc
|
189004 |
24-Feb-2009 |
rdivacky |
Change the functions to ANSI style in those cases where the K&R form breaks the promotion-to-int rule. See ISO C Standard: SS6.7.5.3:15.
Approved by: kib (mentor) Reviewed by: warner Tested by: silence on -current
|
188967 |
23-Feb-2009 |
rwatson |
Put debug.vm_lowmem sysctl under DIAGNOSTIC.
Submitted by: sam MFC after: 3 days
|
188964 |
23-Feb-2009 |
rwatson |
Add a debugging sysctl, debug.vm_lowmem, that when assigned a value of 1 will trigger a pass through the VM's low-memory handlers, such as protocol and UMA drain routines. This makes it easier to exercise these otherwise rarely-invoked code paths.
MFC after: 3 days
|
188900 |
21-Feb-2009 |
alc |
Reduce the scope of the page queues lock in vm_object_page_remove().
MFC after: 1 week
|
188859 |
20-Feb-2009 |
alc |
Eliminate stale comments.
|
188386 |
09-Feb-2009 |
kib |
Comment out the assertion from r188321. It is not valid for nfs.
Reported by: alc
|
188383 |
09-Feb-2009 |
alc |
Avoid some cases of unnecessary page queues locking by vm_fault's delete-behind heuristic.
|
188348 |
08-Feb-2009 |
alc |
Eliminate OBJ_NEEDGIANT. After r188331, OBJ_NEEDGIANT's only use is by a redundant assertion in vm_fault().
Reviewed by: kib
|
188337 |
08-Feb-2009 |
kib |
Remove no longer valid comment.
Submitted by: alc
|
188335 |
08-Feb-2009 |
kib |
Improve comments, correct English.
Submitted by: alc
|
188334 |
08-Feb-2009 |
kib |
Do not call vm_object_deallocate() from vm_map_delete(), because we hold the map lock there and might need the vnode lock for OBJT_VNODE objects. Postpone object deallocation until the caller of vm_map_delete() drops the map lock. Link the map entries to be freed into a freelist that is released by the new helper function vm_map_entry_free_freelist().
Reviewed by: tegge, alc Tested by: pho
|
188333 |
08-Feb-2009 |
kib |
In vm_map_sync(), do not call vm_object_sync() while holding the map lock. Reference the object, drop the map lock, and then call vm_object_sync(). The object sync might require the vnode lock for OBJT_VNODE objects.
Reviewed by: tegge Tested by: pho
|
188331 |
08-Feb-2009 |
kib |
Do not sleep for the vnode lock while holding the map lock in vm_fault. Try to acquire the vnode lock for an OBJT_VNODE object after the map lock is dropped. Because we have the busy page(s) in the object, sleeping there would result in a deadlock with vnode resize. Try to get the lock without sleeping, and, if the attempt fails, drop the state, lock the vnode, and restart the fault handler from the start with the vnode already locked.
Because the vnode_pager_lock() function is inlined in vm_fault(), axe it.
Based on suggestion by: alc Reviewed by: tegge, alc Tested by: pho
|
188325 |
08-Feb-2009 |
kib |
Add comments to vm_map_simplify_entry() and vmspace_fork(), describing why several calls to vm_object_deallocate() with the map locked do not result in the acquisition of the vnode lock after the map lock.
Suggested and reviewed by: tegge
|
188323 |
08-Feb-2009 |
kib |
Lock the new map in vmspace_fork(). The newly allocated map should not be accessible outside vmspace_fork() yet, but locking it would satisfy the protocol of the vm_map_entry_link() and other functions called from vmspace_fork().
Use a trylock that supposedly cannot fail, to silence the WITNESS warning about nested acquisition of an sx lock with the same name.
Suggested and reviewed by: tegge
|
188321 |
08-Feb-2009 |
kib |
Assert that vnode is exclusively locked when its vm object is resized.
Reviewed by: tegge
|
188320 |
08-Feb-2009 |
kib |
Do not leak the MAP_ENTRY_IN_TRANSITION flag when copying a map entry on fork. Otherwise, the copied entry cannot be removed in the child map.
Reviewed by: tegge MFC after: 2 weeks
|
188319 |
08-Feb-2009 |
kib |
Style.
|
187681 |
25-Jan-2009 |
jeff |
- Make the keg abstraction more complete. Permit a zone to have multiple backend kegs so it may source compatible memory from multiple backends. This is useful for cases such as NUMA or different layouts for the same memory type. - Provide a new API for adding new backend kegs to secondary zones. - Provide a new flag for adjusting the layout of zones to stagger allocations better across cache lines.
Sponsored by: Nokia
|
187658 |
23-Jan-2009 |
jhb |
- Mark all standalone INT/LONG/QUAD sysctl's MPSAFE. This is done inside the SYSCTL() macros and thus does not need to be done for all of the nodes scattered across the source tree. - Mark the name-cache related sysctl's (including debug.hashstat.*) MPSAFE. - Mark vm.loadavg MPSAFE. - Remove GIANT_REQUIRED from vmtotal() (everything in this routine already has sufficient locking) and mark vm.vmtotal MPSAFE. - Mark the vm.stats.(sys|vm).* sysctls MPSAFE.
|
187527 |
21-Jan-2009 |
jhb |
Now that vfs_markatime() no longer requires an exclusive lock due to the VOP_MARKATIME() changes, use a shared vnode lock for mmap().
Submitted by: ups
|
186719 |
03-Jan-2009 |
kib |
Extend the struct vm_page wire_count to u_int to avoid overflow of the counter, which may happen when too many sendfile(2) calls are being executed with this vnode [1].
To keep the size of the struct vm_page and the offsets of the fields accessed by out-of-tree modules, swap the types and locations of the wire_count and cow fields. Add safety checks to detect cow overflow and force a fallback to the normal copy code for zero-copy sockets. [2]
Reported by: Anton Yuzhaninov <citrin citrin ru> [1] Suggested by: alc [2] Reviewed by: alc MFC after: 2 weeks
|
186665 |
01-Jan-2009 |
alc |
Resurrect shared map locks allowing greater concurrency during some map operations, such as page faults.
An earlier version of this change was ...
Reviewed by: kib Tested by: pho MFC after: 6 weeks
|
186633 |
31-Dec-2008 |
alc |
Update or eliminate some stale comments.
|
186618 |
30-Dec-2008 |
alc |
Avoid an unnecessary memory dereference in vm_map_entry_splay().
|
186616 |
30-Dec-2008 |
alc |
Style change to vm_map_lookup(): Eliminate a macro of dubious value.
|
186609 |
30-Dec-2008 |
alc |
Move the implementation of the vm map's fast path on address lookup from vm_map_lookup{,_locked}() to vm_map_lookup_entry(). Having the fast path in vm_map_lookup{,_locked}() limits its benefits to page faults. Moving it to vm_map_lookup_entry() extends its benefits to other operations on the vm map.
|
186374 |
21-Dec-2008 |
rnoland |
Fix printing of KASSERT message missed in r163604.
Approved by: kib
|
185012 |
16-Nov-2008 |
kib |
Instead of forcing vn_start_write() to reset mp back to NULL for failed calls with a non-NULL vp, explicitly clear mp after failure.
Tested by: stass Reviewed by: tegge PR: 123768 MFC after: 1 week
|
184728 |
06-Nov-2008 |
raj |
Support kernel crash mini dumps on ARM architecture.
Obtained from: Juniper Networks, Semihalf
|
184546 |
02-Nov-2008 |
keramida |
Various comment nits, and typos.
|
184168 |
22-Oct-2008 |
rwatson |
Update mmap() comment: no more block devices, so no more block device cache coherency questions.
MFC after: 3 days
|
183754 |
10-Oct-2008 |
attilio |
Remove the useless struct thread argument from the bufobj interface. In particular, the KPI of the following functions is modified: - bufobj_invalbuf() - bufsync()
as well as the BO_SYNC() "virtual method" of the buffer object set. The main consumers of the bufobj functions are affected by this change too; in particular, the functions whose KPI changed are: - vinvalbuf() - g_vfs_close()
Due to the KPI breakage, __FreeBSD_version will be bumped in a later commit.
As a side note, please consider the 'curthread' argument passed to VOP_SYNC() (in bufsync()) as just temporary; it will be axed out ASAP
Reviewed by: kib Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
|
183474 |
29-Sep-2008 |
kib |
Move the code for doing out-of-memory handling from vm_pageout_scan() into the separate function vm_pageout_oom(). Supply a parameter for vm_pageout_oom() describing the reason for the call.
Call vm_pageout_oom() from the swp_pager_meta_build() when swap zone is exhausted.
Reviewed by: alc Tested by: pho, jhb MFC after: 2 weeks
|
183389 |
26-Sep-2008 |
emaste |
Move CTASSERT from header file to source file, per implementation note now in the CTASSERT man page.
|
183383 |
26-Sep-2008 |
kib |
Save the previous content of td_fpop before storing the current file descriptor into it. Make sure that td_fpop is NULL when calling d_mmap from dev_pager_getpages().
The change guards against the td_fpop field being non-NULL with private state for another device, and against sudden clearing of td_fpop. This could occur when either a driver method calls another driver through a file descriptor operation, or a page fault happens while a driver is writing to memory backed by another driver.
Noted by: rwatson Tested by: rnoland MFC after: 3 days
|
183236 |
21-Sep-2008 |
alc |
Prevent an integer overflow in vm_pageout_page_stats() on machines with a large number of physical pages.
PR: 126158 Submitted by: Dmitry Tejblum MFC after: 3 days
|
183216 |
20-Sep-2008 |
kib |
Allow the d_mmap driver methods to use cdevpriv KPI during verification phase of establishing mapping.
Discussed with: rwatson, jhb, rnoland Tested by: rnoland MFC after: 3 days
|
182371 |
28-Aug-2008 |
attilio |
Decontextualize the couplet VOP_GETATTR / VOP_SETATTR, as the passed thread was always curthread and totally useless.
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
|
182047 |
23-Aug-2008 |
antoine |
Remove unused variable nosleepwithlocks.
PR: 126609 Submitted by: Mateusz Guzik MFC after: 1 month X-MFC: to stable/7 only, this variable is still used in stable/6
|
182028 |
23-Aug-2008 |
nwhitehorn |
Allow the MD UMA allocator to use VM routines like kmem_*(). Existing code requires the MD allocator to be available early in the boot process, before the VM is fully available. This adds a new VM option (UMA_MD_SMALL_ALLOC_NEEDS_VM) that allows an MD UMA small allocator to become available at the same time as the default UMA allocator.
Approved by: marcel (mentor)
|
181887 |
20-Aug-2008 |
julian |
A bunch of formatting fixes brought to light by, or created by, the Vimage commit a few days ago.
|
181811 |
17-Aug-2008 |
kmacy |
Work around differences in page allocation for initial page tables on xen
MFC after: 1 month
|
181693 |
13-Aug-2008 |
emaste |
Fix REDZONE(9) on amd64 and perhaps other 64 bit targets -- ensure the space that redzone adds to the allocation for storing its metadata is at least as large as the metadata that it will store there.
Submitted by: Nima Misaghian
|
181334 |
05-Aug-2008 |
jhb |
If a thread that is swapped out is made runnable, then the setrunnable() routine wakes up proc0 so that proc0 can swap the thread back in. Historically, this has been done by waking up proc0 directly from setrunnable() itself via a wakeup(). When waking up a sleeping thread that was swapped out (the usual case when waking proc0 since only sleeping threads are eligible to be swapped out), this resulted in a bit of recursion (e.g. wakeup() -> setrunnable() -> wakeup()).
With sleep queues having separate locks in 6.x and later, this caused a spin lock LOR (sleepq lock -> sched_lock/thread lock -> sleepq lock). An attempt was made to fix this in 7.0 by making the proc0 wakeup use the ithread mechanism for doing the wakeup. However, this required grabbing proc0's thread lock to perform the wakeup. If proc0 was asleep elsewhere in the kernel (e.g. waiting for disk I/O), then this degenerated into the same LOR since the thread lock would be some other sleepq lock.
Fix this by deferring the wakeup of the swapper until after the sleepq lock held by the upper layer has been locked. The setrunnable() routine now returns a boolean value to indicate whether or not proc0 needs to be woken up. The end result is that consumers of the sleepq API such as *sleep/wakeup, condition variables, sx locks, and lockmgr, have to wakeup proc0 if they get a non-zero return value from sleepq_abort(), sleepq_broadcast(), or sleepq_signal().
Discussed with: jeff Glanced at by: sam Tested by: Jurgen Weber jurgen - ish com au MFC after: 2 weeks
|
181239 |
03-Aug-2008 |
trhodes |
Fill in a few sysctl descriptions.
Reviewed by: alc, Matt Dillon <dillon@apollo.backplane.com> Approved by: alc
|
181024 |
30-Jul-2008 |
jhb |
One more whitespace nit.
|
181020 |
30-Jul-2008 |
jhb |
A few more whitespace fixes.
|
181019 |
30-Jul-2008 |
jhb |
If the kernel has run out of metadata for swap, then explicitly panic() instead of emitting a warning before deadlocking.
MFC after: 1 month
|
181004 |
30-Jul-2008 |
kib |
The behaviour of the lockmgr, going back at least to 4.4BSD-Lite2, was to downgrade the exclusive lock to a shared one when the exclusive lock owner requested a shared lock. The new lockmgr panics instead.
The vnode_pager_lock function requests a shared lock on the vnode backing the OBJT_VNODE object, and can be called when the current thread already holds an exclusive lock on the vnode. For instance, this happens when handling a page fault from the VOP_WRITE() uiomove that writes to the file, with the faulted-in page fetched from the vm object backed by the same file. We then get the situation described above.
Verify whether the vnode is already exclusively locked by the curthread and request recursed exclusive vnode lock instead of shared, if true.
Reported by: gallatin Discussed with: attilio
|
180598 |
18-Jul-2008 |
alc |
Eliminate stale comments from kmem_malloc().
|
180446 |
11-Jul-2008 |
kib |
Use VM_ALLOC_INTERRUPT for the page requests when allocating memory for the bio for a swapout write. This allows the page allocator to drain the free page list deeper. As a result, a deadlock where the pageout daemon sleeps waiting for a bio to be allocated for swapout is no longer reproducible in practice.
Alan said that M_USE_RESERVE shall be resurrected and used there, but until that is implemented, M_NOWAIT does exactly what is needed.
Tested by: pho, kris Reviewed by: alc No objections from: phk MFC after: 2 weeks (RELENG_7 only)
|
180308 |
05-Jul-2008 |
alc |
Enable the creation of a kmem map larger than 4GB. Submitted by: Tz-Huan Huang
Make several variables related to kmem map auto-sizing static. Found by: CScout
|
179923 |
22-Jun-2008 |
alc |
Make preparations for increasing the size of the kernel virtual address space on the amd64 architecture. The amd64 architecture requires kernel code and global variables to reside in the highest 2GB of the 64-bit virtual address space. Thus, the memory allocated during bootstrap, before the call to kmem_init(), starts at KERNBASE, which is not necessarily the same as VM_MIN_KERNEL_ADDRESS on amd64.
|
179921 |
21-Jun-2008 |
alc |
KERNBASE is not necessarily an address within the kernel map, e.g., PowerPC/AIM. Consequently, it should not be used to determine the maximum number of kernel map entries. Instead, use VM_MIN_KERNEL_ADDRESS, which marks the start of the kernel map on all architectures.
Tested by: marcel@ (PowerPC/AIM)
|
179765 |
12-Jun-2008 |
ups |
Fix vm object creation locking to allow SHARED vnode locking for vnode_create_vobject. (Not currently used)
Noticed by: kib@
|
179623 |
06-Jun-2008 |
alc |
Essentially, neither madvise(..., MADV_DONTNEED) nor madvise(..., MADV_FREE) work. (Moreover, I don't believe that they have ever worked as intended.) The explanation is fairly simple. Both MADV_DONTNEED and MADV_FREE perform vm_page_dontneed() on each page within the range given to madvise(). This function moves the page to the inactive queue. Specifically, if the page is clean, it is moved to the head of the inactive queue where it is first in line for processing by the page daemon. On the other hand, if it is dirty, it is placed at the tail. Let's further examine the case in which the page is clean. Recall that the page is at the head of the line for processing by the page daemon. The expectation of vm_page_dontneed()'s author was that the page would be transferred from the inactive queue to the cache queue by the page daemon. (Once the page is in the cache queue, it is, in effect, free, that is, it can be reallocated to a new vm object by vm_page_alloc() if it isn't reactivated quickly enough by a user of the old vm object.) The trouble is that nowhere in the execution of either MADV_DONTNEED or MADV_FREE is either the machine-independent reference flag (PG_REFERENCED) or the reference bit in any page table entry (PTE) mapping the page cleared. Consequently, the immediate reaction of the page daemon is to reactivate the page because it is referenced. In effect, the madvise() was for naught. The case in which the page was dirty is not too different. Instead of being laundered, the page is reactivated.
Note: The essential difference between MADV_DONTNEED and MADV_FREE is that MADV_FREE clears a page's dirty field. So, MADV_FREE is always executing the clean case above.
This revision changes vm_page_dontneed() to clear both the machine- independent reference flag (PG_REFERENCED) and the reference bit in all PTEs mapping the page.
MFC after: 6 weeks
|
179296 |
24-May-2008 |
alc |
To date, our implementation of munmap(2) has required that the entirety of the specified range be mapped. Specifically, it has returned EINVAL if the entire range is not mapped. There is not, however, any basis for this in either SUSv2 or our own man page. Moreover, neither Linux nor Solaris imposes this requirement. This revision removes the requirement.
Submitted by: Tijl Coosemans PR: 118510 MFC after: 6 weeks
|
179159 |
20-May-2008 |
ups |
Allow VM object creation in ufs_lookup (if vfs.vmiodirenable is set). Directory IO without a VM object will store data in 'malloced' buffers, severely limiting caching of the data. Without this change, VM objects for directories are only created on an open() of the directory. TODO: Inline a test for whether the VM object already exists to avoid locking/function call overhead.
Tested by: kris@ Reviewed by: jeff@ Reported by: David Filo
|
179081 |
18-May-2008 |
alc |
Retire pmap_addr_hint(). It is no longer used.
|
179076 |
17-May-2008 |
alc |
In order to map device memory using superpages, mmap(2) must find a superpage-aligned virtual address for the mapping. Revision 1.65 implemented an overly simplistic and generally ineffectual method for finding a superpage-aligned virtual address. Specifically, it rounds the virtual address corresponding to the end of the data segment up to the next superpage-aligned virtual address. If this virtual address is unallocated, then the device will be mapped using superpages. Unfortunately, in modern times, where applications like the X server dynamically load much of their code, this virtual address is already allocated. In such cases, mmap(2) simply uses the first available virtual address, which is not necessarily superpage aligned.
This revision changes mmap(2) to use a more robust method, specifically, the VMFS_ALIGNED_SPACE option that is now implemented by vm_map_find().
|
179074 |
17-May-2008 |
alc |
Preset a device object's alignment ("pg_color") based upon the physical address of the device's memory. This enables pmap_align_superpage() to propose a virtual address for mapping the device memory that permits the use of superpage mappings.
|
179019 |
15-May-2008 |
alc |
Don't call vm_reserv_alloc_page() on device-backed objects. Otherwise, the system may panic because there is no reservation structure corresponding to the physical address of the device memory.
Reported by: Giorgos Keramidas
|
178935 |
10-May-2008 |
alc |
Provide the new argument to kmem_suballoc().
|
178933 |
10-May-2008 |
alc |
Introduce a new parameter "superpage_align" to kmem_suballoc() that is used to request superpage alignment for the submap.
Request superpage alignment for the kmem_map.
Pass VMFS_ANY_SPACE instead of TRUE to vm_map_find(). (They are currently equivalent but VMFS_ANY_SPACE is the new preferred spelling.)
Remove a stale comment from kmem_malloc().
|
178928 |
10-May-2008 |
alc |
Generalize vm_map_find(9)'s parameter "find_space". Specifically, add support for VMFS_ALIGNED_SPACE, which requests the allocation of an address range best suited to superpages. The old options TRUE and FALSE are mapped to VMFS_ANY_SPACE and VMFS_NO_SPACE, so that there is no immediate need to update all of vm_map_find(9)'s callers.
While I'm here, correct a misstatement about vm_map_find(9)'s return values in the man page.
|
178875 |
09-May-2008 |
alc |
Introduce pmap_align_superpage(). It increases the starting virtual address of the given mapping if a different alignment might result in more superpage mappings.
|
178792 |
05-May-2008 |
kmacy |
add malloc flag to blist so that it can be used in ithread context
Reviewed by: alc, bsdimp
|
178637 |
28-Apr-2008 |
alc |
Eliminate pointless casts from kmem_suballoc().
|
178630 |
28-Apr-2008 |
alc |
vm_map_fixed(), unlike vm_map_find(), does not update "addr", so it can be passed by value.
|
178272 |
17-Apr-2008 |
jeff |
- Make SCHED_STATS more generic by adding a wrapper to create the variables and sysctl nodes. - In reset, walk the children of kern_sched_stats and reset the counters via the oid_arg1 pointer. This allows us to add arbitrary counters to the tree and still reset them properly. - Define a set of switch types to be passed with flags to mi_switch(). These types are named SWT_*. These types correspond to SCHED_STATS counters and are automatically handled in this way. - Make the new SWT_ types more specific than the older switch stats. There are now stats for idle switches, remote idle wakeups, remote preemption, ithreads idling, etc. - Add switch statistics for ULE's pickcpu algorithm. These stats include how much migration there is, how often affinity was successful, how often threads were migrated to the local cpu on wakeup, etc.
Sponsored by: Nokia
|
177956 |
06-Apr-2008 |
alc |
Introduce vm_reserv_reclaim_contig(). This function is used by contigmalloc(9) as a last resort to steal pages from an inactive, partially-used superpage reservation.
Rename vm_reserv_reclaim() to vm_reserv_reclaim_inactive() and refactor it so that a separate subroutine is responsible for breaking the selected reservation. This subroutine is also used by vm_reserv_reclaim_contig().
|
177932 |
05-Apr-2008 |
alc |
Eliminate an unnecessary test from vm_phys_unfree_page().
|
177922 |
04-Apr-2008 |
alc |
Update a comment to vm_map_pmap_enter().
|
177921 |
04-Apr-2008 |
alc |
Reintroduce UMA_SLAB_KMAP; however, change its spelling to UMA_SLAB_KERNEL for consistency with its sibling UMA_SLAB_KMEM. (UMA_SLAB_KMAP met its original demise in revision 1.30 of vm/uma_core.c.) UMA_SLAB_KERNEL is now required by the jumbo frame allocators. Without it, UMA cannot correctly return pages from the jumbo frame zones to the VM system because it resets the pages' object field to NULL instead of the kernel object. In more detail, the jumbo frame zones are created with the option UMA_ZONE_REFCNT. This causes UMA to overwrite the pages' object field with the address of the slab. However, when UMA wants to release these pages, it doesn't know how to restore the object field, so it sets it to NULL. This change teaches UMA how to reset the object field to the kernel object.
Crashes reported by: kris Fix tested by: kris Fix discussed with: jeff MFC after: 6 weeks
|
177762 |
30-Mar-2008 |
alc |
Eliminate an unnecessary printf() from kmem_suballoc(). The subsequent panic() can be extended to convey the same information.
|
177704 |
29-Mar-2008 |
jeff |
- Use vm_object_reference_locked() directly from vm_object_reference(). This is intended to get rid of vget() consumers who don't wish to acquire a lock. This is functionally the same as calling vref(). vm_object_reference_locked() already uses vref.
Discussed with: alc
|
177458 |
20-Mar-2008 |
kib |
Do not dereference cdev->si_cdevsw; use dev_refthread() to properly obtain the reference. In particular, this fixes the panic reported in the PR. Remove the comments stating that this needs to be done.
PR: kern/119422 MFC after: 1 week
|
177414 |
19-Mar-2008 |
alc |
Rename vm_pageq_requeue() to vm_page_requeue() on account of its recent migration to vm/vm_page.c.
|
177368 |
19-Mar-2008 |
jeff |
- Relax requirements for p_numthreads, p_threads, p_swtick, and p_nice from requiring the per-process spinlock to only requiring the process lock. - Reflect these changes in the proc.h documentation and consumers throughout the kernel. This is a substantial reduction in locking cost for these fields and was made possible by recent changes to threading support.
|
177342 |
18-Mar-2008 |
alc |
Almost seven years ago, vm/vm_page.c was split into three parts: vm/vm_contig.c, vm/vm_page.c, and vm/vm_pageq.c. Today, vm/vm_pageq.c has withered to the point that it contains only four short functions, two of which are only used by vm/vm_page.c. Since I can't foresee any reason for vm/vm_pageq.c to grow, it is time to fold the remaining contents of vm/vm_pageq.c back into vm/vm_page.c.
Add some comments. Rename one of the functions, vm_pageq_enqueue(), that is now static within vm/vm_page.c to vm_page_enqueue(). Eliminate PQ_MAXCOUNT as it no longer serves any purpose.
|
177261 |
16-Mar-2008 |
alc |
Simplify the inner loop of vm_fault()'s delete-behind heuristic. Instead of checking each page for PG_UNMANAGED, perform a one-time check whether the object is OBJT_PHYS. (PG_UNMANAGED pages only belong to OBJT_PHYS objects.)
|
177253 |
16-Mar-2008 |
rwatson |
In keeping with style(9)'s recommendations on macros, use a ';' after each SYSINIT() macro invocation. This makes a number of lightweight C parsers much happier with the FreeBSD kernel source, including cflow's prcc and lxr.
MFC after: 1 month Discussed with: imp, rink
|
177091 |
12-Mar-2008 |
jeff |
Remove kernel support for M:N threading.
While the KSE project was quite successful in bringing threading to FreeBSD, the M:N approach taken by the kse library was never developed to its full potential. Backwards compatibility will be provided via libmap.conf for dynamically linked binaries and static binaries will be broken.
|
177085 |
12-Mar-2008 |
jeff |
- Pass the priority argument from *sleep() into sleepq and down into sched_sleep(). This removes extra thread_lock() acquisition and allows the scheduler to decide what to do with the static boost. - Change the priority arguments to cv_* to match sleepq/msleep/etc. where 0 means no priority change. Catch -1 in cv_broadcastpri() and convert it to 0 for now. - Set a flag when sleeping in a way that is compatible with swapping since direct priority comparisons are meaningless now. - Add a sysctl to ule, kern.sched.static_boost, that defaults to on which controls the boost behavior. Turning it off gives better performance in some workloads but needs more investigation. - While we're modifying sleepq, change signal and broadcast to both return with the lock held as the lock was held on enter.
Reviewed by: jhb, peter
|
176967 |
09-Mar-2008 |
alc |
Eliminate an unnecessary test from vm_fault's delete-behind heuristic. Specifically, since the delete-behind heuristic is never applied to a device-backed object, there is no point in checking whether each of the object's pages is fictitious. (Only device-backed objects have fictitious pages.)
|
176717 |
01-Mar-2008 |
marcel |
Make the vm_pmap field of struct vmspace the last field in the structure. This allows per-CPU variations of struct pmap on a single architecture without affecting the machine-independent fields. As such, the PMAP variations don't affect the ABI. They become part of it.
|
176596 |
26-Feb-2008 |
alc |
Correct a long-standing error in vm_object_page_remove(). Specifically, pmap_remove_all() must not be called on fictitious pages. To date, fictitious pages have been allocated from zeroed memory, effectively hiding this problem because the fictitious pages appear to have an empty pv list. Submitted by: Kostik Belousov
Rewrite the comments describing vm_object_page_remove() to better describe what it does. Add an assertion. Reviewed by: Kostik Belousov
MFC after: 1 week
|
176526 |
24-Feb-2008 |
alc |
Correct a long-standing error in vm_object_deallocate(). Specifically, only anonymous default (OBJT_DEFAULT) and swap (OBJT_SWAP) objects should ever have OBJ_ONEMAPPING set. However, vm_object_deallocate() was setting it on device (OBJT_DEVICE) objects. As a result, vm_object_page_remove() could be called on a device object and if that occurred pmap_remove_all() would be called on the device object's pages. However, a device object's pages are fictitious, and fictitious pages do not have an initialized pv list (struct md_page).
To date, fictitious pages have been allocated from zeroed memory, effectively hiding this problem. Now, however, the conversion of rotting diagnostics to invariants in the amd64 and i386 pmaps has revealed the problem. Specifically, assertion failures have occurred during the initialization phase of the X server on some hardware.
MFC after: 1 week Discussed with: Kostik Belousov Reported by: Michiel Boland
|
175294 |
13-Jan-2008 |
attilio |
VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in conjunction with a 'thread' argument that is always curthread. Remove the useless extra argument and pass curthread explicitly to lower-layer functions, when necessary.
The KPI is broken by this change, which may affect several ports, so a version bump and manpage updates will be committed separately.
Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
|
175210 |
10-Jan-2008 |
pjd |
When one tries to allocate memory with the M_WAITOK flag and we are short of address space in the kmem map, invoke the vm_lowmem event in a loop and wait a bit for subsystems to reclaim some memory, which in turn will reclaim address space as well.
Note, this is a work-around.
Reviewed by: alc Approved by: alc MFC after: 3 days
|
175202 |
10-Jan-2008 |
attilio |
vn_lock() is currently only used with 'curthread' passed as the argument. Remove this argument and pass curthread directly to the underlying VOP_LOCK1() VFS method. This change makes the code cleaner and, in particular, removes an annoying dependency, helping the upcoming lockmgr() cleanup. The KPI is, obviously, changed.
Manpage and FreeBSD_version will be updated through further commits.
As a side note, it is worth mentioning that upcoming commits will address a similar cleanup of VFS methods, in particular vop_lock1 and vop_unlock.
Tested by: Diego Sardina <siarodx at gmail dot com>, Andrea Di Pasquale <whyx dot it at gmail dot com>
|
175164 |
08-Jan-2008 |
jhb |
Add a new file descriptor type for IPC shared memory objects and use it to implement shm_open(2) and shm_unlink(2) in the kernel: - Each shared memory file descriptor is associated with a swap-backed vm object which provides the backing store. Each descriptor starts off with a size of zero, but the size can be altered via ftruncate(2). The shared memory file descriptors also support fstat(2). read(2), write(2), ioctl(2), select(2), poll(2), and kevent(2) are not supported on shared memory file descriptors. - shm_open(2) and shm_unlink(2) are now implemented as system calls that manage shared memory file descriptors. The virtual namespace that maps pathnames to shared memory file descriptors is implemented as a hash table where the hash key is generated via the 32-bit Fowler/Noll/Vo hash of the pathname. - As an extension, the constant 'SHM_ANON' may be specified in place of the path argument to shm_open(2). In this case, an unnamed shared memory file descriptor will be created similar to the IPC_PRIVATE key for shmget(2). Note that the shared memory object can still be shared among processes by sharing the file descriptor via fork(2) or sendmsg(2), but it is unnamed. This effectively serves to implement the getmemfd() idea bandied about the lists several times over the years. - The backing store for a shared memory file descriptor is garbage collected when it is no longer referenced by any open file descriptor or the shm_open(2) virtual namespace.
Submitted by: dillon, peter (previous versions) Submitted by: rwatson (I based this on his version) Reviewed by: alc (suggested converting getmemfd() to shm_open())
|
175157 |
08-Jan-2008 |
csjp |
When MAC is enabled in the kernel, fix a panic triggered by a locking assertion hit in swapoff_one() when we unmount a swap partition. We should be using curthread where we used thread0 before. This change also replaces the thread argument with a credential argument, as the MAC framework only requires the cred.
It should be noted that this allows the machine to be rebooted without panicking with "cannot differ from curthread or NULL" when MAC is enabled.
Submitted by: rwatson Reviewed by: attilio MFC after: 2 weeks
|
175079 |
04-Jan-2008 |
kib |
In the vm_map_stack(), check for the specified stack region wraparound.
Reported and tested by: Peter Holm Reviewed by: alc MFC after: 3 days
|
175067 |
03-Jan-2008 |
alc |
Add an access type parameter to pmap_enter(). It will be used to implement superpage promotion.
Correct a style error in kmem_malloc(): pmap_enter()'s last parameter is a Boolean.
|
175055 |
02-Jan-2008 |
alc |
Defer setting either PG_CACHED or PG_FREE until after the free page queues lock is acquired. Otherwise, the state of a reservation's pages' flags and its population count can be inconsistent. That could result in a page being freed twice.
Reported by: kris
|
175041 |
01-Jan-2008 |
alc |
Correct a style error that was introduced in revision 1.77.
|
174982 |
29-Dec-2007 |
alc |
Add the superpage reservation system. This is "part 2 of 2" of the machine-independent support for superpages. (The earlier part was the rewrite of the physical memory allocator.) The remainder of the code required for superpages support is machine-dependent and will be added to the various pmap implementations at a later date.
Initially, I am only supporting one large page size per architecture. Moreover, I am only enabling the reservation system on amd64. (In an emergency, it can be disabled by setting VM_NRESERVLEVELS to 0 in amd64/include/vmparam.h or your kernel configuration file.)
|
174940 |
27-Dec-2007 |
alc |
Add a list of reservations to the vm object structure.
Recycle the vm object's "pg_color" field to represent the color of the first virtual page address at which the object is mapped instead of the color of the object's first physical page. Since an object may not be mapped, introduce a flag "OBJ_COLORED" that indicates whether "pg_color" is valid.
|
174939 |
27-Dec-2007 |
alc |
Add the superpage reservation type.
|
174825 |
21-Dec-2007 |
alc |
Update the comment describing vm_phys_unfree_page().
|
174821 |
20-Dec-2007 |
alc |
Modify vm_phys_unfree_page() so that it no longer requires the given page to be in the free lists. Instead, it now returns TRUE if it removed the page from the free lists and FALSE if the page was not in the free lists.
This change is required to support superpage reservations. Specifically, once reservations are introduced, a cached page can either be in the free lists or a reservation.
|
174799 |
19-Dec-2007 |
alc |
Correct one half of a loop continuation condition in vm_phys_unfree_page(). At present, this error is inconsequential; the other half of the loop continuation condition is sufficient to achieve correct execution.
|
174769 |
19-Dec-2007 |
alc |
Eliminate redundant code from vm_page_startup().
|
174543 |
11-Dec-2007 |
alc |
Simplify vm_page_free_toq().
|
174142 |
02-Dec-2007 |
alc |
Correct a comment.
|
174137 |
01-Dec-2007 |
rwatson |
Modify stack(9) stack_print() and stack_sbuf_print() routines to use new linker interfaces for looking up function names and offsets from instruction pointers. Create two variants of each call: one that is "DDB-safe" and avoids locking in the linker, and one that is safe for use in live kernels, by virtue of observing locking, and in particular safe when kernel modules are being loaded and unloaded simultaneously with their use. This will allow them to be used outside of debugging contexts.
Modify two of three current stack(9) consumers to use the DDB-safe interfaces, as they run in low-level debugging contexts, such as inside lockmgr(9) and the kernel memory allocator.
Update man page.
|
173918 |
25-Nov-2007 |
alc |
Make contigmalloc(9)'s page laundering more robust. Specifically, use vm_pageout_fallback_object_lock() in vm_contig_launder_page() to better handle a lock-ordering problem. Consequently, trylock's failure on the page's containing object no longer implies that the page cannot be laundered.
MFC after: 6 weeks
|
173901 |
25-Nov-2007 |
alc |
Tidy up: Add comments. Eliminate the pointless malloc_type_allocated(..., 0) calls that occur when contigmalloc() has failed. Eliminate the acquisition and release of the page queues lock from vm_page_release_contig(). Rename contigmalloc2() to contigmapping(), reflecting what it does.
|
173853 |
23-Nov-2007 |
alc |
Add a read/write sysctl for reconfiguring the maximum number of physical pages that can be wired.
Submitted by: Eugene Grosbein PR: 114654 MFC after: 6 weeks
|
173846 |
22-Nov-2007 |
alc |
Remove an unnecessary call to pmap_remove_all() and the associated "XXX" comments from vnode_pager_setsize(). This call was introduced in revision 1.140 to address a problem that no longer exists. Specifically, pmap_zero_page_area() has replaced a (possibly) problematic implementation of page zeroing that was based on vm_pager_map(), bzero(), and vm_pager_unmap().
|
173836 |
21-Nov-2007 |
alc |
When reactivating a cached page, reset the page's pool to the default pool. (Not doing this before was a performance pessimization but not a cause for panic.)
|
173708 |
17-Nov-2007 |
alc |
Prevent the leakage of wired pages in the following circumstances: First, a file is mmap(2)ed and then mlock(2)ed. Later, it is truncated. Under "normal" circumstances, i.e., when the file is not mlock(2)ed, the pages beyond the EOF are unmapped and freed. However, when the file is mlock(2)ed, the pages beyond the EOF are unmapped but not freed because they have a non-zero wire count. This can be a mistake. Specifically, it is a mistake if the sole reason why the pages are wired is because of wired, managed mappings. Previously, unmapping the pages destroyed these wired, managed mappings but did not reduce the pages' wire count. Consequently, when the file was unmapped, the pages were not unwired because the wired mapping had been destroyed. Moreover, when the vm object was finally destroyed, the pages were leaked because they were still wired. The fix is to reduce the pages' wire count by the number of wired, managed mappings destroyed. To do this, I introduce a new pmap function pmap_page_wired_mappings() that returns the number of managed mappings to the given physical page that are wired, and I use this function in vm_object_page_remove().
Reviewed by: tegge MFC after: 6 weeks
|
173429 |
07-Nov-2007 |
pjd |
Change unused 'user_wait' argument to 'timo' argument, which will be used to specify timeout for msleep(9).
Discussed with: alc Reviewed by: alc
|
173361 |
05-Nov-2007 |
kib |
Fix the panic("vm_thread_new: kstack allocation failed") and a silent NULL pointer dereference in the i386 and sparc64 pmap_pinit() when kmem_alloc_nofault() failed to allocate address space. Both functions now return an error instead of panicking or dereferencing NULL.
As a consequence, vmspace_exec() and vmspace_unshare() now return an errno int. A struct vmspace argument was added to vm_forkproc() to avoid dealing with a failed allocation when most of the fork1() job is already done.
The kernel stack for the thread is now set up in thread_alloc(), which itself may return NULL. Also, allocation of the first process thread is performed in fork1() to properly deal with stack allocation failure. proc_linkup() is separated into proc_linkup(), called from fork1(), and proc_linkup0(), which is used to set up the kernel process (formerly known as the swapper).
In collaboration with: Peter Holm Reviewed by: jhb
|
173357 |
05-Nov-2007 |
kib |
The intent of freeing the (zeroed) page in vm_page_cache() for a default object, rather than caching it, was to have vm_pager_has_page(object, pindex, ...) == FALSE imply that there is no cached page in the object at pindex. This makes it possible to avoid explicit checks for cached pages in vm_object_backing_scan().
For now, we need the same bandaid for the swap object; otherwise both vm_page_lookup() and the pager can report that there is no page at an offset while the page is stored in the cache. Also, this fixes another instance of the KASSERT("object type is incompatible") failure in vm_page_cache_transfer().
Reported and tested by: Peter Holm Reviewed by: alc MFC after: 3 days
|
173292 |
02-Nov-2007 |
maxim |
o Fix panic message: it's swap_pager_putpages() not swap_pager_getpages().
Submitted by: Mark Tinguely
|
173180 |
30-Oct-2007 |
remko |
Correct a copy-and-paste'o in phys_pager.c; we are talking about phys here, not about devices.
PR: 93755 Approved by: imp (mentor, implicit when re-assigning the ticket to me).
|
173049 |
27-Oct-2007 |
alc |
Change vm_page_cache_transfer() such that it does not transfer pages that would have an offset beyond the end of the target object. Such pages should remain in the source object.
MFC after: 3 days Diagnosed and reviewed by: Kostik Belousov Reported and tested by: Peter Holm
|
172930 |
24-Oct-2007 |
rwatson |
Merge first in a series of TrustedBSD MAC Framework KPI changes from Mac OS X Leopard--rationalize naming for entry points to the following general forms:
mac_<object>_<method/action> mac_<object>_check_<method/action>
The previous naming scheme was inconsistent and mostly reversed from the new scheme. Also, make object types more consistent and remove spaces from object types that contain multiple parts ("posix_sem" -> "posixsem") to make mechanical parsing easier. Introduce a new "netinet" object type for certain IPv4/IPv6-related methods. Also simplify, slightly, some entry point names.
All MAC policy modules will need to be recompiled, and modules not updated as part of this commit will need to be modified to conform to the new KPI.
Sponsored by: SPARTA (original patches against Mac OS X) Obtained from: TrustedBSD Project, Apple Computer
|
172875 |
22-Oct-2007 |
alc |
Correct an error of omission in the reimplementation of the page cache: vnode_pager_setsize() must handle the case where a file is truncated to a non-page-size-aligned boundary and there is a cached page underlying the new end of file.
Reported by: kris, tegge Tested by: kris MFC after: 3 days
|
172863 |
22-Oct-2007 |
alc |
Correct an error in vm_map_sync(), nee vm_map_clean(), that has existed since revision 1.1. Specifically, neither traversal of the vm map checks whether the end of the vm map has been reached. Consequently, the first traversal can wrap around and bogusly return an error.
This error has gone unnoticed for so long because no one had ever before tried msync(2)ing a region above the stack.
Reported by: peter MFC after: 1 week
|
172836 |
20-Oct-2007 |
julian |
Rename the kthread_xxx (e.g. kthread_create()) calls to kproc_xxx as they actually make whole processes. This makes way for us to add real kthread_create() and friends that actually make threads. It turns out that most of these calls end up being moved back to the thread version when it is added, but we need to make this cosmetic change first.
I'd LOVE to do this rename in 7.0 so that we can eventually MFC the new kthread_xxx() calls.
|
172780 |
18-Oct-2007 |
alc |
The previous revision, updating vm_object_page_remove() for the new page cache, did not account for the case where the vm object has nothing but cached pages.
Reported by: kris, tegge Reviewed by: tegge MFC after: 3 days
|
172779 |
18-Oct-2007 |
peter |
Fix cosmetic bug in stale copy of msync_args. 'len' is size_t, not int.
|
172700 |
16-Oct-2007 |
ru |
Fix CTL_VM_NAMES.
|
172545 |
11-Oct-2007 |
jhb |
Allow recursion on the 'zones' internal UMA zone.
Submitted by: thompsa MFC after: 1 week Approved by: re (kensmith) Discussed with: jeff
|
172475 |
08-Oct-2007 |
kib |
Do not dereference NULL pointer.
Reported by: Peter Holm Reviewed by: alc Approved by: re (kensmith)
|
172472 |
08-Oct-2007 |
alc |
In the rare case that vm_page_cache() actually frees the given page, it must first ensure that the page is no longer mapped. This is trivially accomplished by calling pmap_remove_all() a little earlier in vm_page_cache(). While I'm in the neighborhood, make a related panic message a little more useful.
Approved by: re (kensmith) Reported by: Peter Holm and Konstantin Belousov Reviewed by: Konstantin Belousov
|
172466 |
07-Oct-2007 |
alc |
Correct a lock assertion failure in sparc64's pmap_page_is_mapped() that is a consequence of sparc64/sparc64/vm_machdep.c revision 1.76. It occurs when uma_small_free() frees a page. The solution has two parts: (1) Mark pages allocated with VM_ALLOC_NOOBJ as PG_UNMANAGED. (2) Defer the lock assertion in pmap_page_is_mapped() until after PG_UNMANAGED is tested. This is safe because both PG_UNMANAGED and PG_FICTITIOUS are immutable flags, i.e., they do not change state between the time that a page is allocated and freed.
Approved by: re (kensmith) PR: 116794
|
172341 |
27-Sep-2007 |
alc |
Correct an error of omission in the reimplementation of the page cache: vm_object_page_remove() should convert any cached pages that fall within the specified range to free pages. Otherwise, there could be a problem if a file is first truncated and then regrown. Specifically, some old data from prior to the truncation might reappear.
Generalize vm_page_cache_free() to support the conversion of either a subset or the entirety of an object's cached pages.
Reported by: tegge Reviewed by: tegge Approved by: re (kensmith)
|
172322 |
25-Sep-2007 |
alc |
Correct an error in the previous revision, specifically, vm_object_madvise() should request that the reactivated, cached page not be busied.
Reported by: Rink Springer Approved by: re (kensmith)
|
172317 |
25-Sep-2007 |
alc |
Change the management of cached pages (PQ_CACHE) in two fundamental ways:
(1) Cached pages are no longer kept in the object's resident page splay tree and memq. Instead, they are kept in a separate per-object splay tree of cached pages. However, access to this new per-object splay tree is synchronized by the _free_ page queues lock, not to be confused with the heavily contended page queues lock. Consequently, a cached page can be reclaimed by vm_page_alloc(9) without acquiring the object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon. Specifically, they observed the page daemon consuming a great deal of CPU time because of pages bouncing back and forth between the cache queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of this problem turned out to be a deadlock avoidance strategy employed when selecting a cached page to reclaim in vm_page_select_cache(). However, the root cause was really that reclaiming a cached page required the acquisition of an object lock while the page queues lock was already held. Thus, this change addresses the problem at its root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and memq was, in effect, optimizing for the uncommon case. Cached pages are reclaimed far, far more often than they are reactivated. Instead, this change makes reclamation cheaper, especially in terms of synchronization overhead, and reactivation more expensive, because reactivated pages will have to be reentered into the object's primary splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical memory allocator's buddy queues, increasing the likelihood that large allocations of contiguous physical memory (i.e., superpages) will succeed.
Finally, as a result of this change long-standing restrictions on when and where a cached page can be reclaimed and returned by vm_page_alloc(9) are eliminated. Specifically, calls to vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and return a formerly cached page. Consequently, a call to malloc(9) specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@, Justin Husted @ Isilon, peter@, tegge@ Tested by: an earlier version by kris@ Approved by: re (kensmith)
|
172268 |
21-Sep-2007 |
jeff |
- Redefine p_swtime and td_slptime as p_swtick and td_slptick. This changes the units from seconds to the value of 'ticks' when swapped in/out. ULE does not have a periodic timer that scans all threads in the system and as such maintaining a per-second counter is difficult. - Change computations requiring the unit in seconds to subtract ticks and divide by hz. This does make the wraparound condition hz times more frequent but this is still in the range of several months to years and the adverse effects are minimal.
Approved by: re
|
172207 |
17-Sep-2007 |
jeff |
- Move all of the PS_ flags into either p_flag or td_flags. - p_sflag was mostly protected by PROC_LOCK rather than the PROC_SLOCK or previously the sched_lock. These bugs have existed for some time. - Allow swapout to try each thread in a process individually and then swapin the whole process if any of these fail. This allows us to move most scheduler-related swap flags into td_flags. - Keep ki_sflag for backwards compat but change all in-source tools to use the new and more correct location of P_INMEM.
Reported by: pho Reviewed by: attilio, kib Approved by: re (kensmith)
|
172188 |
15-Sep-2007 |
alc |
Correct an assertion in vm_pageout_flush(). Specifically, if a page's status after vm_pager_put_pages() is VM_PAGER_PEND, then it could have already been recycled, i.e., freed and reallocated to a new purpose; thus, asserting that such pages cannot be written is inappropriate.
Reported by: kris Submitted by: tegge Approved by: re (kensmith) MFC after: 1 week
|
171902 |
20-Aug-2007 |
kib |
Do not drop the vm_map lock between doing vm_map_remove() and vm_map_insert(). To this end, introduce vm_map_fixed(), which does so for the MAP_FIXED case.
Dropping the lock allowed a parallel thread to occupy the freed space.
Reported by: Tijl Coosemans <tijl ulyssis org> Reviewed by: alc Approved by: re (kensmith) MFC after: 2 weeks
|
171889 |
18-Aug-2007 |
kib |
Remove comment that is no longer quite true.
Noted by: alc Approved by: re (kensmith)
|
171887 |
18-Aug-2007 |
kib |
Fix the phys_pager in the way similar to the rev. 1.83 of the sys/vm/device_pager.c:
Protect the creation of the phys pager with a non-NULL handle with the phys_pager_mtx. Lookup of a phys pager in the pagers list by handle is now synchronized with its removal from the list, and phys_pager_mtx is put before the vm object lock in the lock order. Dispose of the phys_pager_alloc_lock and the tsleep calls, together with the acquisition of Giant, since phys_pager_mtx now covers the same block.
Reviewed by: alc Approved by: re (kensmith)
|
171779 |
07-Aug-2007 |
kib |
Protect the creation of the device pager with the dev_pager_mtx. Lookup of a device pager in the pagers list by handle is now synchronized with its removal from the list, and dev_pager_mtx is put before the vm object lock in the lock order. Dispose of the dev_pager_sx lock, since dev_pager_mtx now covers the same block.
Noted by: kensmith Reviewed by: alc Approved by: re (kensmith)
|
171737 |
05-Aug-2007 |
alc |
Consider a scenario in which one processor, call it Pt, is performing vm_object_terminate() on a device-backed object at the same time that another processor, call it Pa, is performing dev_pager_alloc() on the same device. The problem is that vm_pager_object_lookup() should not be allowed to return a doomed object, i.e., an object with OBJ_DEAD set, but it does. In detail, the unfortunate sequence of events is: Pt in vm_object_terminate() holds the doomed object's lock and sets OBJ_DEAD on the object. Pa in dev_pager_alloc() holds dev_pager_sx and calls vm_pager_object_lookup(), which returns the doomed object. Next, Pa calls vm_object_reference(), which requires the doomed object's lock, so Pa waits for Pt to release the doomed object's lock. Pt proceeds to the point in vm_object_terminate() where it releases the doomed object's lock. Pa is now able to complete vm_object_reference() because it can now complete the acquisition of the doomed object's lock. So, now the doomed object has a reference count of one! Pa releases dev_pager_sx and returns the doomed object from dev_pager_alloc(). Pt now acquires dev_pager_mtx, removes the doomed object from dev_pager_object_list, releases dev_pager_mtx, and finally calls uma_zfree with the doomed object. However, the doomed object is still in use by Pa.
Repeating my key point, vm_pager_object_lookup() must not return a doomed object. Moreover, the test for the object's state, i.e., doomed or not, and the increment of the object's reference count should be carried out atomically.
Reviewed by: kib Approved by: re (kensmith) MFC after: 3 weeks
|
171725 |
05-Aug-2007 |
kib |
Do not acquire Giant unconditionally around the calls to the cdevsw d_mmap methods. prep_cdevsw() already installs the shims that acquire/drop Giant for the methods of a driver that specified the D_NEEDGIANT flag.
Reviewed by: alc Approved by: re (kensmith)
|
171633 |
27-Jul-2007 |
alc |
Add a counter for the total number of pages cached and support for reporting the value of this counter in the program "vmstat".
Approved by: re (rwatson)
|
171599 |
26-Jul-2007 |
pjd |
When we do an open, we should lock the vnode exclusively. This fixes a few races: - the fifo race, where two threads assign v_fifoinfo, - v_writecount modifications, - v_object modifications, - and probably more...
Discussed with: kib, ups Approved by: re (rwatson)
|
171514 |
20-Jul-2007 |
alc |
Two changes to vm_fault_additional_pages():
1. Rewrite the backward scan. Specifically, reverse the order in which pages are allocated so that upon failure it is never necessary to free pages that were just allocated. Moreover, any allocated pages can be put to use. This makes the backward scan behave just like the forward scan.
2. Eliminate an explicit, unsynchronized check for low memory before calling vm_page_alloc(). It serves no useful purpose. It is, in effect, optimizing the uncommon case at the expense of the common case.
Approved by: re (hrs) MFC after: 3 weeks
|
171451 |
14-Jul-2007 |
alc |
Eliminate two unused functions: vm_phys_alloc_pages() and vm_phys_free_pages(). Rename vm_phys_alloc_pages_locked() to vm_phys_alloc_pages() and vm_phys_free_pages_locked() to vm_phys_free_pages(). Add comments regarding the need for the free page queues lock to be held by callers to these functions. No functional changes.
Approved by: re (hrs)
|
171445 |
14-Jul-2007 |
alc |
Eliminate dead code, specifically, an unused sysctl: "vm.idlezero_maxrun".
Approved by: re (hrs)
|
171420 |
13-Jul-2007 |
alc |
Update a comment describing the page queues.
Approved by: re (hrs)
|
171417 |
12-Jul-2007 |
alc |
Eliminate dead code.
Approved by: re (hrs)
|
171347 |
10-Jul-2007 |
alc |
Correct a problem in the ZERO_COPY_SOCKETS option, specifically, in vm_page_cowfault(). Initially, if vm_page_cowfault() sleeps, the given page is wired, preventing it from being recycled. However, when transmission of the page completes, the page is unwired and returned to the page queues. At that point, the page is not in any special state that prevents it from being recycled. Consequently, vm_page_cowfault() should verify that the page is still held by the same vm object before retrying the replacement of the page. Note: The containing object is, however, safe from being recycled by virtue of having a non-zero paging-in-progress count.
While I'm here, add some assertions and comments.
Approved by: re (rwatson) MFC After: 3 weeks
|
171310 |
08-Jul-2007 |
alc |
Eliminate the special case handling of OBJT_DEVICE objects in vm_fault_additional_pages() that was introduced in revision 1.47. Then as now, it is unnecessary because dev_pager_haspage() returns zero for both the number of pages to read ahead and read behind, producing the same exact behavior by vm_fault_additional_pages() as the special case handling.
Approved by: re (rwatson)
|
171288 |
06-Jul-2007 |
alc |
When a cached page is reactivated in vm_fault(), update the counter that tracks the total number of reactivated pages. (We have not been counting reactivations by vm_fault() since revision 1.46.)
Correct a comment in vm_fault_additional_pages().
Approved by: re (kensmith) MFC after: 1 week
|
171212 |
04-Jul-2007 |
peter |
Add freebsd6_ wrappers for mmap/lseek/pread/pwrite/truncate/ftruncate
Approved by: re (kensmith)
|
171150 |
02-Jul-2007 |
alc |
In the previous revision, when I replaced the unconditional acquisition of Giant in vm_pageout_scan() with VFS_LOCK_GIANT(), I had to eliminate the acquisition of the vnode interlock before releasing the vm object's lock because the vnode interlock cannot be held when VFS_LOCK_GIANT() is performed. Unfortunately, this allows the vnode to be recycled between the release of the vm object's lock and the vget() on the vnode.
In this revision, I prevent the vnode from being recycled by acquiring another reference to the vm object and underlying vnode before releasing the vm object's lock.
This change also addresses another preexisting but trivial problem. By acquiring another reference to the vm object, I also prevent the vm object from being recycled. Previously, the "vnodes skipped" counter could be wrong if it examined a recycled vm object.
Reported by: kib Reviewed by: kib Approved by: re (kensmith) MFC after: 3 weeks
|
171048 |
26-Jun-2007 |
alc |
Eliminate the use of Giant from vm_daemon(). Replace the unconditional use of Giant in vm_pageout_scan() with VFS_LOCK_GIANT().
Approved by: re (kensmith) MFC after: 3 weeks
|
171019 |
24-Jun-2007 |
alc |
Eliminate GIANT_REQUIRED from swap_pager_putpages().
Approved by: re (mux) MFC after: 1 week
|
170905 |
18-Jun-2007 |
alc |
Eliminate unnecessary checks from vm_pageout_clean(): The page that is passed to vm_pageout_clean() cannot possibly be PG_UNMANAGED because it came from the inactive queue and PG_UNMANAGED pages are not in any page queue. Moreover, PG_UNMANAGED pages only exist in OBJT_PHYS objects, and all pages within a OBJT_PHYS object are PG_UNMANAGED. So, if the page that is passed to vm_pageout_clean() is not PG_UNMANAGED, then it cannot be from an OBJT_PHYS object and its neighbors from the same object cannot themselves be PG_UNMANAGED.
Reviewed by: tegge
|
170865 |
17-Jun-2007 |
mjacob |
Don't declare inline a function which isn't.
|
170864 |
17-Jun-2007 |
mjacob |
Make sure 'object' is initialized to NULL; there is a possible case where control could fall through to it being used without being set. Put a break in the default case.
|
170863 |
17-Jun-2007 |
mjacob |
Initialize reqpage to zero.
|
170836 |
16-Jun-2007 |
alc |
If attempting to cache a "busy" page, panic instead of printing a diagnostic message and returning.
|
170818 |
16-Jun-2007 |
alc |
Update a comment.
|
170816 |
16-Jun-2007 |
alc |
Enable the new physical memory allocator.
This allocator uses a binary buddy system with a twist. First and foremost, this allocator is required to support the implementation of superpages. As a side effect, it enables a more robust implementation of contigmalloc(9). Moreover, this reimplementation of contigmalloc(9) eliminates the acquisition of Giant by contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB misses incurred by accesses through a direct map to small, UMA-managed objects and page table pages. Roughly speaking, the physical pages that are allocated for such purposes are clustered together in the physical address space. The performance benefits vary. In the most extreme case, a uniprocessor kernel running on an Opteron, I measured an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that superpages have much the same effect. The contiguous physical memory allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively support prezeroed pages. I hope this is temporary. On i386, this is a slight pessimization. However, on amd64, the beneficial effects of the direct-map optimization outweigh the ill effects. I speculate that this is true in general of machines with a direct map.
Approved by: re
|
170658 |
13-Jun-2007 |
alc |
Eliminate dead code: We have not performed pageouts on the kernel object in this millennium.
|
170529 |
11-Jun-2007 |
alc |
Conditionally acquire Giant in vm_contig_launder_page().
|
170517 |
10-Jun-2007 |
attilio |
Optimize vmmeter locking. In particular: - Add an explanatory table for locking of struct vmmeter members - Apply new rules to some of those members - Remove some unhelpful comments
Heavily reviewed by: alc, bde, jeff Approved by: jeff (mentor)
|
170477 |
10-Jun-2007 |
alc |
Add a new physical memory allocator. However, do not yet connect it to the build.
This allocator uses a binary buddy system with a twist. First and foremost, this allocator is required to support the implementation of superpages. As a side effect, it enables a more robust implementation of contigmalloc(9). Moreover, this reimplementation of contigmalloc(9) eliminates the acquisition of Giant by contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB misses incurred by accesses through a direct map to small, UMA-managed objects and page table pages. Roughly speaking, the physical pages that are allocated for such purposes are clustered together in the physical address space. The performance benefits vary. In the most extreme case, a uniprocessor kernel running on an Opteron, I measured an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that superpages have much the same effect. The contiguous physical memory allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively support prezeroed pages. I hope this is temporary. On i386, this is a slight pessimization. However, on amd64, the beneficial effects of the direct-map optimization outweigh the ill effects. I speculate that this is true in general of machines with a direct map.
Approved by: re
|
170307 |
05-Jun-2007 |
jeff |
Commit 14/14 of sched_lock decomposition. - Use thread_lock() rather than sched_lock for per-thread scheduling synchronization. - Use the per-process spinlock rather than the sched_lock for per-process scheduling synchronization.
Tested by: kris, current@ Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc. Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
|
170292 |
04-Jun-2007 |
attilio |
Do proper "locking" for the remaining vmmeter members. We no longer assume sched_lock protection for some of them and instead use the distributed-loads method for vmmeter (distributed across CPUs).
Reviewed by: alc, bde Approved by: jeff (mentor)
|
170291 |
04-Jun-2007 |
attilio |
Rework the PCPU_* (MD) interface: - Rename PCPU_LAZY_INC into PCPU_INC - Add the PCPU_ADD interface which just does an add on the pcpu member given a specific value.
Note that for most architectures PCPU_INC and PCPU_ADD are not safe. This is a point that needs some discussion/work in the coming days.
Reviewed by: alc, bde Approved by: jeff (mentor)
|
170174 |
01-Jun-2007 |
jeff |
- Move rusage from being per-process in struct pstats to per-thread in td_ru. This removes the requirement for per-process synchronization in statclock() and mi_switch(). This was previously supported by sched_lock which is going away. All modifications to rusage are now done in the context of the owning thread. Reads proceed without locks. - Aggregate exiting threads' rusage in thread_exit() such that the exiting thread's rusage is not lost. - Provide a new routine, rufetch(), to fetch an aggregate of all rusage structures from all threads in a process. This routine must be used in any place requiring a rusage from a process prior to its exit. The exited process's rusage is still available via p_ru. - Aggregate tick statistics only on demand via rufetch() or when a thread exits. Tick statistics are kept in the thread and protected by sched_lock until it exits.
Initial patch by: attilio Reviewed by: attilio, bde (some objections), arch (mostly silent)
|
170170 |
31-May-2007 |
attilio |
Revert VMCNT_* operations introduction. Probably a general approach is not the best solution here, so we should solve the sched_lock protection problems separately.
Requested by: alc Approved by: jeff (mentor)
|
170152 |
31-May-2007 |
kib |
Revert UF_OPENING workaround for CURRENT. Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation argument from being file descriptor index into the pointer to struct file.
Proposed and reviewed by: jhb Reviewed by: daichi (unionfs) Approved by: re (kensmith)
|
170149 |
31-May-2007 |
attilio |
Add functions sx_xlock_sig() and sx_slock_sig(). These functions are intended to perform the same actions as sx_xlock() and sx_slock(), but with the difference that they perform an interruptible sleep, so that the sleep can be interrupted by external events. In order to support these new features, some code restructuring is needed, but the external API won't be affected at all.
Note: use a "void" cast for "int"-returning functions in order to keep tools like Coverity from complaining.
Requested by: rwatson Tested by: rwatson Reviewed by: jhb Approved by: jeff (mentor)
|
169849 |
22-May-2007 |
alc |
Eliminate the reactivation of cached pages in vm_fault_prefault() and vm_map_pmap_enter() unless the caller is madvise(MADV_WILLNEED). With the exception of calls to vm_map_pmap_enter() from madvise(MADV_WILLNEED), vm_fault_prefault() and vm_map_pmap_enter() are both used to create speculative mappings. Thus, always reactivating cached pages is a mistake. In principle, cached pages should only be reactivated by an actual access. Otherwise, the following misbehavior can occur. On a hard fault for a text page the clustering algorithm fetches not only the required page but also several of the adjacent pages. Now, suppose that one or more of the adjacent pages are never accessed. Ultimately, these unused pages become cached pages through the efforts of the page daemon. However, the next activation of the executable reactivates and maps these unused pages. Consequently, they are never replaced. In effect, they become pinned in memory.
|
169805 |
20-May-2007 |
jeff |
- rename VMCNT_DEC to VMCNT_SUB to reflect the count argument.
Suggested by: julian@ Contributed by: attilio@
|
169667 |
18-May-2007 |
jeff |
- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating vmcnts. This can be used to abstract away pcpu details but also changes to use atomics for all counters now. This means sched lock is no longer responsible for protecting counts in the switch routines.
Contributed by: Attilio Rao <attilio@FreeBSD.org>
|
169431 |
09-May-2007 |
rwatson |
Update stale comment on protecting UMA per-CPU caches: we now use critical sections rather than mutexes.
|
169291 |
05-May-2007 |
alc |
Define every architecture as either VM_PHYSSEG_DENSE or VM_PHYSSEG_SPARSE depending on whether the physical address space is densely or sparsely populated with memory. The effect of this definition is to determine which of two implementations of vm_page_array and PHYS_TO_VM_PAGE() is used. The legacy implementation is obtained by defining VM_PHYSSEG_DENSE, and a new implementation that trades off time for space is obtained by defining VM_PHYSSEG_SPARSE. For now, all architectures except for ia64 and sparc64 define VM_PHYSSEG_DENSE. Defining VM_PHYSSEG_SPARSE on ia64 allows the entirety of my Itanium 2's memory to be used. Previously, only the first 1 GB could be used. Defining VM_PHYSSEG_SPARSE on sparc64 allows USIIIi-based systems to boot without crashing.
This change is a combination of Nathan Whitehorn's patch and my own work in perforce.
Discussed with: kmacy, marius, Nathan Whitehorn PR: 112194
|
169048 |
26-Apr-2007 |
alc |
Remove some code from vmspace_fork() that became redundant after revision 1.334 modified _vm_map_init() to initialize the new vm map's flags to zero.
|
168979 |
23-Apr-2007 |
rwatson |
Audit pathnames looked up in swapon(2) and swapoff(2).
MFC after: 2 weeks Obtained from: TrustedBSD Project
|
168852 |
19-Apr-2007 |
alc |
Correct contigmalloc2()'s implementation of M_ZERO. Specifically, contigmalloc2() was always testing the first physical page for PG_ZERO, not the current page of interest.
Submitted by: Michael Plass PR: 81301 MFC after: 1 week
|
168851 |
19-Apr-2007 |
alc |
Correct two comments.
Submitted by: Michael Plass
|
168581 |
10-Apr-2007 |
keramida |
Minor typo fix, noticed while I was going through *_pager.c files.
|
168395 |
05-Apr-2007 |
pjd |
When KVA is exhausted, try the vm_lowmem event for the last time before panicing. This helps a lot in ZFS stability.
|
168394 |
05-Apr-2007 |
pjd |
Fix a problem for file systems that don't implement VOP_BMAP() operation.
The problem is this: vm_fault_additional_pages() calls vm_pager_has_page(), which calls vnode_pager_haspage(). Now when VOP_BMAP() returns an error (eg. EOPNOTSUPP), vnode_pager_haspage() returns TRUE without initializing 'before' and 'after' arguments, so we have some accidental values there. This basically was causing this condition to be met:
	if ((rahead + rbehind) >
	    ((cnt.v_free_count + cnt.v_cache_count) - cnt.v_free_reserved)) {
		pagedaemon_wakeup();
		[...]
	}
(we have some random values in rahead and rbehind variables)
I'm not entirely sure this is the right fix, maybe we should just return FALSE in vnode_pager_haspage() when VOP_BMAP() fails?
alc@ knows about this problem, maybe he will be able to come up with a better fix if this is not the right one.
|
167939 |
27-Mar-2007 |
alc |
Prevent a race between vm_object_collapse() and vm_object_split() from causing a crash.
Suppose that we have two objects, obj and backing_obj, where backing_obj is obj's backing object. Further, suppose that backing_obj has a reference count of two. One being the reference held by obj and the other by a map entry. Now, suppose that the map entry is deallocated and its reference removed by vm_object_deallocate(). vm_object_deallocate() recognizes that the only remaining reference is from a shadow object, obj, and calls vm_object_collapse() on obj. vm_object_collapse() executes
	if (backing_object->ref_count == 1) {
		/*
		 * If there is exactly one reference to the backing
		 * object, we can collapse it into the parent.
		 */
		vm_object_backing_scan(object, OBSC_COLLAPSE_WAIT);
vm_object_backing_scan(OBSC_COLLAPSE_WAIT) executes
	if (op & OBSC_COLLAPSE_WAIT) {
		vm_object_set_flag(backing_object, OBJ_DEAD);
	}
Finally, suppose that either vm_object_backing_scan() or vm_object_collapse() sleeps releasing its locks. At this instant, another thread executes vm_object_split(). It crashes in vm_object_reference_locked() on the assertion that the object is not dead. If, however, assertions are not enabled, it crashes much later, after the object has been recycled, in vm_object_deallocate() because the shadow count and shadow list are inconsistent.
Reviewed by: tegge Reported by: jhb MFC after: 1 week
|
167880 |
25-Mar-2007 |
alc |
Two small changes to vm_map_pmap_enter():
1) Eliminate an unnecessary check for fictitious pages. Specifically, only device-backed objects contain fictitious pages and the object is not device-backed.
2) Change the types of "psize" and "tmpidx" to vm_pindex_t in order to prevent possible wrap around with extremely large maps and objects, respectively. Observed by: tegge (last summer)
|
167829 |
23-Mar-2007 |
alc |
vm_page_busy() no longer requires the page queues lock to be held. Reduce the scope of the page queues lock in vm_fault() accordingly.
|
167795 |
22-Mar-2007 |
alc |
Change the order of lock reacquisition in vm_object_split() in order to simplify the code slightly. Add a comment concerning lock ordering.
|
167243 |
05-Mar-2007 |
alc |
Use PCPU_LAZY_INC() to update page fault statistics.
|
167091 |
27-Feb-2007 |
jhb |
Use pause() in vm_object_deallocate() to yield the CPU to the lock holder rather than a tsleep() on &proc0. The only wakeup on &proc0 is intended to awaken the swapper, not random threads blocked in vm_object_deallocate().
|
167086 |
27-Feb-2007 |
jhb |
Use pause() rather than tsleep() on stack variables and function pointers.
|
166964 |
25-Feb-2007 |
alc |
Change the way that unmanaged pages are created. Specifically, immediately flag any page that is allocated to a OBJT_PHYS object as unmanaged in vm_page_alloc() rather than waiting for a later call to vm_page_unmanage(). This allows for the elimination of some uses of the page queues lock.
Change the type of the kernel and kmem objects from OBJT_DEFAULT to OBJT_PHYS. This allows us to take advantage of the above change to simplify the allocation of unmanaged pages in kmem_alloc() and kmem_malloc().
Remove vm_page_unmanage(). It is no longer used.
|
166882 |
22-Feb-2007 |
alc |
Change the page's CLEANCHK flag from being a page queue mutex synchronized flag to a vm object mutex synchronized flag.
|
166808 |
18-Feb-2007 |
alc |
Enable vm_page_free() and vm_page_free_zero() to be called on some pages without the page queues lock being held, specifically, pages that are not contained in a vm object and not a member of a page queue.
|
166805 |
17-Feb-2007 |
alc |
Remove a stale comment. Add punctuation to a nearby comment.
|
166736 |
15-Feb-2007 |
alc |
Relax the page queue lock assertions in vm_page_remove() and vm_page_free_toq() to account for recent changes that allow vm_page_free_toq() to be called on some pages without the page queues lock being held, specifically, pages that are not contained in a vm object and not a member of a page queue. (Examples of such pages include page table pages, pv entry pages, and uma small alloc pages.)
|
166699 |
14-Feb-2007 |
alc |
Avoid the unnecessary acquisition of the free page queues lock when a page is actually being added to the hold queue, not the free queue. At the same time, avoid unnecessary tests to wake up threads waiting for free memory and the idle thread that zeroes free pages. (These tests will be performed later when the page finally moves from the hold queue to the free queue.)
|
166654 |
11-Feb-2007 |
rwatson |
Add uma_set_align() interface, which will be called at most once during boot by MD code to indicate a detected alignment preference. Rather than cache alignment being encoded in UMA consumers by defining a global alignment value of (16 - 1) in UMA_ALIGN_CACHE, UMA_ALIGN_CACHE is now a special value (-1) that causes UMA to look at the registered alignment. If no preferred alignment has been selected by MD code, a default alignment of (16 - 1) will be used.
Currently, no hardware platforms specify alignment; architecture maintainers will need to modify MD startup code to specify an alignment if desired. This must occur before initialization of UMA so that all UMA zones pick up the requested alignment.
Reviewed by: jeff, alc Submitted by: attilio
|
166637 |
11-Feb-2007 |
alc |
Use the free page queue mutex instead of the page queue mutex to synchronize sleeping and waking of the zero idle thread.
|
166550 |
07-Feb-2007 |
jhb |
- Move 'struct swdevt' back into swap_pager.h and expose it to userland. - Restore support for fetching swap information from crash dumps via kvm_get_swapinfo(3) to fix pstat -T/-s on crash dumps.
Reviewed by: arch@, phk MFC after: 1 week
|
166544 |
07-Feb-2007 |
alc |
Change the pagedaemon, vm_wait(), and vm_waitpfault() to sleep on the vm page queue free mutex instead of the vm page queue mutex.
|
166508 |
05-Feb-2007 |
alc |
Change the free page queue lock from a spin mutex to a default (blocking) mutex. With the demise of Alpha support, there is no longer a reason for it to be a spin mutex.
|
166213 |
25-Jan-2007 |
mohans |
Fix for problems that occur when all mbuf clusters migrate to the mbuf packet zone. Cluster allocations fail when this happens. Also processes that may have blocked on cluster allocations will never be woken up. Thanks to rwatson for an overview of the issue and pointers to the mbuma paper and his tool to dump out UMA zones.
Reviewed by: andre@
|
166211 |
24-Jan-2007 |
mohans |
Fix for a bug where only one process (of multiple) blocked on maxpages on a zone is woken up, with the rest never being woken up as a result of the ZFLAG_FULL flag being cleared. Wake up all such blocked processes instead. This change introduces a thundering herd, but since this should be relatively infrequent, optimizing this (by introducing a count of blocked processes, for example) may be premature.
Reviewed by: ups@
|
166188 |
23-Jan-2007 |
jeff |
- Remove setrunqueue and replace it with direct calls to sched_add(). setrunqueue() was mostly empty. The few asserts and thread state setting were moved to the individual schedulers. sched_add() was chosen to displace it for naming consistency reasons. - Remove adjustrunqueue, it was 4 lines of code that was ifdef'd to be different on all three schedulers where it was only called in one place each. - Remove the long ifdef'd out remrunqueue code. - Remove the now redundant ts_state. Inspect the thread state directly. - Don't set TSF_* flags from kern_switch.c, we were only doing this to support a feature in one scheduler. - Change sched_choose() to return a thread rather than a td_sched. Also, rely on the schedulers to return the idlethread. This simplifies the logic in choosethread(). Aside from the run queue links kern_switch.c mostly does not care about the contents of td_sched.
Discussed with: julian
- Move the idle thread loop into the per scheduler area. ULE wants to do something different from the other schedulers.
Suggested by: jhb
Tested on: x86/amd64 sched_{4BSD, ULE, CORE}.
|
166074 |
17-Jan-2007 |
delphij |
Use FOREACH_PROC_IN_SYSTEM instead of using its unrolled form.
|
165928 |
10-Jan-2007 |
rwatson |
Remove uma_zalloc_arg() hack, which coerced M_WAITOK to M_NOWAIT when allocations were made using improper flags in interrupt context. Replace with a simple WITNESS warning call. This restores the invariant that M_WAITOK allocations will always succeed or die horribly trying, which is relied on by many UMA consumers.
MFC after: 3 weeks Discussed with: jhb
|
165854 |
07-Jan-2007 |
alc |
Declare the map entry created by kmem_init() for the range from VM_MIN_KERNEL_ADDRESS to the end of the kernel's bootstrap data as MAP_NOFAULT.
|
165809 |
05-Jan-2007 |
jhb |
- Add a new function uma_zone_exhausted() to see if a zone is full. - Add a printf in swp_pager_meta_build() to warn if the swapzone becomes exhausted so that there's at least a warning before a box that runs out of swapzone space before running out of swap space deadlocks.
MFC after: 1 week Reviewed by: alc
|
165309 |
17-Dec-2006 |
alc |
Optimize vm_object_split(). Specifically, make the number of iterations equal to the number of physical pages that are renamed to the new object rather than the new object's virtual size.
|
165278 |
16-Dec-2006 |
alc |
Simplify the computation of the new object's size in vm_object_split().
|
165007 |
08-Dec-2006 |
kmacy |
Remove the requirement that phys_avail be sorted in ascending order by explicitly finding the lowest and highest addresses when calculating the size of the vm_pages array
Reviewed by: alc
|
164936 |
06-Dec-2006 |
julian |
Threading cleanup: part 2 of several.
Make part of John Birrell's KSE patch permanent.. Specifically, remove: Any reference of the ksegrp structure. This feature was never fully utilised and made things overly complicated. All code in the scheduler that tried to make threaded programs fair to unthreaded programs. Libpthread processes will already do this to some extent and libthr processes already disable it.
Also: Since this makes such a big change to the scheduler(s), take the opportunity to rename some structures and elements that had to be moved anyhow. This makes the code a lot more readable.
The ULE scheduler compiles again but I have no idea if it works.
The 4bsd scheduler still requires a little cleaning and some functions that now do ALMOST nothing will go away, but I thought I'd do that as a separate commit.
Tested by David Xu, and Dan Eischen using libthr and libpthread.
|
164446 |
20-Nov-2006 |
ru |
The clean_map has been made local to vm_init.c long ago.
|
164437 |
20-Nov-2006 |
ru |
Remove a redundant pointer-type variable.
|
164429 |
20-Nov-2006 |
ru |
When counting vm totals, skip unreferenced objects, including vnodes representing mounted file systems.
Reviewed by: alc MFC after: 3 days
|
164234 |
13-Nov-2006 |
alc |
There is no point in setting PG_REFERENCED on kmem_object pages because they are "unmanaged", i.e., non-pageable, pages.
Remove a stale comment.
|
164229 |
12-Nov-2006 |
alc |
Make pmap_enter() responsible for setting PG_WRITEABLE instead of its caller. (As a beneficial side-effect, a high-contention acquisition of the page queues lock in vm_fault() is eliminated.)
|
164101 |
08-Nov-2006 |
alc |
I misplaced the assertion that was added to vm_page_startup() in the previous change. Correct its placement.
|
164100 |
08-Nov-2006 |
alc |
Simplify the construction of the free queues in vm_page_startup(). Add an assertion to test a hypothesis concerning other redundant computation in vm_page_startup().
|
164089 |
08-Nov-2006 |
alc |
Ensure that the page's oflags field is initialized by contigmalloc().
|
164033 |
06-Nov-2006 |
rwatson |
Sweep kernel replacing suser(9) calls with priv(9) calls, assigning specific privilege names to a broad range of privileges. These may require some future tweaking.
Sponsored by: nCircle Network Security, Inc. Obtained from: TrustedBSD Project Discussed on: arch@ Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri, Alex Lyashkov <umka at sevcity dot net>, Skip Ford <skip dot ford at verizon dot net>, Antoine Brodin <antoine dot brodin at laposte dot net>
|
163709 |
26-Oct-2006 |
jb |
Make KSE a kernel option, turned on by default in all GENERIC kernel configs except sun4v (which doesn't process signals properly with KSE).
Reviewed by: davidxu@
|
163702 |
26-Oct-2006 |
rwatson |
Better align output of "show uma" by moving from displaying the basic counters of allocs/frees/use for each zone to the same statistics shown by userspace "vmstat -z".
MFC after: 3 days
|
163622 |
23-Oct-2006 |
alc |
The page queues lock is no longer required by vm_page_wakeup().
|
163614 |
22-Oct-2006 |
alc |
The page queues lock is no longer required by vm_page_busy() or vm_page_wakeup(). Reduce or eliminate its use accordingly.
|
163606 |
22-Oct-2006 |
rwatson |
Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now contains the userspace and user<->kernel API and definitions, with all in-kernel interfaces moved to mac_framework.h, which is now included across most of the kernel instead.
This change is the first step in a larger cleanup and sweep of MAC Framework interfaces in the kernel, and will not be MFC'd.
Obtained from: TrustedBSD Project Sponsored by: SPARTA
|
163604 |
22-Oct-2006 |
alc |
Replace PG_BUSY with VPO_BUSY. In other words, changes to the page's busy flag, i.e., VPO_BUSY, are now synchronized by the per-vm object lock instead of the global page queues lock.
|
163594 |
21-Oct-2006 |
alc |
Eliminate unnecessary PG_BUSY tests. They originally served a purpose that is now handled by vm object locking.
|
163361 |
14-Oct-2006 |
alc |
Long ago, revision 1.22 of vm/vm_pager.h introduced a bug. Specifically, it introduced a check after the call to file system's get pages method that assumes that the get pages method does not change the array of pages that is passed to it. In the case of vnode_pager_generic_getpages(), this assumption has been incorrect. The contents of the array of pages may be shifted by vnode_pager_generic_getpages(). Likely, the problem has been hidden by vnode_pager_haspage() limiting the set of pages that are passed to vnode_pager_generic_getpages() such that a shift never occurs.
The fix implemented herein is to adjust the pointer to the array of pages rather than shifting the pages within the array.
MFC after: 3 weeks Fix suggested by: tegge
|
163359 |
14-Oct-2006 |
alc |
Change vnode_pager_addr() such that on returning it distinguishes between an error returned by VOP_BMAP() and a hole in the file.
Change the callers to vnode_pager_addr() such that they return VM_PAGER_ERROR when VOP_BMAP fails instead of a zero-filled page.
Reviewed by: tegge MFC after: 3 weeks
|
163259 |
12-Oct-2006 |
kmacy |
sun4v requires TSBs (translation storage buffers) to be contiguous and size-aligned, requiring heavy usage of vm_page_alloc_contig
This change makes vm_page_alloc_contig SMP safe
Approved by: scottl (acting as backup for mentor rwatson)
|
163210 |
10-Oct-2006 |
alc |
Distinguish between two distinct kinds of errors from VOP_BMAP() in vnode_pager_generic_getpages(): (1) that VOP_BMAP() is unsupported by the underlying file system and (2) an error in performing the VOP_BMAP(). Previously, vnode_pager_generic_getpages() assumed that all errors were of the first type. If, in fact, the error was of the second type, the likely outcome was for the process to become permanently blocked on a busy page.
MFC after: 3 weeks Reviewed by: tegge
|
163140 |
08-Oct-2006 |
alc |
Change vnode_pager_generic_getpages() so that it does not panic if the given file is sparse. Instead, it zeroes the requested page.
Reviewed by: tegge PR: kern/98116 MFC after: 3 days
|
162750 |
29-Sep-2006 |
kensmith |
Fix two minor style(9) nits in v1.313 which were noticed during an MFC review. alc@ will be MFCing V1.313 plus style fix to RELENG_6.
|
161968 |
03-Sep-2006 |
alc |
Make vm_page_release_contig() static.
|
161674 |
27-Aug-2006 |
alc |
Refactor vm_page_sleep_if_busy() so that the test for a busy page is inlined and a procedure call is made in the rare case, i.e., when it is necessary to sleep. In this case, inlining the test actually makes the kernel smaller.
|
161629 |
26-Aug-2006 |
alc |
Prevent a call to contigmalloc() that asks for more physical memory than the machine has from causing a panic.
Submitted by: Michael Plass PR: 101668 MFC after: 3 days
|
161597 |
25-Aug-2006 |
alc |
The return value from vm_pageq_add_new_page() is not used. Eliminate it.
|
161492 |
21-Aug-2006 |
alc |
Add _vm_stats and _vm_stats_misc to the sysctl declarations in sysctl.h and eliminate their declarations from various source files.
|
161489 |
21-Aug-2006 |
alc |
vm_page_zero_idle()'s return value serves no purpose. Eliminate it.
|
161486 |
21-Aug-2006 |
alc |
Page flags are reset on (re)allocation. There is no need to clear any flags except for PG_ZERO in vm_page_free_toq().
|
161257 |
13-Aug-2006 |
alc |
Reimplement the page's NOSYNC flag as an object-synchronized instead of a page queues-synchronized flag. Reduce the scope of the page queues lock in vm_fault() accordingly.
Move vm_fault()'s call to vm_object_set_writeable_dirty() outside of the scope of the page queues lock. Reviewed by: tegge Additionally, eliminate an unnecessary dereference in computing the argument that is passed to vm_object_set_writeable_dirty().
|
161213 |
11-Aug-2006 |
alc |
Ensure that the page's new field for object-synchronized flags is always initialized to zero.
Call vm_page_sleep_if_busy() instead of duplicating its implementation in vm_page_grab().
|
161143 |
10-Aug-2006 |
alc |
Change vm_page_cowfault() so that it doesn't allocate a pre-busied page.
|
161125 |
09-Aug-2006 |
alc |
Introduce a field to struct vm_page for storing flags that are synchronized by the lock on the object containing the page.
Transition PG_WANTED and PG_SWAPINPROG to use the new field, eliminating the need for holding the page queues lock when setting or clearing these flags. Rename PG_WANTED and PG_SWAPINPROG to VPO_WANTED and VPO_SWAPINPROG, respectively.
Eliminate the assertion that the page queues lock is held in vm_page_io_finish().
Eliminate the acquisition and release of the page queues lock around calls to vm_page_io_finish() in kern_sendfile() and vfs_unbusy_pages().
|
161014 |
06-Aug-2006 |
alc |
Eliminate the acquisition and release of the page queues lock around a call to vm_page_sleep_if_busy().
|
161013 |
06-Aug-2006 |
alc |
Change vm_page_sleep_if_busy() so that it no longer requires the caller to hold the page queues lock.
|
161005 |
05-Aug-2006 |
alc |
Remove a stale comment.
|
160960 |
03-Aug-2006 |
alc |
When sleeping on a busy page, use the lock from the containing object rather than the global page queues lock.
|
160889 |
01-Aug-2006 |
alc |
Complete the transition from pmap_page_protect() to pmap_remove_write(). Originally, I had adopted sparc64's name, pmap_clear_write(), for the function that is now pmap_remove_write(). However, this function is more like pmap_remove_all() than like pmap_clear_modify() or pmap_clear_reference(), hence, the name change.
The higher-level rationale behind this change is described in src/sys/amd64/amd64/pmap.c revision 1.567. The short version is that I'm trying to clean up and fix our support for execute access.
Reviewed by: marcel@ (ia64)
|
160585 |
22-Jul-2006 |
alc |
Export the number of object bypasses and collapses through sysctl.
|
160561 |
21-Jul-2006 |
alc |
Retire debug.mpsafevm. None of the architectures supported in CVS require it any longer.
|
160540 |
21-Jul-2006 |
alc |
Eliminate OBJ_WRITEABLE. It hasn't been used in a long time.
|
160525 |
20-Jul-2006 |
alc |
Add pmap_clear_write() to the interface between the virtual memory system's machine-dependent and machine-independent layers. Once pmap_clear_write() is implemented on all of our supported architectures, I intend to replace all calls to pmap_page_protect() by calls to pmap_clear_write(). Why? Both the use and implementation of pmap_page_protect() in our virtual memory system has subtle errors, specifically, the management of execute permission is broken on some architectures. The "prot" argument to pmap_page_protect() should behave differently from the "prot" argument to other pmap functions. Instead of meaning, "give the specified access rights to all of the physical page's mappings," it means "don't take away the specified access rights from all of the physical page's mappings, but do take away the ones that aren't specified." However, owing to our i386 legacy, i.e., no support for no-execute rights, all but one invocation of pmap_page_protect() specifies VM_PROT_READ only, when the intent is, in fact, to remove only write permission. Consequently, a faithful implementation of pmap_page_protect(), e.g., ia64, would remove execute permission as well as write permission. On the other hand, some architectures that support execute permission have basically ignored whether or not VM_PROT_EXECUTE is passed to pmap_page_protect(), e.g., amd64 and sparc64. This change represents the first step in replacing pmap_page_protect() by the less subtle pmap_clear_write() that is already implemented on amd64, i386, and sparc64.
Discussed with: grehan@ and marcel@
|
160460 |
18-Jul-2006 |
rwatson |
Fix build of uma_core.c when DDB is not compiled into the kernel by making uma_zone_sumstat() ifdef DDB, as it's only used with DDB now.
Submitted by: Wolfram Fenske <Wolfram.Fenske at Student.Uni-Magdeburg.DE>
|
160421 |
17-Jul-2006 |
alc |
Ensure that vm_object_deallocate() doesn't dereference a stale object pointer: When vm_object_deallocate() sleeps because of a non-zero paging in progress count on either object or object's shadow, vm_object_deallocate() must ensure that object is still the shadow's backing object when it reawakens. In fact, object may have been deallocated while vm_object_deallocate() slept. If so, reacquiring the lock on object can lead to a deadlock.
Submitted by: ups@ MFC after: 3 weeks
|
160414 |
16-Jul-2006 |
rwatson |
Remove sysctl_vm_zone() and vm.zone sysctl from 7.x. As of 6.x, libmemstat(3) is used by vmstat (and friends) to produce more accurate and more detailed statistics information in a machine-readable way, and vmstat continues to provide the same text-based front-end.
This change should not be MFC'd.
|
160236 |
10-Jul-2006 |
alc |
Set debug.mpsafevm to true on PowerPC. (Now, by default, all architectures in CVS have debug.mpsafevm set to true.)
Tested by: grehan@
|
159880 |
23-Jun-2006 |
jhb |
Move the code to handle the vm.blacklist tunable up a layer into vm_page_startup(). As a result, we now only lookup the tunable once instead of looking it up once for every physical page of memory in the system. This cuts about a one-second delay out of boot on x86 systems. The delay is apparently much larger and more noticeable on sun4v.
Reported by: kmacy MFC after: 1 week
|
159837 |
21-Jun-2006 |
kib |
Make mincore(2) return ENOMEM when the requested range is not fully mapped.
Requested by: Bruno Haible <bruno at clisp org> Reviewed by: alc Approved by: pjd (mentor) MFC after: 1 month
|
159681 |
17-Jun-2006 |
alc |
Use ptoa(psize) instead of size to compute the end of the mapping in vm_map_pmap_enter().
|
159627 |
15-Jun-2006 |
ups |
Remove mpte optimization from pmap_enter_quick(). There is a race with the current locking scheme and removing it should have no measurable performance impact. This fixes page faults leading to panics in pmap_enter_quick_locked() on amd64/i386.
Reviewed by: alc,jhb,peter,ps
|
159620 |
14-Jun-2006 |
alc |
Correct an error in the previous revision that could lead to a panic: Found mapped cache page. Specifically, if cnt.v_free_count dips below cnt.v_free_reserved after p_start has been set to a non-NULL value, then vm_map_pmap_enter() would break out of the loop and incorrectly call pmap_enter_object() for the remaining address range. To correct this error, this revision truncates the address range so that pmap_enter_object() will not map any cache pages.
In collaboration with: tegge@ Reported by: kris@
|
159475 |
10-Jun-2006 |
alc |
Enable debug.mpsafevm on arm by default.
Tested by: cognet@
|
159303 |
05-Jun-2006 |
alc |
Introduce the function pmap_enter_object(). It maps a sequence of resident pages from the same object. Use it in vm_map_pmap_enter() to reduce the locking overhead of premapping objects.
Reviewed by: tegge@
|
159121 |
31-May-2006 |
ps |
Fix minidumps to include pages allocated via pmap_map on amd64. These pages are allocated from the direct map, and were not previous tracked. This included the vm_page_array and the early UMA bootstrap pages.
Reviewed by: peter
|
159054 |
29-May-2006 |
tegge |
Close race between vmspace_exitfree() and exit1() and races between vmspace_exitfree() and vmspace_free() which could result in the same vmspace being freed twice.
Factor out part of exit1() into new function vmspace_exit(). Attach to vmspace0 to allow old vmspace to be freed earlier.
Add new function, vmspace_acquire_ref(), for obtaining a vmspace reference for a vmspace belonging to another process. Avoid changing vmspace refcount from 0 to 1 since that could also lead to the same vmspace being freed twice.
Change vmtotal() and swapout_procs() to use vmspace_acquire_ref().
Reviewed by: alc
|
158803 |
21-May-2006 |
rwatson |
When allocating a bucket to hold a free'd item in UMA fails, don't report this as an allocation failure for the item type. The failure will be separately recorded with the bucket type. This may eliminate high mbuf allocation failure counts under some circumstances, which can be alarming in appearance, but not actually a problem in practice.
MFC after: 2 weeks Reported by: ps, Peter J. Blok <pblok at bsd4all dot org>, OxY <oxy at field dot hu>, Gabor MICSKO <gmicskoa at szintezis dot hu>
|
158525 |
13-May-2006 |
alc |
Simplify the implementation of vm_fault_additional_pages() based upon the object's memq being ordered. Specifically, replace repeated calls to vm_page_lookup() by two simple constant-time operations.
Reviewed by: tegge
|
158387 |
10-May-2006 |
pjd |
Use better order here.
|
158020 |
25-Apr-2006 |
alc |
Add synchronization to vm_pageq_add_new_page() so that it can be called safely after kernel initialization. Remove GIANT_REQUIRED.
MFC after: 6 weeks
|
157920 |
21-Apr-2006 |
trhodes |
It seems that POSIX would rather have ENODEV returned in place of EINVAL when trying to mmap() an fd that isn't a normal file.
Reference: http://www.opengroup.org/onlinepubs/009695399/functions/mmap.html Submitted by: fanf
|
157908 |
21-Apr-2006 |
peter |
Introduce minidumps. Full physical memory crash dumps are still available via the debug.minidump sysctl and tunable.
Traditional dumps store all physical memory. This was once a good thing when machines had a maximum of 64M of ram and 1GB of kvm. These days, machines often have many gigabytes of ram and a smaller amount of kvm. libkvm+kgdb don't have a way to access physical ram that is not mapped into kvm at the time of the crash dump, so the extra ram being dumped is mostly wasted.
Minidumps invert the process. Instead of dumping physical memory in order to guarantee that all of kvm's backing is dumped, minidumps instead dump only memory that is actively mapped into kvm.
amd64 has a direct map region that things like UMA use. Obviously we cannot dump all of the direct map region because that is effectively an old style all-physical-memory dump. Instead, introduce a bitmap and two helper routines (dump_add_page(pa) and dump_drop_page(pa)) that allow certain critical direct map pages to be included in the dump. uma_machdep.c's allocator is the intended consumer.
Dumps are a custom format. At the very beginning of the file is a header, then a copy of the message buffer, then the bitmap of pages present in the dump, then the final level of the kvm page table trees (2MB mappings are expanded into a 4K page mappings), then the sparse physical pages according to the bitmap. libkvm can now conveniently access the kvm page table entries.
Booting my test 8GB machine, forcing it into ddb and forcing a dump leads to a 48MB minidump. While this is a best case, I expect minidumps to be in the 100MB-500MB range. Obviously, never larger than physical memory of course.
Minidumps are on by default. It would only be necessary to turn them off when debugging corrupt kernel page table management, as that corruption would mess up minidumps as well.
Both minidumps and regular dumps are supported on the same machine.
|
157815 |
17-Apr-2006 |
jhb |
Change msleep() and tsleep() to not alter the calling thread's priority if the specified priority is zero. This avoids a race where the calling thread could read a snapshot of its current priority, then a different thread could change the first thread's priority, then the original thread would call sched_prio() inside msleep() undoing the change made by the second thread. I used a priority of zero as no thread that calls msleep() or tsleep() should be specifying a priority of zero anyway.
The various places that passed 'curthread->td_priority' or some variant as the priority now pass 0.
|
157628 |
10-Apr-2006 |
pjd |
On shutdown try to turn off all swap devices. This way GEOM providers are properly closed on shutdown.
Requested by: ru Reviewed by: alc MFC after: 2 weeks
|
157443 |
03-Apr-2006 |
peter |
Remove the unused sva and eva arguments from pmap_remove_pages().
|
157144 |
26-Mar-2006 |
jkoshy |
MFP4: Support for profiling dynamically loaded objects.
Kernel changes:
Inform hwpmc of executable objects brought into the system by kldload() and mmap(), and of their removal by kldunload() and munmap(). A helper function linker_hwpmc_list_objects() has been added to "sys/kern/kern_linker.c" and is used by hwpmc to retrieve the list of currently loaded kernel modules.
The unused `MAPPINGCHANGE' event has been deprecated in favour of separate `MAP_IN' and `MAP_OUT' events; this change reduces space wastage in the log.
Bump the hwpmc's ABI version to "2.0.00". Teach hwpmc(4) to handle the map change callbacks.
Change the default per-cpu sample buffer size to hold 32 samples (up from 16).
Increment __FreeBSD_version.
libpmc(3) changes:
Update libpmc(3) to deal with the new events in the log file; bring the pmclog(3) manual page in sync with the code.
pmcstat(8) changes:
Introduce new options to pmcstat(8): "-r" (root fs path), "-M" (mapfile name), "-q"/"-v" (verbosity control). Option "-k" now takes a kernel directory as its argument but will also work with the older invocation syntax.
Rework string handling in pmcstat(8) to use an opaque type for interned strings. Clean up ELF parsing code and add support for tracking dynamic object mappings reported by a v2.0.00 hwpmc(4).
Report statistics at the end of a log conversion run depending on the requested verbosity level.
Reviewed by: jhb, dds (kernel parts of an earlier patch) Tested by: gallatin (earlier patch)
|
156420 |
08-Mar-2006 |
imp |
Remove leading __ from __(inline|const|signed|volatile). They are obsolete. This should reduce diffs to NetBSD as well.
|
156415 |
08-Mar-2006 |
tegge |
Ignore dirty pages owned by "dead" objects.
|
156225 |
02-Mar-2006 |
tegge |
Eliminate a deadlock when creating snapshots. Blocking vn_start_write() must be called without any vnode locks held. Remove calls to vn_start_write() and vn_finished_write() in vnode_pager_putpages() and add these calls before the vnode lock is obtained to most of the callers that don't already have them.
|
156224 |
02-Mar-2006 |
tegge |
Hold extra reference to vm object while cleaning pages.
|
155884 |
21-Feb-2006 |
jhb |
Lock the vm_object while checking its type to see if it is a vnode-backed object that requires Giant in vm_object_deallocate(). This is somewhat hairy in that if we can't obtain Giant directly, we have to drop the object lock, then lock Giant, then relock the object lock and verify that we still need Giant. If we don't (because the object changed to OBJT_DEAD for example), then we drop Giant before continuing.
Reviewed by: alc Tested by: kris
|
155790 |
17-Feb-2006 |
tegge |
Expand scope of marker to reduce the number of page queue scan restarts.
|
155784 |
17-Feb-2006 |
tegge |
Check return value from nonblocking call to vn_start_write().
|
155737 |
15-Feb-2006 |
ups |
When the VM needs to allocate physical memory pages (for non-interrupt use) and does not have plenty of free pages, it tries to free pages in the cache queue. Unfortunately, freeing a cached page requires locking the object that owns the page. However, in the context of allocating pages we may not be able to lock the object and thus can only TRY to lock it. If the locking try fails, the cached page cannot be freed and is activated to move it out of the way so that we may try to free other cache pages.
If all pages in the cache belong to objects that are currently locked the cache queue can be emptied without freeing a single page. This scenario caused two problems:
1) vm_page_alloc always failed allocation when it tried freeing pages from the cache queue and failed to do so. However if there are more than cnt.v_interrupt_free_min pages on the free list it should return pages when requested with priority VM_ALLOC_SYSTEM. Failure to do so can cause resource exhaustion deadlocks.
2) Threads that need to allocate pages spend a lot of time cleaning up the page queue without really getting anything done while the pagedaemon needs to work overtime to refill the cache.
This change fixes the first problem. (1)
Reviewed by: tegge@
|
155551 |
11-Feb-2006 |
rwatson |
Skip per-cpu caches associated with absent CPUs when generating a memory statistics record stream via sysctl.
MFC after: 3 days
|
155384 |
06-Feb-2006 |
jeff |
- Fix silly VI locking that is used to check a single flag. The vnode lock also protects this flag so it is not necessary. - Don't rely on v_mount to detect whether or not we've been recycled, use the more appropriate VI_DOOMED instead.
Sponsored by: Isilon Systems, Inc. MFC After: 1 week
|
155320 |
04-Feb-2006 |
alc |
Remove an unnecessary call to pmap_remove_all(). The given page is not mapped because its contents are invalid.
|
155230 |
02-Feb-2006 |
tegge |
Adjust old comment (present in rev 1.1) to match changes in rev 1.82.
PR: kern/92509 Submitted by: "Bryan Venteicher" <bryanv@daemoninthecloset.org>
|
155177 |
01-Feb-2006 |
yar |
Use off_t for file size passed to vnode_create_vobject(). The former type, size_t, was causing truncation to 32 bits on i386, which immediately led to undersizing of VM objects backed by files >4GB. In particular, sendfile(2) was broken for such files.
PR: kern/92243 MFC after: 5 days
|
155169 |
01-Feb-2006 |
jeff |
- Install a temporary bandaid in vm_object_reference() that will stop mtx_assert()s from triggering until I find a real long-term solution.
|
155128 |
31-Jan-2006 |
alc |
Change #if defined(DIAGNOSTIC) to KASSERT.
|
155086 |
31-Jan-2006 |
pjd |
Add buffer corruption protection (RedZone) for kernel's malloc(9). It detects both buffer underflow and buffer overflow bugs at runtime (on free(9) and realloc(9)) and prints backtraces from where memory was allocated and from where it was freed.
Tested by: kris
|
154989 |
29-Jan-2006 |
scottl |
The change a few years ago of having contigmalloc start its scan at the top of physical RAM instead of the bottom was a sound idea, but the implementation left a lot to be desired. Scans would spend considerable time looking at pages that are above the address range given by the caller, and multiple calls (like what happens in busdma) would spend more time on top of that rescanning the same pages over and over.
Solve this, at least for now, with two simple optimizations. The first is to not bother scanning high ordered pages that are outside of the provided address range. Second is to cache the page index from the last successful operation so that subsequent scans don't have to restart from the top. This is conditional on the numpages argument being the same or greater between calls.
MFC After: 2 weeks
|
154934 |
27-Jan-2006 |
jhb |
Add a new macro wrapper WITNESS_CHECK() around the witness_warn() function. The difference between WITNESS_CHECK() and WITNESS_WARN() is that WITNESS_CHECK() should be used in the places that the return value of witness_warn() is checked, whereas WITNESS_WARN() should be used in places where the return value is ignored. Specifically, in a kernel without WITNESS enabled, WITNESS_WARN() evaluates to an empty string whereas WITNESS_CHECK() evaluates to 0. I also updated the one place that was checking the return value of WITNESS_WARN() to use WITNESS_CHECK().
|
154929 |
27-Jan-2006 |
cognet |
Make sure b_vp and b_bufobj are NULL before calling relpbuf(), as it asserts they are. They should be NULL at this point, except if we're coming from swapdev_strategy(). It should only affect the case where we're swapping directly on a file over NFS.
|
154927 |
27-Jan-2006 |
alc |
Style: Add blank line after local variable declarations.
|
154896 |
27-Jan-2006 |
alc |
Use the new macros abstracting the page coloring/queues implementation. (There are no functional changes.)
|
154889 |
27-Jan-2006 |
alc |
Use the new macros abstracting the page coloring/queues implementation. (There are no functional changes.)
|
154849 |
26-Jan-2006 |
alc |
Plug a leak in the newer contigmalloc() implementation. Specifically, if a multipage allocation was aborted midway, the pages that were already allocated were not always returned to the free list.
Submitted by: tegge
|
154805 |
25-Jan-2006 |
jeff |
- Avoid calling vm_object_backing_scan() when collapsing an object when the resident page count matches the object size. We know it fully backs its parent in this case.
Reviewed by: acl, tegge Sponsored by: Isilon Systems, Inc.
|
154799 |
25-Jan-2006 |
alc |
The previous revision incorrectly changed a switch statement into an if statement. Specifically, a break statement that previously broke out of the enclosing switch was not changed. Consequently, the enclosing loop terminated prematurely.
This could result in "vm_page_insert: page already inserted" panics.
Submitted by: tegge
|
154788 |
24-Jan-2006 |
alc |
With the recent changes to the implementation of page coloring, the option PQ_NOOPT is used exclusively by vm_pageq.c. Thus, the include of opt_vmpage.h can be removed from vm_page.h.
|
154764 |
24-Jan-2006 |
alc |
In vm_page_set_invalid() invalidate all of the page's mappings as soon as any part of the page's contents is invalidated.
Submitted by: tegge
|
154694 |
22-Jan-2006 |
alc |
Make vm_object_vndeallocate() static. The external calls to it were eliminated in ufs/ffs/ffs_vnops.c's revision 1.125.
|
154076 |
06-Jan-2006 |
jhb |
Reduce the scope of one #ifdef to avoid duplicating a SYSCTL_INT() macro and trim another unneeded #ifdef (it was just around a macro that is already conditionally defined).
|
154035 |
04-Jan-2006 |
netchild |
Convert the PAGE_SIZE check into a CTASSERT.
Suggested by: jhb
|
154031 |
04-Jan-2006 |
netchild |
Prevent divide by zero, use default values in case one of the divisor's is zero.
Tested by: Randy Bush <randy@psg.com>
|
153940 |
31-Dec-2005 |
netchild |
MI changes: - provide an interface (macros) to the page coloring part of the VM system, which allows trying different coloring algorithms without the need to touch every file [1] - make the page queue tuning values readable: sysctl vm.stats.pagequeue - autotuning of the page coloring values based upon the cache size instead of options in the kernel config (disabling of the page coloring as a kernel option is still possible)
MD changes: - detection of the cache size: only IA32 and AMD64 (untested) contains cache size detection code, every other arch just comes with a dummy function (this results in the use of default values like it was the case without the autotuning of the page coloring) - print some more info on Intel CPU's (like we do on AMD and Transmeta CPU's)
Note to AMD owners (IA32 and AMD64): please run "sysctl vm.stats.pagequeue" and report if the cache* values are zero (= bug in the cache detection code) or not.
Based upon work by: Chad David <davidc@acns.ab.ca> [1] Reviewed by: alc, arch (in 2004) Discussed with: alc, Chad David, arch (in 2004)
|
153880 |
30-Dec-2005 |
pjd |
Improve memguard a bit: - Provide tunable vm.memguard.desc, so one can specify memory type without changing the code and recompiling the kernel. - Allow memguard to be used for kernel modules by providing sysctl vm.memguard.desc, which can be changed to the short description of a memory type before the module is loaded. - Move as much memguard code as possible to memguard.c. - Add sysctl node vm.memguard. and move memguard-specific sysctls there. - Add malloc_desc2type() function for finding memory type based on its short description (ks_shortdesc field). - Memory type can be changed (via vm.memguard.desc sysctl) only if it doesn't exist (will be loaded later) or when no memory is allocated yet. If there is allocated memory for the given memory type, return EBUSY. - Implement two ways of memory type comparison and make the safer/slower one the default.
|
153555 |
20-Dec-2005 |
tegge |
Don't access fs->first_object after dropping reference to it. The result could be a missed or extra giant unlock.
Reviewed by: alc
|
153485 |
16-Dec-2005 |
alc |
Use sf_buf_alloc() instead of vm_map_find() on exec_map to create the ephemeral mappings that are used as the source for three copy operations from kernel space to user space. There are two reasons for making this change: (1) Under heavy load exec_map can fill up causing vm_map_find() to fail. When it fails, the nascent process is aborted (SIGABRT). Whereas, this reimplementation using sf_buf_alloc() sleeps. (2) Although it is possible to sleep on vm_map_find()'s failure until address space becomes available (see kmem_alloc_wait()), using sf_buf_alloc() is faster. Furthermore, the reimplementation uses a CPU private mapping, avoiding a TLB shootdown on multiprocessors.
Problem uncovered by: kris@ Reviewed by: tegge@ MFC after: 3 weeks
|
153385 |
13-Dec-2005 |
alc |
Assert that the page that is given to vm_page_free_toq() does not have any managed mappings.
|
153311 |
11-Dec-2005 |
alc |
Remove unneeded calls to pmap_remove_all(). The given page is not mapped.
Reviewed by: tegge
|
153095 |
04-Dec-2005 |
alc |
Simplify vmspace_dofree().
|
153068 |
03-Dec-2005 |
alc |
Eliminate unneeded preallocation at initialization.
Reviewed by: tegge
|
153060 |
03-Dec-2005 |
alc |
Eliminate unneeded preallocation at initialization.
Reviewed by: tegge
|
152630 |
20-Nov-2005 |
alc |
Eliminate pmap_init2(). It's no longer used.
|
152224 |
09-Nov-2005 |
alc |
Reimplement the reclamation of PV entries. Specifically, perform reclamation synchronously from get_pv_entry() instead of asynchronously as part of the page daemon. Additionally, limit the reclamation to inactive pages unless allocation from the PV entry zone or reclamation from the inactive queue fails. Previously, reclamation destroyed mappings to both inactive and active pages. get_pv_entry() still, however, wakes up the page daemon when reclamation occurs. The reason being that the page daemon may move some pages from the active queue to the inactive queue, making some new pages available to future reclamations.
Print the "reclaiming PV entries" message at most once per minute, but don't stop printing it after the fifth time. This way, we do not give the impression that the problem has gone away.
Reviewed by: tegge
|
152178 |
08-Nov-2005 |
alc |
If a physical page is mapped by two or more virtual addresses, transmitted by the zero-copy sockets method, and written to before the transmission completes, we need to destroy all of the existing mappings to the page, not just the one that we fault on. Otherwise, the mappings will no longer be to the same page and changes made through one of the mappings will not be visible through the others.
Observed by: tegge
|
151951 |
01-Nov-2005 |
ps |
Rate limit vnode_pager_putpages printfs to once a second.
|
151918 |
01-Nov-2005 |
alc |
Consider the zero-copy transmission of a page that was wired by mlock(2). If a copy-on-write fault occurs on the page, the new copy should inherit a part of the original page's wire count.
Submitted by: tegge MFC after: 1 week
|
151897 |
31-Oct-2005 |
rwatson |
Normalize a significant number of kernel malloc type names:
- Prefer '_' to ' ', as it results in more easily parsed results in memory monitoring tools such as vmstat.
- Remove punctuation that is incompatible with using memory type names as file names, such as '/' characters.
- Disambiguate some collisions by adding subsystem prefixes to some memory types.
- Generally prefer lower case to upper case.
- If the same type is defined in multiple architecture directories, attempt to use the same name in additional cases.
Not all instances were caught in this change, so more work is required to finish this conversion. Similar changes are required for UMA zone names.
|
151558 |
22-Oct-2005 |
alc |
Use of the ZERO_COPY_SOCKETS options can result in an unusual state that vm_object_backing_scan() was not written to handle. Specifically, a wired page within a backing object that is shadowed by a page within the shadow object. Handle this state by removing the wired page from the backing object. The wired page will be freed by socow_iodone().
Stop masking errors: If a page is being freed by vm_object_backing_scan(), assert that it is no longer mapped rather than quietly destroying any mappings.
Tested by: Harald Schmalzbauer
|
151526 |
20-Oct-2005 |
rwatson |
Change the format string for u_int64_t to %ju from %llu, in order to use the correct format string on 64-bit systems.
Pointed out by: pjd
|
151516 |
20-Oct-2005 |
rwatson |
Add a "show uma" command to DDB, which prints out the current stats for available UMA zones. Quite useful for post-mortem debugging of memory leaks without a dump device configured on a panicked box.
MFC after: 2 weeks
|
151252 |
12-Oct-2005 |
dds |
Move execve's access time update functionality into a new vfs_mark_atime() function, and use the new function for performing efficient atime updates in mmap().
Reviewed by: bde MFC after: 2 weeks
|
151104 |
08-Oct-2005 |
des |
As alc pointed out to me, vm_page.c 1.305 was incomplete: uma_startup() still uses the constant UMA_BOOT_PAGES. Change it to accept boot_pages as an additional argument.
MFC after: 2 weeks
|
150926 |
04-Oct-2005 |
dds |
Update the vnode's access time after an mmap operation on it. Before this change a copy operation with cp(1) would not update the file access times.
According to the POSIX mmap(2) documentation: the st_atime field of the mapped file may be marked for update at any time between the mmap() call and the corresponding munmap() call. The initial read or write reference to a mapped region shall cause the file's st_atime field to be marked for update if it has not already been marked for update.
|
150727 |
29-Sep-2005 |
jhb |
Trim a couple of unneeded includes.
|
150418 |
21-Sep-2005 |
cognet |
Make sure we have a bufobj before calling bstrategy(). I'm not sure this is the right thing to do, but at least I don't panic anymore when swapping on a NFS file without using md(4).
X-MFC after: proper review
|
150397 |
20-Sep-2005 |
peter |
Remove unused (but initialized) variable 'objsize' from vm_mmap()
|
149900 |
09-Sep-2005 |
alc |
Introduce a new lock for the purpose of synchronizing access to the UMA boot pages.
Disable recursion on the general UMA lock now that startup_alloc() no longer uses it.
Eliminate the variable uma_boot_free. It serves no purpose.
Note: This change eliminates a lock-order reversal between a system map mutex and the UMA lock. See http://sources.zabbadoz.net/freebsd/lor.html#109 for details.
MFC after: 3 days
|
149839 |
07-Sep-2005 |
alc |
Eliminate an incorrect cast.
|
149768 |
03-Sep-2005 |
alc |
Pass a value of type vm_prot_t to pmap_enter_quick() so that it can determine whether the mapping should permit execute access.
|
149035 |
13-Aug-2005 |
kan |
Do not use vm_pager_init() to initialize the vnode_pbuf_freecnt variable. vm_pager_init() is run before the required nswbuf variable has been set to its correct value. This caused the system to run with a single pbuf available for the vnode_pager. Handle both the cluster_pbuf_freecnt and vnode_pbuf_freecnt variables in the same way.
Reported by: ade Obtained from: alc MFC after: 2 days
|
148997 |
12-Aug-2005 |
tegge |
Check for marker pages when scanning active and inactive page queues.
Reviewed by: alc
|
148985 |
12-Aug-2005 |
des |
Introduce the vm.boot_pages tunable and sysctl, which controls the number of pages reserved to bootstrap the kernel memory allocator.
MFC after: 2 weeks
|
148909 |
10-Aug-2005 |
tegge |
Don't allow pagedaemon to skip pages while scanning PQ_ACTIVE or PQ_INACTIVE due to the vm object being locked.
When a process writes large amounts of data to a file, the vm object associated with that file can contain most of the physical pages on the machine. If the process is preempted while holding the lock on the vm object, pagedaemon would be able to move very few pages from PQ_INACTIVE to PQ_CACHE or from PQ_ACTIVE to PQ_INACTIVE, resulting in unlimited cleaning of dirty pages belonging to other vm objects.
Temporarily unlock the page queues lock while locking vm objects to avoid lock order violation. Detect and handle relevant page queue changes.
This change depends on both the lock portion of struct vm_object and normal struct vm_page being type stable.
Reviewed by: alc
|
148875 |
08-Aug-2005 |
ssouhlal |
Use atomic operations on runningbufspace.
PR: kern/84318 Submitted by: ade MFC after: 3 days
|
148691 |
04-Aug-2005 |
rwatson |
Don't perform a nested include of opt_vmpage.h if LIBMEMSTAT is defined, as opt_vmpage.h will not be available to user space library builds. A similar existing check is present for KLD_MODULE for similar reasons.
MFC after: 3 days
|
148690 |
04-Aug-2005 |
rwatson |
Wrap inlines in uma_int.h in #ifdef _KERNEL so that uma_int.h can be used from memstat_uma.c for the purposes of kvm access without lots of additional unsafe includes.
MFC after: 3 days
|
148371 |
25-Jul-2005 |
rwatson |
Rename UMA_MAX_NAME to UTH_MAX_NAME, since it's a maximum in the monitoring API, which might or might not be the same as the internal maximum (currently none).
Export flag information on UMA zones -- in particular, whether or not this is a secondary zone, and so the keg free count should be considered in that light.
MFC after: 1 day
|
148200 |
20-Jul-2005 |
alc |
Eliminate inconsistency in the setting of the B_DONE flag. Specifically, make the b_iodone callback responsible for setting it if it is needed. Previously, it was set unconditionally by bufdone() without holding whichever lock is shared by the b_iodone callback and the corresponding top-half function. Consequently, in a race, the top-half function could conclude that operation was done before the b_iodone callback finished. See, for example, aio_physwakeup() and aio_fphysio().
Note: I don't believe that the other, more widely-used b_iodone callbacks are affected.
Discussed with: jeff Reviewed by: phk MFC after: 2 weeks
|
148194 |
20-Jul-2005 |
rwatson |
Further UMA statistics related changes:
- Add a new uma_zfree_internal() flag, ZFREE_STATFREE, which causes it to to update the zone's uz_frees statistic. Previously, the statistic was updated unconditionally.
- Use the flag in situations where a "real" free occurs: i.e., one where the caller is freeing an allocated item, to be differentiated from situations where uma_zfree_internal() is used to tear down the item during slab teardown in order to invoke its fini() method. Also use the flag when UMA is freeing its internal objects.
- When exchanging a bucket with the zone from the per-CPU cache when freeing an item, flush cache statistics back to the zone (since the zone lock and critical section are both held) to match the allocation case.
MFC after: 3 days
|
148193 |
20-Jul-2005 |
alc |
Eliminate an incorrect (and unnecessary) cast.
|
148079 |
16-Jul-2005 |
rwatson |
Use mp_maxid in preference to MAXCPU when creating exports of UMA per-CPU cache statistics. UMA sizes the cache array based on the number of CPUs at boot (mp_maxid + 1), and iterating based on MAXCPU could read off the end of the array (into the next zone).
Reported by: yongari MFC after: 1 week
|
148078 |
16-Jul-2005 |
rwatson |
Improve canonicalization of copyrights. Order copyrights by order of assertion (jeff, bmilekic, rwatson).
Suggested ages ago by: bde MFC after: 1 week
|
148077 |
16-Jul-2005 |
rwatson |
Move the unlocking of the zone mutex in sysctl_vm_zone_stats() so that it covers the following of the uc_alloc/freebucket cache pointers. Originally, I felt that the race wasn't helped by holding the mutex, hence a comment in the code and not holding it across the cache access. However, it does improve consistency, as while it doesn't prevent bucket exchange, it does prevent bucket pointer invalidation. So a race in gathering cache free space statistics still can occur, but not one that follows an invalid bucket pointer, if the mutex is held.
Submitted by: yongari MFC after: 1 week
|
148072 |
16-Jul-2005 |
silby |
Increase the flags field for kegs from a 16 to a 32 bit value; we have exhausted all 16 flags.
|
148070 |
15-Jul-2005 |
rwatson |
Track UMA(9) allocation failures by zone, and export via sysctl.
Requested by: victor cruceru <victor dot cruceru at gmail dot com> MFC after: 1 week
|
148014 |
14-Jul-2005 |
jhb |
Convert a remaining !fs.map->system_map to fs.first_object->flags & OBJ_NEEDGIANT test that was missed in an earlier revision. This fixes mutex assertion failures in the debug.mpsafevm=0 case.
Reported by: ps MFC after: 3 days
|
147996 |
14-Jul-2005 |
rwatson |
Introduce a new sysctl, vm.zone_stats, which exports UMA(9) allocator statistics via a binary structure stream:
- Add structure 'uma_stream_header', which defines a stream version, definition of MAXCPUs used in the stream, and the number of zone records in the stream.
- Add structure 'uma_type_header', which defines the name, alignment, size, resource allocation limits, current pages allocated, preferred bucket size, and central zone + keg statistics.
- Add structure 'uma_percpu_stat', which, for each per-CPU cache, includes the number of allocations and frees, as well as the number of free items in the cache.
- When the sysctl is queried, return a stream header, followed by a series of type descriptions, each consisting of a type header followed by a series of MAXCPU uma_percpu_stat structures holding per-CPU allocation information. Typical values of MAXCPU will be 1 (UP compiled kernel) and 16 (SMP compiled kernel).
This query mechanism allows user space monitoring tools to extract memory allocation statistics in a machine-readable form, and to do so at a per-CPU granularity, allowing monitoring of allocation patterns across CPUs in order to better understand the distribution of work and memory flow over multiple CPUs.
While here, also export the number of UMA zones as a sysctl vm.uma_count, in order to assist in sizing user space buffers to receive the stream.
A follow-up commit of libmemstat(3), a library to monitor kernel memory allocation, will occur in the next few days. This change directly supports converting netstat(1)'s "-mb" mode to using UMA-sourced stats rather than separately maintained mbuf allocator statistics.
MFC after: 1 week
|
147995 |
14-Jul-2005 |
rwatson |
In addition to tracking allocs in the zone, also track frees. Add a zone free counter, as well as a cache free counter.
MFC after: 1 week
|
147994 |
14-Jul-2005 |
rwatson |
In an earlier world order, UMA would flush per-CPU statistics to the zone whenever it was moving buckets between the zone and the cache, or when coalescing statistics across the CPU. Remove flushing of statistics to the zone when coalescing statistics as part of sysctl, as we won't be running on the right CPU to write to the cache statistics.
Add a missed gathering of statistics: when uma_zalloc_internal() does a special case allocation of a single item, make sure to update the zone statistics to represent this. Previously this case wasn't accounted for in user-visible statistics.
MFC after: 1 week
|
147615 |
26-Jun-2005 |
silby |
Change the panic in trash_ctor into just a printf for now. Once the reports of panics in trash_ctor relating to mbufs have been examined and a fix found, this will be turned back into a panic.
Approved by: re (rwatson)
|
147422 |
16-Jun-2005 |
alc |
Increase UMA_BOOT_PAGES to prevent a crash during initialization. See http://docs.FreeBSD.org/cgi/mid.cgi?42AD8270.8060906 for a detailed description of the crash.
Reported by: Eric Anderson Approved by: re (scottl) MFC after: 3 days
|
147283 |
11-Jun-2005 |
green |
The new contigmalloc(9) has a bad degenerate case where there were many regions checked again and again despite knowing the pages contained were not usable and only satisfied the alignment constraints. This case was compounded, especially for large allocations, by the practice of looping from the top of memory so as to keep out of the important low-memory regions. While the old contigmalloc(9) has the same problem, it is not as noticeable due to looping from low memory to high.
This degenerate case is fixed, as well as reversing the sense of the rest of the loops within it, to provide a tremendous speed increase. This makes the best case O(n * VM overhead) much more likely than the worst case O(4 * VM overhead). For comparison, the worst case for old contigmalloc would be O(5 * VM overhead) in addition to its strategy of turning used memory into free being highly pessimal.
Also, fix a bug in the new contigmalloc(9) that in practice most likely couldn't have been triggered: it walked backwards from the end of memory without accounting for how many pages it needed. Potentially, nonexistent pages could have been mapped. This hasn't occurred because the kernel generally requests as its first contigmalloc(9) a single page.
Reported by: Nicolas Dehaine <nicko@stbernard.com>, wes MFC After: 1 month More testing by: Nicolas Dehaine <nicko@stbernard.com>, wes
|
147262 |
10-Jun-2005 |
alc |
Add a comment to the effect that fictitious pages do not require the initialization of their machine-dependent fields.
|
147217 |
10-Jun-2005 |
alc |
Introduce a procedure, pmap_page_init(), that initializes the vm_page's machine-dependent fields. Use this function in vm_pageq_add_new_page() so that the vm_page's machine-dependent and machine-independent fields are initialized at the same time.
Remove code from pmap_init() for initializing the vm_page's machine-dependent fields.
Remove stale comments from pmap_init().
Eliminate the Boolean variable pmap_initialized from the alpha, amd64, i386, and ia64 pmap implementations. Its use is no longer required because of the above changes and earlier changes that result in physical memory that is being mapped at initialization time being mapped without pv entries.
Tested by: cognet, kensmith, marcel
|
146727 |
28-May-2005 |
alc |
Update some comments to reflect the change from spl-based to lock-based synchronization.
|
146554 |
23-May-2005 |
ups |
Use low level constructs borrowed from interrupt threads to wait for work in proc0. Remove the TDP_WAKEPROC0 workaround.
|
146501 |
22-May-2005 |
alc |
Swap in can occur safely without Giant. Release Giant on entry to scheduler().
|
146484 |
22-May-2005 |
alc |
Remove GIANT_REQUIRED from swapout_procs().
|
146459 |
20-May-2005 |
alc |
Reduce the number of times that we acquire and release locks in swap_pager_getpages().
MFC after: 1 week
|
146367 |
19-May-2005 |
alc |
Remove calls to spl*().
|
146363 |
19-May-2005 |
alc |
Remove a stale comment concerning spl* usage.
|
146355 |
18-May-2005 |
alc |
Update some comments to reflect the change from spl-based to lock-based synchronization.
|
146351 |
18-May-2005 |
alc |
Remove calls to spl*().
|
146350 |
18-May-2005 |
alc |
Revert revision 1.270: swp_pager_async_iodone() need not perform VM_LOCK_GIANT().
Discussed with: jeff
|
146340 |
18-May-2005 |
bz |
Correct 32 vs 64 bit signedness issues.
Approved by: pjd (mentor) MFC after: 2 weeks
|
146126 |
12-May-2005 |
grehan |
The final test in unlock_and_deallocate() to determine if GIANT needs to be unlocked wasn't updated to check for OBJ_NEEDGIANT. This caused a WITNESS panic when debug_mpsafevm was set to 0.
Approved by: jeffr
|
146017 |
08-May-2005 |
marcel |
Enable debug_mpsafevm on ia64 due to the severe functional regression caused by recent locking changes when it's off. Revert the logic to trim down the conditional.
Clued-in by: alc@
|
145888 |
04-May-2005 |
jeff |
- We need to inherit the OBJ_NEEDGIANT flag from the original object in vm_object_split().
Spotted by: alc
|
145826 |
03-May-2005 |
jeff |
- Add a new object flag "OBJ_NEEDSGIANT". We set this flag if the underlying vnode requires Giant. - In vm_fault only acquire Giant if the underlying object has NEEDSGIANT set. - In vm_object_shadow inherit the NEEDSGIANT flag from the backing object.
|
145788 |
02-May-2005 |
alc |
Remove GIANT_REQUIRED from vmspace_exec().
Prodded by: jeff
|
145699 |
30-Apr-2005 |
jeff |
- VM_LOCK_GIANT in the swap pager's iodone routine as VFS will soon call it without Giant.
Sponsored by: Isilon Systems, Inc.
|
145686 |
29-Apr-2005 |
rwatson |
Modify UMA to use critical sections to protect per-CPU caches, rather than mutexes, which offers lower overhead on both UP and SMP. When allocating from or freeing to the per-cpu cache, without INVARIANTS enabled, we now no longer perform any mutex operations, which offers a 1%-3% performance improvement in a variety of micro-benchmarks. We rely on critical sections to prevent (a) preemption resulting in reentrant access to UMA on a single CPU, and (b) migration of the thread during access. In the event we need to go back to the zone for a new bucket, we release the critical section to acquire the global zone mutex, and must re-acquire the critical section and re-evaluate which cache we are accessing in case migration has occurred, or circumstances have changed in the current cache.
Per-CPU cache statistics are now gathered lock-free by the sysctl, which can result in small races in statistics reporting for caches.
Reviewed by: bmilekic, jeff (somewhat) Tested by: rwatson, kris, gnn, scottl, mike at sentex dot net, others
|
145584 |
27-Apr-2005 |
jeff |
- Pass the ISOPEN flag to namei so filesystems will know we're about to open them or otherwise access the data.
|
145530 |
25-Apr-2005 |
kris |
Add the vm.exec_map_entries tunable and read-only sysctl, which controls the number of entries in exec_map (maximum number of simultaneous execs that can be handled by the kernel). The default value of 16 is insufficient on heavily loaded machines (particularly SMP machines), and if it is exceeded then executing further processes will generate a SIGABRT.
This is a workaround until a better solution can be implemented.
Reviewed by: alc MFC after: 3 days
|
145144 |
16-Apr-2005 |
des |
Unbreak the build on 64-bit architectures.
|
145127 |
15-Apr-2005 |
jhb |
Add a vm.blacklist tunable which can hold a space or comma separated list of physical addresses. The pages containing these physical addresses will not be added to the free list and thus will effectively be ignored by the VM system. This is mostly useful for the case when one knows of specific physical addresses that have bit errors (such as from a memtest run) so that one can blacklist the bad pages while waiting for the new sticks of RAM to arrive. The physical addresses of any ignored pages are listed in the message buffer as well.
|
145076 |
14-Apr-2005 |
csjp |
Move MAC check_vnode_mmap entry point out from being exclusive to MAP_SHARED so that the entry point gets executed un-conditionally. This may be useful for security policies which want to perform access control checks around run-time linking.
- Add the mmap(2) flags argument to the check_vnode_mmap entry point so that we can make access control decisions based on the type of mapped object. - Update any dependent API around this parameter addition, such as function prototype modifications, entry point parameter additions, and the inclusion of the sys/mman.h header file. - Change the MLS, BIBA and LOMAC security policies so that subject domination routines are not executed unless the type of mapping is shared. This is done to maintain compatibility between the old vm_mmap_vnode(9) and these policies.
Reviewed by: rwatson MFC after: 1 month
|
144970 |
12-Apr-2005 |
jhb |
Tidy vcnt() by moving a duplicated line above #ifdef and removing a useless variable.
|
144635 |
04-Apr-2005 |
jhb |
Flip the switch and turn mpsafevm on by default for sparc64.
Approved by: alc
|
144610 |
03-Apr-2005 |
jeff |
- Don't NULL the vnode's v_object pointer until after the object is torn down. If we have dirty pages, the putpages routine will need to know what the vnode's object is so that it may write out dirty pages.
Pointy hat: phk Found by: obrien
|
144501 |
01-Apr-2005 |
jhb |
- Change the vm_mmap() function to accept an objtype_t parameter specifying the type of object represented by the handle argument. - Allow vm_mmap() to map device memory via cdev objects in addition to vnodes and anonymous memory. Note that mmaping a cdev directly does not currently perform any MAC checks like mapping a vnode does. - Unbreak the DRM getbufs ioctl by having it call vm_mmap() directly on the cdev the ioctl is acting on rather than trying to find a suitable vnode to map from.
Reviewed by: alc, arch@
|
144367 |
31-Mar-2005 |
jeff |
- LK_NOPAUSE is a nop now.
Sponsored by: Isilon Systems, Inc.
|
144322 |
30-Mar-2005 |
alc |
Eliminate (now) unnecessary acquisition and release of the global page queues lock in vm_object_backing_scan(). Updates to the page's PG_BUSY flag and busy field are synchronized by the containing object's lock.
Testing the page's hold_count and wire_count in vm_object_backing_scan()'s OBSC_COLLAPSE_NOWAIT case is unnecessary. There is no reason why the held or wired pages cannot be migrated to the shadow object.
Reviewed by: tegge
|
143821 |
18-Mar-2005 |
das |
Move the swap_zone == NULL check earlier (i.e. before we dereference the pointer.)
Found by: Coverity Prevent analysis tool
|
143745 |
17-Mar-2005 |
jeff |
- Don't lock the vnode interlock in vm_object_set_writeable_dirty() if we've already set the object flags.
Reviewed by: alc
|
143646 |
15-Mar-2005 |
jeff |
- In vm_page_insert() hold the backing vnode when the first page is inserted. - In vm_page_remove() drop the backing vnode when the last page is removed. - Don't check the vnode to see if it must be reclaimed on every call to vm_page_free_toq() as we only check it now when it is actually required. This saves us two lock operations per call.
Sponsored by: Isilon Systems, Inc.
|
143559 |
14-Mar-2005 |
jeff |
- Don't directly adjust v_usecount, use vref() instead.
Sponsored by: Isilon Systems, Inc.
|
143554 |
14-Mar-2005 |
jeff |
- Retire OLOCK and OWANT. All callers hold the vnode lock when creating a vnode object. There has been an assert to prove this for some time.
Sponsored by: Isilon Systems, Inc.
|
143505 |
13-Mar-2005 |
jeff |
- Don't acquire the vnode lock in destroy_vobject, assert that it has already been acquired by the caller.
Sponsored by: Isilon Systems, Inc.
|
142367 |
24-Feb-2005 |
alc |
Revert the first part of revision 1.114 and modify the second part. On architectures implementing uma_small_alloc() pages do not necessarily belong to the kmem object.
|
142079 |
19-Feb-2005 |
phk |
Try to unbreak the vnode locking around vop_reclaim() (based mostly on patch from kan@).
Pull bufobj_invalbuf() out of vinvalbuf() and make g_vfs call it on close. This is not yet a generally safe function, but for this very specific use it is safe. This solves the problem with buffers not being flushed by unmount or after failed mount attempts.
|
141991 |
16-Feb-2005 |
bmilekic |
Well, it seems that I prematurely removed the "All rights reserved" statement from some files, so re-add it for the moment, until the related legalese is sorted out. This change affects:
sys/kern/kern_mbuf.c sys/vm/memguard.c sys/vm/memguard.h sys/vm/uma.h sys/vm/uma_core.c sys/vm/uma_dbg.c sys/vm/uma_dbg.h sys/vm/uma_int.h
|
141983 |
16-Feb-2005 |
bmilekic |
Make UMA set the overloaded page->object back to kmem_object for UMA_ZONE_REFCNT and UMA_ZONE_MALLOC zones, as the page(s) undoubtedly came from kmem_map for those two. Previously it would set it back to NULL for UMA_ZONE_REFCNT zones and although this was probably not fatal, it added MORE code for no reason.
|
141955 |
15-Feb-2005 |
bmilekic |
Rather than overloading the page->object field like UMA does, use instead an unused pageq queue reference in the page structure to stash a pointer to the MemGuard FIFO. Using the page->object field caused problems because when vm_map_protect() was called the second time to set VM_PROT_DEFAULT back onto a set of pages in memguard_map, the protection in the VM would be changed but the PMAP code would lazily not restore the PG_RW bit on the underlying pages right away (see pmap_protect()). So when a page fault finally occurred and the VM noticed the faulting address corresponds to a page that _does_ have write access now, it would then call into PMAP to set back PG_RW (i386 case being discussed here). However, before it got to do that, an assertion on the object lock not being owned would get triggered, as the object of the faulting page would need to be locked but was overloaded by MemGuard. This is precisely why MemGuard cannot overload page->object.
Submitted by: Alan Cox (alc@)
|
141696 |
11-Feb-2005 |
phk |
sysctl node vm.stats can not be static (for ia64 reasons).
|
141670 |
10-Feb-2005 |
bmilekic |
Implement support for buffers larger than PAGE_SIZE in MemGuard. This adds a little bit of complexity, but since the performance requirements are minimal (this is a debugging allocator, after all), it's really not too bad (still only 317 lines).
Also add an additional check to help catch really weird 3-threads-involved races: make memguard_free() write to the first page handed back, always, before it does anything else.
Note that there is still a problem in VM+PMAP (specifically with vm_map_protect) w.r.t. how MemGuard uses it, but this will be fixed shortly and this change stands on its own.
|
141630 |
10-Feb-2005 |
phk |
Make three SYSCTL_NODEs static
|
141629 |
10-Feb-2005 |
phk |
Make npages static and const.
|
141247 |
04-Feb-2005 |
ssouhlal |
Set the scheduling class of the zeroidle thread to PRI_IDLE.
Reviewed by: jhb Approved by: grehan (mentor) MFC after: 1 week
|
141068 |
30-Jan-2005 |
alc |
Update the text of an assertion to reflect changes made in revision 1.148. Submitted by: tegge
Eliminate an unnecessary, temporary increment of the backing object's reference count in vm_object_qcollapse(). Reviewed by: tegge
|
140929 |
28-Jan-2005 |
phk |
Move the contents of vop_stddestroyvobject() to the new vnode_pager function vnode_destroy_vobject().
Make the new function zero the vp->v_object pointer so we can tell if a call is missing.
|
140782 |
25-Jan-2005 |
phk |
Don't use VOP_GETVOBJECT, use vp->v_object directly.
|
140767 |
24-Jan-2005 |
phk |
Move the body of vop_stdcreatevobject() over to the vnode_pager under the name Sande^H^H^H^H^Hvnode_create_vobject().
Make the new function take a size argument which removes the need for a VOP_STAT() or a very pessimistic guess for disks.
Call that new function from vop_stdcreatevobject().
Make vnode_pager_alloc() private now that its only user came home.
|
140734 |
24-Jan-2005 |
phk |
Kill the VV_OBJBUF and test the v_object for NULL instead.
|
140723 |
24-Jan-2005 |
jeff |
- Remove GIANT_REQUIRED where giant is no longer required. - Use VFS_LOCK_GIANT() rather than directly acquiring giant in places where giant is only held because vfs requires it.
Sponsored By: Isilon Systems, Inc.
|
140622 |
22-Jan-2005 |
alc |
Guard against address wrap in kernacc(). Otherwise, a program accessing a bad address range through /dev/kmem can panic the machine.
Submitted by: Mark W. Krentel Reported by: Kris Kennaway MFC after: 1 week
|
140605 |
22-Jan-2005 |
bmilekic |
s/round_page/trunc_page/g
I meant trunc_page. It's only a coincidence this hasn't caused problems yet.
Pointed out by: Antoine Brodin <antoine.brodin@laposte.net>
|
140587 |
21-Jan-2005 |
bmilekic |
Bring in MemGuard, a very simple and small replacement allocator designed to help detect tamper-after-free scenarios, a problem more and more common and likely with multithreaded kernels where race conditions are more prevalent.
Currently MemGuard can only take over malloc()/realloc()/free() for a particular malloc type (or types), and the code brought in with this change manually instruments it to take over M_SUBPROC allocations as an example. If you are planning to use it, for now you must:
1) Put "options DEBUG_MEMGUARD" in your kernel config. 2) Edit src/sys/kern/kern_malloc.c manually, look for "XXX CHANGEME" and replace the M_SUBPROC comparison with the appropriate malloc type (this might require additional but small/simple code modification if, say, the malloc type is declared out of scope). 3) Build and install your kernel. Tune the vm.memguard_divisor boot-time tunable which is used to scale how much of kmem_map you want to allot for MemGuard's use. The default is 10, so kmem_size/10.
ToDo: 1) Bring in a memguard(9) man page. 2) Better instrumentation (e.g., boot-time) of MemGuard taking over malloc types. 3) Teach UMA about MemGuard to allow MemGuard to override zone allocations too. 4) Improve MemGuard if necessary.
This work is partly based on some old patches from Ian Dowse.
|
140439 |
18-Jan-2005 |
alc |
Add checks to vm_map_findspace() to test for address wrap. The conditions where this could occur are very rare, but possible.
Submitted by: Mark W. Krentel MFC after: 2 weeks
|
140319 |
15-Jan-2005 |
alc |
Consider three objects, O, BO, and BBO, where BO is O's backing object and BBO is BO's backing object. Now, suppose that O and BO are being collapsed. Furthermore, suppose that BO has been marked dead (OBJ_DEAD) by vm_object_backing_scan() and that either vm_object_backing_scan() has been forced to sleep due to encountering a busy page or vm_object_collapse() has been forced to sleep due to memory allocation in the swap pager. If vm_object_deallocate() is then called on BBO and BO is BBO's only shadow object, vm_object_deallocate() will collapse BO and BBO. In doing so, it adds a necessary temporary reference to BO. If this collapse also sleeps and the prior collapse resumes first, the temporary reference will cause vm_object_collapse to panic with the message "backing_object %p was somehow re-referenced during collapse!"
Resolve this race by changing vm_object_deallocate() such that it doesn't collapse BO and BBO if BO is marked dead. Once O and BO are collapsed, vm_object_collapse() will attempt to collapse O and BBO. So, vm_object_deallocate() on BBO need do nothing.
Reported by: Peter Holm on 20050107 URL: http://www.holm.cc/stress/log/cons102.html
In collaboration with: tegge@ Candidate for RELENG_4 and RELENG_5 MFC after: 2 weeks
|
140220 |
14-Jan-2005 |
phk |
Eliminate unused and unnecessary "cred" argument from vinvalbuf()
|
140048 |
11-Jan-2005 |
phk |
Remove the unused credential argument from VOP_FSYNC() and VFS_SYNC().
I'm not sure why a credential was added to these in the first place, it is not used anywhere and it doesn't make much sense:
The credentials for syncing a file (ability to write to the file) should be checked at the system call level.
Credentials for syncing one or more filesystems ("none") should be checked at the system call level as well.
If the filesystem implementation needs a particular credential to carry out the syncing it would logically have to use the cached mount credential, or a credential cached along with any delayed write data.
Discussed with: rwatson
|
140031 |
11-Jan-2005 |
bmilekic |
While we want the recursion protection for the bucket zones so that recursion from the VM is handled (and the calling code that allocates buckets knows how to deal with it), we do not want to prevent allocation from the slab header zones (slabzone and slabrefzone) if uk_recurse is not zero for them. The reason is that it could lead to NULL being returned for the slab header allocations even in the M_WAITOK case, and the caller can't handle that (this is also explained in a comment with this commit).
The problem analysis is documented in our mailing lists: http://docs.freebsd.org/cgi/getmsg.cgi?fetch=153445+0+archive/2004/freebsd-current/20041231.freebsd-current
(see entire thread for proper context).
Crash dump data provided by: Peter Holm <peter@holm.cc>
|
139996 |
10-Jan-2005 |
stefanf |
ISO C requires at least one element in an initialiser list.
|
139921 |
08-Jan-2005 |
alc |
Move the acquisition and release of the page queues lock outside of a loop in vm_object_split() to avoid repeated acquisition and release.
|
139835 |
07-Jan-2005 |
alc |
Transfer responsibility for freeing the page taken from the cache queue and (possibly) unlocking the containing object from vm_page_alloc() to vm_page_select_cache(). Recent optimizations to vm_map_pmap_enter() (see vm_map.c revisions 1.362 and 1.363) and pmap_enter_quick() have resulted in panic()s because vm_page_alloc() mistakenly unlocked objects that had not been locked by vm_page_select_cache().
Reported by: Peter Holm and Kris Kennaway
|
139825 |
07-Jan-2005 |
imp |
/* -> /*- for license, minor formatting changes
|
139779 |
06-Jan-2005 |
alc |
Revise the part of vm_pageout_scan() that moves pages from the cache queue to the free queue. With this change, if a page from the cache queue belongs to a locked object, it is simply skipped over rather than moved to the inactive queue.
|
139629 |
03-Jan-2005 |
phk |
When allocating bio's in the swap_pager use M_WAITOK since the alternative is much worse.
|
139495 |
31-Dec-2004 |
alc |
Assert that page allocations during an interrupt specify VM_ALLOC_INTERRUPT.
Assert that pages removed from the cache queue are not busy.
|
139391 |
29-Dec-2004 |
alc |
Access to the page's busy field is (now) synchronized by the containing object's lock. Therefore, the assertion that the page queues lock is held can be removed from vm_page_io_start().
|
139338 |
27-Dec-2004 |
alc |
Note that access to the page's busy count is synchronized by the containing object's lock.
|
139332 |
26-Dec-2004 |
alc |
Assert that the vm object is locked on entry to vm_page_sleep_if_busy(); remove some unneeded code.
|
139318 |
26-Dec-2004 |
bmilekic |
Add my copyright and update Jeff's copyright on UMA source files, as per his request.
Discussed with: Jeffrey Roberson
|
139296 |
25-Dec-2004 |
phk |
fix comment
|
139265 |
24-Dec-2004 |
alc |
Continue the transition from synchronizing access to the page's PG_BUSY flag and busy field with the global page queues lock to synchronizing their access with the containing object's lock. Specifically, acquire the containing object's lock before reading the page's PG_BUSY flag and busy field in vm_fault().
Reviewed by: tegge@
|
139241 |
23-Dec-2004 |
alc |
Modify pmap_enter_quick() so that it expects the page queues to be locked on entry and it assumes the responsibility for releasing the page queues lock if it must sleep.
Remove a bogus comment from pmap_enter_quick().
Using the first change, modify vm_map_pmap_enter() so that the page queues lock is acquired and released once, rather than each time that a page is mapped.
|
138986 |
17-Dec-2004 |
alc |
Eliminate another unnecessary call to vm_page_busy(). (See revision 1.333 for a detailed explanation.)
|
138981 |
17-Dec-2004 |
alc |
Enable debug.mpsafevm by default on alpha.
|
138897 |
15-Dec-2004 |
alc |
In the common case, pmap_enter_quick() completes without sleeping. In such cases, the busying of the page and the unlocking of the containing object by vm_map_pmap_enter() and vm_fault_prefault() is unnecessary overhead. To eliminate this overhead, this change modifies pmap_enter_quick() so that it expects the object to be locked on entry and it assumes the responsibility for busying the page and unlocking the object if it must sleep. Note: alpha, amd64, i386 and ia64 are the only implementations optimized by this change; arm, powerpc, and sparc64 still conservatively busy the page and unlock the object within every pmap_enter_quick() call.
Additionally, this change is the first case where we synchronize access to the page's PG_BUSY flag and busy field using the containing object's lock rather than the global page queues lock. (Modifications to the page's PG_BUSY flag and busy field have asserted both locks for several weeks, enabling an incremental transition.)
|
138538 |
08-Dec-2004 |
alc |
With the removal of kern/uipc_jumbo.c and sys/jumbo.h, vm_object_allocate_wait() is not used. Remove it.
|
138531 |
07-Dec-2004 |
alc |
Almost nine years ago, when support for 1TB files was introduced in revision 1.55, the address parameter to vnode_pager_addr() was changed from an unsigned 32-bit quantity to a signed 64-bit quantity. However, an out-of-range check on the address was not updated. Consequently, memory-mapped I/O on files greater than 2GB could cause a kernel panic. Since the address is now a signed 64-bit quantity, the problem resolution is simply to remove a cast.
Reviewed by: bde@ and tegge@ PR: 73010 MFC after: 1 week
|
138406 |
05-Dec-2004 |
alc |
Correct a sanity check in vnode_pager_generic_putpages(). The cast used to implement the sanity check should have been changed when we converted the implementation of vm_pindex_t from 32 to 64 bits. (Thus, RELENG_4 is not affected.) The consequence of this error would be a legitimate write to an extremely large file being treated as an errant attempt to write metadata.
Discussed with: tegge@
|
138129 |
27-Nov-2004 |
das |
Don't include sys/user.h merely for its side-effect of recursively including other headers.
|
138114 |
26-Nov-2004 |
cognet |
Remove useless casts.
|
138066 |
24-Nov-2004 |
delphij |
Try to close a potential, but serious race in our VM subsystem.
Historically, our contigmalloc1() and contigmalloc2() assumed that a page in PQ_CACHE can be unconditionally reused by busying and freeing it. Unfortunately, when the object happens to be non-NULL, the code will set m->object to NULL and disregard the fact that the page is actually in the VM page bucket, resulting in page bucket hash table corruption and finally, a filesystem corruption, or a 'page not in hash' panic.
This commit borrows the idea from DragonFlyBSD's fix to the VM by Matthew Dillon [1]. This version of the patch will do the following checks:
- When scanning pages in PQ_CACHE, check hold_count and skip over pages that are held temporarily. - For pages in PQ_CACHE selected as candidates for being freed, check whether they are busy at that time.
Note: It seems that this might be unrelated to kern/72539.
Obtained from: DragonFlyBSD, sys/vm/vm_contig.c,v 1.11 and 1.12 [1] Reminded by: Matt Dillon Reworked by: alc MFC After: 1 week
|
137910 |
20-Nov-2004 |
das |
Disable U area swapping and remove the routines that create, destroy, copy, and swap U areas.
Reviewed by: arch@
|
137726 |
15-Nov-2004 |
phk |
Make VOP_BMAP return a struct bufobj for the underlying storage device instead of a vnode for it.
The vnode_pager does not and should not have any interest in what the filesystem uses for backend.
(vfs_cluster doesn't use the backing store argument.)
|
137725 |
15-Nov-2004 |
phk |
Add pbgetbo()/pbrelbo() lighter weight versions of pbgetvp()/pbrelvp().
|
137723 |
15-Nov-2004 |
phk |
More kasserts.
|
137722 |
15-Nov-2004 |
phk |
style polishing.
|
137721 |
15-Nov-2004 |
phk |
Move pbgetvp() and pbrelvp() to vm_pager.c with the rest of the pbuf stuff.
|
137720 |
15-Nov-2004 |
phk |
expect the caller to have called pbrelvp() if necessary.
|
137719 |
15-Nov-2004 |
phk |
Explicitly call pbrelvp()
|
137457 |
09-Nov-2004 |
phk |
Improve readability with a bunch of typedefs for the pager ops.
These can also be used for prototypes in the pagers.
|
137393 |
08-Nov-2004 |
des |
#include <vm/vm_param.h> instead of <machine/vmparam.h> (the former includes the latter, but also declares variables which are defined in kern/subr_param.c).
Change some VM parameters from quad_t to unsigned long. They refer to quantities (size limits for text, heap and stack segments) which must necessarily be smaller than the size of the address space, so long is adequate on all platforms.
MFC after: 1 week
|
137324 |
06-Nov-2004 |
alc |
Eliminate an unnecessary atomic operation. Articulate the rationale in a comment.
|
137309 |
06-Nov-2004 |
rwatson |
Abstract the logic to look up the uma_bucket_zone given a desired number of entries into bucket_zone_lookup(), which helps make more clear the logic of consumers of bucket zones.
Annotate the behavior of bucket_init() with a comment indicating how the various data structures, including the bucket lookup tables, are initialized.
|
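The r137309 entry above describes centralizing the bucket-zone selection. A minimal user-space sketch of the same idea (the sizes, table, and function names here are illustrative, not the actual uma_core.c identifiers):

```c
#include <assert.h>

/* Illustrative bucket sizes, smallest to largest. */
static const int bucket_sizes[] = { 1, 4, 8, 16, 32, 64, 128 };
#define MAX_ENTRIES 128

/* Lookup table mapping "desired entries" -> index into bucket_sizes,
 * filled once, in the spirit of bucket_init(). */
static int size_index[MAX_ENTRIES + 1];

/* bucket_zone_lookup() analogue: return the capacity of the smallest
 * bucket that holds at least 'entries' items (clamped to the largest). */
static int bucket_select(int entries)
{
    static int initialized;
    int i, b;

    if (!initialized) {
        for (i = 0, b = 0; i <= MAX_ENTRIES; i++) {
            if (i > bucket_sizes[b])
                b++;                  /* advance to the next bucket size */
            size_index[i] = b;
        }
        initialized = 1;
    }
    if (entries < 0)
        entries = 0;
    if (entries > MAX_ENTRIES)
        entries = MAX_ENTRIES;
    return bucket_sizes[size_index[entries]];
}
```

Consumers then call one function instead of duplicating the table walk, which is the readability point the commit message makes.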
137306 |
06-Nov-2004 |
phk |
Remove dangling variable
|
137305 |
06-Nov-2004 |
rwatson |
Annotate what bucket_size[] array does; staticize since it's used only in uma_core.c.
|
137299 |
06-Nov-2004 |
das |
Fix the last known race in swapoff(), which could lead to a spurious panic:
swapoff: failed to locate %d swap blocks
The race occurred because putpages() can block between the time it allocates swap space and the time it updates the swap metadata to associate that space with a vm_object, so swapoff() would complain about the temporary inconsistency. I hoped to fix this by making swp_pager_getswapspace() and swp_pager_meta_build() a single atomic operation, but that proved to be inconvenient. With this change, swapoff() simply doesn't attempt to be so clever about detecting when all the pageout activity to the target device should have drained.
|
137297 |
06-Nov-2004 |
alc |
Move a call to wakeup() from vm_object_terminate() to vnode_pager_dealloc() because this call is only needed to wake threads that slept when they discovered a dead object connected to a vnode. To eliminate unnecessary calls to wakeup() by vnode_pager_dealloc(), introduce a new flag, OBJ_DISCONNECTWNT.
Reviewed by: tegge@
|
137268 |
05-Nov-2004 |
jhb |
- Set the priority of the page zeroing thread using sched_prio() when the thread is created rather than adjusting the priority in the main function. (kthread_create() should probably take the initial priority as an argument.) - Only yield the CPU in the !PREEMPTION case if there are any other runnable threads. Yielding when there isn't anything else better to do just wastes time in pointless context switches (albeit while the system is idle.)
|
137243 |
05-Nov-2004 |
alc |
During traversal of the inactive queue, try locking the page's containing object before accessing the page's flags or the object's reference count.
|
137242 |
05-Nov-2004 |
alc |
Eliminate another unnecessary call to vm_page_busy() that immediately precedes a call to vm_page_rename(). (See the previous revision for a detailed explanation.)
|
137239 |
05-Nov-2004 |
das |
Close a race in swapoff(). Here are the gory details:
In order to avoid livelock, swapoff() skips over objects with a nonzero pip count and makes another pass if necessary. Since it is impossible to know which objects we care about, it would choose an arbitrary object with a nonzero pip count and wait for it before making another pass, the theory being that this object would finish paging about as quickly as the ones we care about. Unfortunately, we may have slept since we acquired a reference to this object. Hack around this problem by tsleep()ing on the pointer anyway, but timeout after a fixed interval. More elegant solutions are possible, but the ones I considered unnecessarily complicate this rare case.
Also, kill some nits that seem to have crept into the swapoff() code in the last 75 revisions or so:
- Don't pass both sp and sp->sw_used to swap_pager_swapoff(), since the latter can be derived from the former.
- Replace swp_pager_find_dev() with something simpler. There's no need to iterate over the entire list of swap devices just to determine if a given block is assigned to the one we're interested in.
- Expand the scope of the swhash_mtx in a couple of places so that it isn't released and reacquired once for every hash bucket.
- Don't drop the swhash_mtx while holding a reference to an object. We need to lock the object first. Unfortunately, doing so would violate the established lock order, so use VM_OBJECT_TRYLOCK() and try again on a subsequent pass if the object is already locked.
- Refactor swp_pager_force_pagein() and swap_pager_swapoff() a bit.
|
137197 |
04-Nov-2004 |
phk |
Retire b_magic now, we have the bufobj containing the same hint.
|
137191 |
04-Nov-2004 |
phk |
De-couple our I/O bio request from the embedded bio in buf by explicitly copying the fields.
|
137186 |
04-Nov-2004 |
phk |
Remove buf->b_dev field.
|
137168 |
03-Nov-2004 |
alc |
The synchronization provided by vm object locking has eliminated the need for most calls to vm_page_busy(). Specifically, most calls to vm_page_busy() occur immediately prior to a call to vm_page_remove(). In such cases, the containing vm object is locked across both calls. Consequently, the setting of the vm page's PG_BUSY flag is not even visible to other threads that are following the synchronization protocol.
This change (1) eliminates the calls to vm_page_busy() that immediately precede a call to vm_page_remove() or functions, such as vm_page_free() and vm_page_rename(), that call it and (2) relaxes the requirement in vm_page_remove() that the vm page's PG_BUSY flag is set. Now, the vm page's PG_BUSY flag is set only when the vm object lock is released while the vm page is still in transition. Typically, this is when it is undergoing I/O.
|
137104 |
31-Oct-2004 |
alc |
Introduce a Boolean variable wakeup_needed to avoid repeated, unnecessary calls to wakeup() by vm_page_zero_idle_wakeup().
|
137091 |
30-Oct-2004 |
alc |
During traversal of the active queue by vm_pageout_page_stats(), try locking the page's containing object before accessing the page's flags.
|
137079 |
30-Oct-2004 |
alc |
Eliminate an unused but initialized variable.
|
137060 |
30-Oct-2004 |
alc |
Add an assignment statement that I omitted from the previous revision.
|
137005 |
28-Oct-2004 |
alc |
Assert that the containing vm object is locked in vm_page_cache() and vm_page_try_to_cache().
|
137001 |
27-Oct-2004 |
bmilekic |
Fix an INVARIANTS-only bug introduced in Revision 1.104:
If INVARIANTS is defined, and in the rare case that we have allocated some objects from the slab and at least one initializer on at least one of those objects failed, and we need to fail the allocation and push the uninitialized items back into the slab caches -- in that scenario, we would fail to [re]set the bucket cache's ub_bucket item references to NULL, which would eventually trigger a KASSERT.
|
136996 |
27-Oct-2004 |
alc |
During traversal of the active queue, try locking the page's containing object before accessing the page's flags or the object's reference count. If the trylock fails, handle the page as though it is busy.
|
136977 |
26-Oct-2004 |
phk |
Also check that the sectormask is bigger than zero.
Wrap this overly long KASSERT and remove newline.
|
136966 |
26-Oct-2004 |
phk |
Put the I/O block size in bufobj->bo_bsize.
We keep si_bsize_phys around for now as that is the simplest way to pull the number out of disk device drivers in devfs_open(). The correct solution would be to do an ioctl(DIOCGSECTORSIZE), but the point is probably moot when filesystems sit on GEOM, so don't bother for now.
|
136961 |
26-Oct-2004 |
phk |
Don't clear flags we just checked were not set.
|
136952 |
25-Oct-2004 |
alc |
Assert that the containing vm object is locked in vm_page_flash().
|
136931 |
24-Oct-2004 |
alc |
Assert that the containing vm object is locked in vm_page_busy() and vm_page_wakeup().
|
136927 |
24-Oct-2004 |
phk |
Move the buffer method vector (buf->b_op) to the bufobj.
Extend it with a strategy method.
Add bufstrategy() which does the usual VOP_SPECSTRATEGY/VOP_STRATEGY song and dance.
Rename ibwrite to bufwrite().
Move the two NFS buf_ops to more sensible places, add bufstrategy to them.
Add inlines for bwrite() and bstrategy() which call through buf->b_bufobj->b_ops->b_{write,strategy}().
Replace almost all VOP_STRATEGY()/VOP_SPECSTRATEGY() calls with bstrategy().
|
136924 |
24-Oct-2004 |
alc |
Acquire the vm object lock before rather than after calling vm_page_sleep_if_busy(). (The motivation being to transition synchronization of the vm_page's PG_BUSY flag from the global page queues lock to the per-object lock.)
|
136923 |
24-Oct-2004 |
alc |
Use VM_ALLOC_NOBUSY instead of calling vm_page_wakeup().
|
136850 |
24-Oct-2004 |
alc |
Introduce VM_ALLOC_NOBUSY, an option to vm_page_alloc() and vm_page_grab() that indicates that the caller does not want a page with its busy flag set. In many places, the global page queues lock is acquired and released just to clear the busy flag on a just allocated page. Both the allocation of the page and the clearing of the busy flag occur while the containing vm object is locked. So, the busy flag might as well never be set.
|
136767 |
22-Oct-2004 |
phk |
Add b_bufobj to struct buf which eventually will eliminate the need for b_vp.
Initialize b_bufobj for all buffers.
Make incore() and gbincore() take a bufobj instead of a vnode.
Make inmem() local to vfs_bio.c
Change a lot of VI_[UN]LOCK(bp->b_vp) to BO_[UN]LOCK(bp->b_bufobj) also VI_MTX() to BO_MTX(),
Make buf_vlist_add() take a bufobj instead of a vnode.
Eliminate other uses of bp->b_vp where bp->b_bufobj will do.
Various minor polishing: remove "register", turn panic into KASSERT, use new function declarations, TAILQ_FOREACH_SAFE() etc.
|
136751 |
21-Oct-2004 |
phk |
Move the VI_BWAIT flag into the bo_flag element of bufobj and call it BO_WWAIT
Add bufobj_wref(), bufobj_wdrop() and bufobj_wwait() to handle the write count on a bufobj. Bufobj_wdrop() replaces vwakeup().
Use these functions in all relevant places except in ffs_softdep.c, where the use of interlocked_sleep() makes this impossible.
Rename b_vnbufs to b_bobufs now that we touch all the relevant files anyway.
|
136655 |
18-Oct-2004 |
alc |
Correct two errors in PG_BUSY management by vm_page_cowfault(). Both errors are in rarely executed paths. 1. Each time the retry_alloc path is taken, the PG_BUSY flag must be set again. Otherwise vm_page_remove() panics. 2. There is no need to set PG_BUSY on the newly allocated page before freeing it. The page already has PG_BUSY set by vm_page_alloc(). Setting it again could cause an assertion failure.
MFC after: 2 weeks
|
136627 |
17-Oct-2004 |
alc |
Assert that the containing object is locked in vm_page_io_start() and vm_page_io_finish(). The motivation being to transition synchronization of the vm_page's busy field from the global page queues lock to the per-object lock.
|
136621 |
17-Oct-2004 |
alc |
Remove unnecessary check for curthread == NULL.
|
136404 |
11-Oct-2004 |
peter |
Put on my peril sensitive sunglasses and add a flags field to the internal sysctl routines and state. Add some code to use it for signalling the need to downconvert a data structure to 32 bits on a 64 bit OS when requested by a 32 bit app.
I tried to do this in a generic abi wrapper that intercepted the sysctl oid's, or looked up the format string etc, but it was a real can of worms that turned into a fragile mess before I even got it partially working.
With this, we can now run 'sysctl -a' on a 32 bit sysctl binary and have it not abort. Things like netstat, ps, etc have a long way to go.
This also fixes a bug in the kern.ps_strings and kern.usrstack hacks. These do matter very much because they are used by libc_r and other things.
|
136334 |
09-Oct-2004 |
green |
In the previous revision, I did not intend to change the default value of "nosleepwithlocks."
Submitted by: ru
|
136276 |
08-Oct-2004 |
green |
Fix critical stability problems that can cause UMA mbuf cluster state management corruption, mbuf leaks, general mbuf corruption, and at least on i386 a first level splash damage radius that encompasses up to about half a megabyte of the memory after an mbuf cluster's allocation slab. In short, this has caused instability nightmares anywhere the right kind of network traffic is present.
When the polymorphic refcount slabs were added to UMA, the new types were not used pervasively. In particular, the slab management structure was turned into one for refcounts, and one for non-refcounts (supposed to be mostly like the old slab management structure), but the latter was almost always used throughout. In general, every access to zones with UMA_ZONE_REFCNT turned on corrupted the "next free" slab offset and the refcount with each other and with other allocations (on i386, 2 mbuf clusters per 4096 byte slab).
Fix things so that the right type is used to access refcounted zones where it was not before. There are additional errors in gross overestimation of padding, it seems, that would cause large kegs (nee zones) to be allocated when small ones would do. Unless I have analyzed this incorrectly, it is not directly harmful.
|
135746 |
24-Sep-2004 |
das |
Don't look for swap blocks in objects that aren't swap-backed. I expect that this will fix the following panic, reported by Jun: swap_pager_isswapped: failed to locate all swap meta blocks
MT5 candidate
|
135727 |
24-Sep-2004 |
phk |
XXX mark two places where we do not hold a threadcount on the dev when frobbing the cdevsw.
In both cases we examine only the cdevsw and it is a good question if we weren't better off copying those properties into the cdev in the first place. This question will be revisited.
|
135707 |
24-Sep-2004 |
phk |
Use dev_re[fl]thread() to maintain a ref on the device driver while we call the ->d_mmap function.
|
135470 |
19-Sep-2004 |
das |
The zone from which proc structures are allocated is marked UMA_ZONE_NOFREE to guarantee type stability, so proc_fini() should never be called. Move an assertion from proc_fini() to proc_dtor() and garbage-collect the rest of the unreachable code. I have retained vm_proc_dispose(), since I consider its disuse a bug.
|
135262 |
15-Sep-2004 |
phk |
Add new a function isa_dma_init() which returns an errno when it fails and which takes a M_WAITOK/M_NOWAIT flag argument.
Add compatibility isa_dmainit() macro which whines loudly if isa_dma_init() fails.
Problem uncovered by: tegge
|
135088 |
11-Sep-2004 |
alc |
System maps are prohibited from mapping vnode-backed objects. Take advantage of this restriction to avoid acquiring and releasing Giant when wiring pages within a system map.
In collaboration with: tegge@
|
134892 |
07-Sep-2004 |
phk |
add KASSERTS
|
134747 |
04-Sep-2004 |
alc |
Enable debug.mpsafevm by default on amd64 and i386. This enables copy-on-write and zero-fill faults to run without holding Giant. It is still possible to disable Giant-free operation by setting debug.mpsafevm to 0 in loader.conf.
|
134675 |
03-Sep-2004 |
alc |
Push Giant deep into vm_forkproc(), acquiring it only if the process has mapped System V shared memory segments (see shmfork_myhook()) or requires the allocation of an ldt (see vm_fault_wire()).
|
134649 |
02-Sep-2004 |
scottl |
Turn PREEMPTION into a kernel option. Make sure that it's defined if FULL_PREEMPTION is defined. Add a runtime warning to ULE if PREEMPTION is enabled (code inspired by the PREEMPTION warning in kern_switch.c). This is a possible MT5 candidate.
|
134615 |
01-Sep-2004 |
alc |
Remove dead code.
|
134612 |
01-Sep-2004 |
alc |
In vm_fault_unwire() eliminate the acquisition and release of Giant in the case of non-kernel pmaps.
|
134586 |
01-Sep-2004 |
julian |
Give setrunqueue() and sched_add() more of a clue as to where they are coming from and what is expected from them.
MFC after: 2 days
|
134496 |
29-Aug-2004 |
alc |
Move the acquisition and release of the lock on the object at the head of the shadow chain outside of the loop in vm_object_madvise(), reducing the number of times that this lock is acquired and released.
|
134461 |
29-Aug-2004 |
iedowse |
Prevent vm_page_zero_idle_wakeup() from attempting to wake up the page zeroing thread before it has been created. It was possible for calls to free() very early in the boot process to panic here because the sleep queues were not yet initialised. Specifically, sysinit_add() running at SI_SUB_KLD would trigger this if the array of pointers became big enough to require uma_large_alloc() allocations.
Submitted by: peter
|
134184 |
22-Aug-2004 |
marcel |
Move the cow field between wire_count and hold_count. This is the position that is 64-bit aligned and makes sure that the valid and dirty fields are also 64-bit aligned. This means that if PAGE_SIZE is 32K, the size of the vm_page structure is only increased by 8 bytes instead of 16 bytes. More importantly, the vm_page structure is either 120 or 128 bytes on ia64. These are "interesting" sizes.
|
134139 |
22-Aug-2004 |
alc |
In the previous revision, I failed to condition an early release of Giant in vm_fault() on debug_mpsafevm. If debug_mpsafevm was not set, the result was an assertion failure early in the boot process.
Reported by: green@
|
134128 |
21-Aug-2004 |
alc |
Further reduce the use of Giant by vm_fault(): Giant is held only when manipulating a vnode, e.g., calling vput(). This reduces contention for Giant during many copy-on-write faults, resulting in some additional speedup on SMPs.
Note: debug_mpsafevm must be enabled for this optimization to take effect.
|
133996 |
19-Aug-2004 |
alc |
Acquire and release Giant around a call to VOP_BMAP(). (This is a prerequisite to any further reduction in Giant's use by vm_fault().)
|
133807 |
16-Aug-2004 |
alc |
- Introduce and use a new tunable "debug.mpsafevm". At present, setting "debug.mpsafevm" results in (almost) Giant-free execution of zero-fill page faults. (Giant is held only briefly, just long enough to determine if there is a vnode backing the faulting address.)
Also, condition the acquisition and release of Giant around calls to pmap_remove() on "debug.mpsafevm".
The effect on performance is significant. On my dual Opteron, I see a 3.6% reduction in "buildworld" time.
- Use atomic operations to update several counters in vm_fault().
|
133796 |
16-Aug-2004 |
green |
Rather than bringing back all of the changes to make VM map deletion wait for system wires to disappear, do so (much more trivially) by instead only checking for system wires of user maps and not kernel maps.
Alternative by: tor Reviewed by: alc
|
133726 |
14-Aug-2004 |
alc |
Remove spl calls.
|
133636 |
13-Aug-2004 |
alc |
Replace the linear search in vm_map_findspace() with an O(log n) algorithm built into the map entry splay tree. This replaces the first_free hint in struct vm_map with two fields in vm_map_entry: adj_free, the amount of free space following a map entry, and max_free, the maximum amount of free space in the entry's subtree. These fields make it possible to find a first-fit free region of a given size in one pass down the tree, so O(log n) amortized using splay trees.
This significantly reduces the overhead in vm_map_findspace() for applications that mmap() many hundreds or thousands of regions, and has a negligible slowdown (0.1%) on buildworld. See, for example, the discussion of a micro-benchmark titled "Some mmap observations compared to Linux 2.6/OpenBSD" on -hackers in late October 2003.
OpenBSD adopted this approach in March 2002, and NetBSD added it in November 2003, both with Red-Black trees.
Submitted by: Mark W. Krentel
|
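The r133636 entry describes augmenting each map entry with adj_free and max_free so that a first-fit search completes in one pass down the tree. A simplified sketch of that bookkeeping (a plain binary tree here, without the splay rotations, and with invented helper names):

```c
#include <assert.h>
#include <stddef.h>

/* Each entry caches adj_free (the gap after it) and max_free
 * (the largest gap anywhere in its subtree). */
struct entry {
    size_t start, end;        /* mapped region [start, end) */
    size_t adj_free;          /* free space after this entry */
    size_t max_free;          /* max gap in this subtree */
    struct entry *left, *right;
};

static size_t subtree_max(struct entry *e) { return e ? e->max_free : 0; }

/* Recompute the cached max_free after children/adj_free change. */
static void update(struct entry *e)
{
    size_t m = e->adj_free;
    if (subtree_max(e->left) > m) m = subtree_max(e->left);
    if (subtree_max(e->right) > m) m = subtree_max(e->right);
    e->max_free = m;
}

/* One pass: lowest address of a gap >= length, or 0 if none fits. */
static size_t findspace(struct entry *e, size_t length)
{
    while (e != NULL) {
        if (subtree_max(e->left) >= length)
            e = e->left;            /* a big-enough gap lies lower */
        else if (e->adj_free >= length)
            return e->end;          /* gap right after this entry */
        else if (subtree_max(e->right) >= length)
            e = e->right;           /* only higher addresses qualify */
        else
            return 0;
    }
    return 0;
}

/* Tiny fixed tree: [0,10) gap 5, [15,20) gap 30, [50,60) gap 2. */
static struct entry *demo_tree(void)
{
    static struct entry a = { 0, 10, 5, 0, NULL, NULL };
    static struct entry b = { 15, 20, 30, 0, NULL, NULL };
    static struct entry c = { 50, 60, 2, 0, NULL, NULL };
    b.left = &a; b.right = &c;
    update(&a); update(&c); update(&b);
    return &b;
}
```

Because max_free is maintained bottom-up on the path touched by each insert or delete, the search never needs the linear walk that the old first_free hint required.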
133598 |
12-Aug-2004 |
tegge |
The vm map lock is needed in vm_fault() after the page has been found, to avoid later changes before pmap_enter() and vm_fault_prefault() have completed.
Simplify deadlock avoidance by not blocking on vm map relookup.
In collaboration with: alc
|
133587 |
12-Aug-2004 |
green |
Re-delete the comment from r1.352.
|
133435 |
10-Aug-2004 |
green |
Back out all behavioral changes.
|
133401 |
09-Aug-2004 |
green |
Revamp VM map wiring.
* Allow no-fault wiring/unwiring to succeed for consistency; however, the wired count remains at zero, so it's a special case.
* Fix issues inside vm_map_wire() and vm_map_unwire() where the exact state of user wiring (one or zero) and system wiring (zero or more) could be confused; for example, system unwiring could succeed in removing a user wire, instead of being an error.
* Require all mappings to be unwired before they are deleted. When VM space is still wired upon deletion, it will be waited upon for the following unwire. This makes vslock(9) work rather than allowing kernel-locked memory to be deleted out from underneath of its consumer as it would before.
|
133398 |
09-Aug-2004 |
alc |
Make two changes to vm_fault(). 1. Move a comment to its proper place, updating it. (Except for white- space, this comment had been unchanged since revision 1.1!) 2. Remove spl calls.
|
133395 |
09-Aug-2004 |
alc |
Remove a stale comment from vm_map_lookup() that pertains to share maps. (The last vestiges of the share map code were removed in revisions 1.153 and 1.159.)
|
133355 |
09-Aug-2004 |
alc |
Make two changes to vm_fault(). 1. Retain the map lock until after the calls to pmap_enter() and vm_fault_prefault(). 2. Remove a stale comment. Submitted by: tegge@
|
133318 |
08-Aug-2004 |
phk |
Tag all geom classes in the tree with a version number.
|
133253 |
07-Aug-2004 |
alc |
Remove dead code. A vm_map's first_free is never NULL (even if the map is full).
(This is preparation for an O(log n) implementation of vm_map_findspace().)
Submitted by: Mark W. Krentel
|
133230 |
06-Aug-2004 |
rwatson |
Generate KTR trace records for uma_zalloc_arg() and uma_zfree_arg(). This doesn't trace every event of interest in UMA, but provides enough basic information to explain lock traces and sleep patterns.
|
133185 |
05-Aug-2004 |
green |
Turn on the new contigmalloc(9) by default. There should not actually be a reason to use the old contigmalloc(9), but if desired, the vm.old_contigmalloc setting can be tuned/sysctl'd back to 0 for now.
|
133158 |
05-Aug-2004 |
phk |
Remove a product specific workaround for wrong modes when mmap(2)'ing devices. They have had plenty of time to adjust now.
|
133143 |
04-Aug-2004 |
alc |
- Push down the acquisition and release of Giant into pmap_enter_quick() on those architectures without pmap locking. - Eliminate the acquisition and release of Giant in vm_map_pmap_enter().
|
133113 |
04-Aug-2004 |
dfr |
In dev_pager_updatefake, m->valid is typically 0 on entry. It should be set to VM_PAGE_BITS_ALL before returning, to ensure that neither vm_pager_get_pages nor vm_fault calls vm_page_zero_invalid after dev_pager_getpages has returned.
Submitted by: tegge
|
132999 |
02-Aug-2004 |
alc |
Eliminate the acquisition and release of Giant around the call to pmap_mincore() in mincore(2). Either pmap locking exists (alpha, amd64, i386, ia64) or pmap_mincore() is unimplemented (arm, powerpc, sparc64).
|
132987 |
02-Aug-2004 |
green |
* Add a "how" argument to uma_zone constructors and initialization functions so that they know whether the allocation is supposed to be able to sleep or not. * Allow uma_zone constructors and initialization functions to return either success or error. Almost all of the ones in the tree currently return success unconditionally, but mbuf is a notable exception: the packet zone constructor wants to be able to fail if it cannot suballocate an mbuf cluster, and the mbuf allocators want to be able to fail in general in a MAC kernel if the MAC mbuf initializer fails. This fixes the panics people are seeing when they run out of memory for mbuf clusters. * Allow debug.nosleepwithlocks on WITNESS to be disabled, without changing the default.
Both bmilekic and jeff have reviewed the changes made to make failable zone allocations work.
|
132899 |
30-Jul-2004 |
alc |
- Push down the acquisition and release of Giant into pmap_protect() on those architectures without pmap locking. - Eliminate the acquisition and release of Giant from vm_map_protect().
(Translation: mprotect(2) runs to completion without touching Giant on alpha, amd64, i386 and ia64.)
|
132898 |
30-Jul-2004 |
alc |
Giant is no longer required by vm_waitproc() and vmspace_exitfree(). Eliminate its acquisition and release around vm_waitproc() in kern_wait().
|
132884 |
30-Jul-2004 |
dfr |
Fix a memory leak in the device pager which is exposed by the NVIDIA OpenGL driver.
Submitted by: nvidia (possibly also tegge)
|
132883 |
30-Jul-2004 |
dfr |
Fix handling of msync(2) for character special files.
Submitted by: nvidia
|
132880 |
30-Jul-2004 |
mux |
Get rid of another lockmgr(9) consumer by using sx locks for the user maps. We always acquire the sx lock exclusively here, but we can't use a mutex because we want to be able to sleep while holding the lock. This is completely equivalent to what we were doing with the lockmgr(9) locks before.
Approved by: alc
|
132852 |
29-Jul-2004 |
alc |
Advance the state of pmap locking on alpha, amd64, and i386.
- Enable recursion on the page queues lock. This allows calls to vm_page_alloc(VM_ALLOC_NORMAL) and UMA's obj_alloc() with the page queues lock held. Such calls are made to allocate page table pages and pv entries. - The previous change enables a partial reversion of vm/vm_page.c revision 1.216, i.e., the call to vm_page_alloc() by vm_page_cowfault() now specifies VM_ALLOC_NORMAL rather than VM_ALLOC_INTERRUPT. - Add partial locking to pmap_copy(). (As a side-effect, pmap_copy() should now be faster on i386 SMP because it no longer generates IPIs for TLB shootdown on the other processors.) - Complete the locking of pmap_enter() and pmap_enter_quick(). (As of now, all changes to a user-level pmap on alpha, amd64, and i386 are performed with appropriate locking.)
|
132842 |
29-Jul-2004 |
bmilekic |
Rework the way slab header storage space is calculated in UMA.
- zone_large_init() stays pretty much the same. - zone_small_init() will try to stash the slab header in the slab page being allocated if the amount of calculated wasted space is less than UMA_MAX_WASTE (for both the UMA_ZONE_REFCNT case and regular case). If the amount of wasted space is >= UMA_MAX_WASTE, then UMA_ZONE_OFFPAGE will be set and the slab header will be allocated separately for better use of space. - uma_startup() calculates the maximum ipers required in offpage slabs (so that the offpage slab header zone(s) can be sized accordingly). The algorithm used to calculate this replaces the old calculation (which only happened to work coincidentally). We now iterate over possible object sizes, starting from the smallest one, until we determine that wastedspace calculated in zone_small_init() might end up being greater than UMA_MAX_WASTE, at which point we use the found object size to compute the maximum possible ipers. The reason this works is because: - wastedspace versus objectsize is a see-saw function with local minima all equal to zero and local maxima growing directly proportional to objectsize. This implies that for objects up to or equal to a certain objectsize, the see-saw remains entirely below UMA_MAX_WASTE, so for those objectsizes it is impossible to ever go OFFPAGE for slab headers. - ipers (items-per-slab) versus objectsize is an inversely proportional function which falls off very quickly (very large for small objectsizes). - To determine the maximum ipers we'll ever need from OFFPAGE slab headers we first find the largest objectsize for which we are guaranteed not to go offpage, and use it to compute ipers (as though we were offpage). Since the only objectsizes allowed to go offpage are bigger than the found objectsize, and since ipers vs objectsize is inversely proportional (and monotonically decreasing), then we are guaranteed that the ipers computed is always >= what we will ever need in offpage slab headers.
- Define UMA_FRITM_SZ and UMA_FRITMREF_SZ to be the actual (possibly padded) size of each freelist index so that offset calculations are fixed.
This might fix weird data corruption problems and certainly allows ARM to now boot to at least single-user (via simulator).
Tested on i386 UP by me. Tested on sparc64 SMP by fenner. Tested on ARM simulator to single-user by cognet.
|
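A toy model of the in-page-versus-offpage slab header decision described in the r132842 entry (the sizes and threshold here are made up for illustration; the real zone_small_init() operates on kegs with considerably more state):

```c
#include <assert.h>

#define SLAB_SIZE     4096      /* assumed page-sized slab */
#define HDR_SIZE        32      /* assumed in-page slab header size */
#define UMA_MAX_WASTE  256      /* illustrative waste threshold */

struct layout {
    int ipers;                  /* items per slab */
    int offpage;                /* 1 if the header is allocated separately */
};

/* If keeping the header in the page wastes too much space, go
 * "offpage": the header lives elsewhere and the whole page holds items. */
static struct layout zone_small_init(int objsize)
{
    struct layout l;

    l.ipers = (SLAB_SIZE - HDR_SIZE) / objsize;
    l.offpage = 0;
    if (SLAB_SIZE - HDR_SIZE - l.ipers * objsize >= UMA_MAX_WASTE) {
        l.ipers = SLAB_SIZE / objsize;
        l.offpage = 1;
    }
    return l;
}
```

With these numbers a 64-byte object stays in-page (little waste), while a 2048-byte object goes offpage, since the in-page header would otherwise cost a whole second item's worth of space; this is the see-saw behavior the commit message describes.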
132804 |
28-Jul-2004 |
alc |
Correct a very old error in both vm_object_madvise() (originating in vm/vm_object.c revision 1.88) and vm_object_sync() (originating in vm/vm_map.c revision 1.36): When descending a chain of backing objects, both use the wrong object's backing offset. Consequently, both may operate on the wrong pages.
Quoting Matt, "This could be responsible for all of the sporadic madvise oddness that has been reported over the years."
Reviewed by: Matt Dillon
|
132684 |
27-Jul-2004 |
alc |
- Use atomic ops for updating the vmspace's refcnt and exitingcnt. - Push down Giant into shmexit(). (Giant is acquired only if the vmspace contains shm segments.) - Eliminate the acquisition of Giant from proc_rwmem(). - Reduce the scope of Giant in exit1(), uncovering the destruction of the address space.
|
132638 |
25-Jul-2004 |
alc |
For years, kmem_alloc_pageable() has been misused. Now that the last of these misuses has been corrected, remove it before new ones appear, such as arm/arm/pmap.c revision 1.8.
|
132636 |
25-Jul-2004 |
alc |
Remove spl calls.
|
132627 |
25-Jul-2004 |
alc |
Make the code and comments for vm_object_coalesce() consistent.
|
132593 |
24-Jul-2004 |
alc |
Simplify vmspace initialization. The bcopy() of fields from the old vmspace to the new vmspace in vmspace_exec() is mostly wasted effort. With one exception, vm_swrss, the copied fields are immediately overwritten. Instead, initialize these fields to zero in vmspace_alloc(), eliminating a bcopy() from vmspace_exec() and a bzero() from vmspace_fork().
|
132550 |
22-Jul-2004 |
alc |
- Change uma_zone_set_obj() to call kmem_alloc_nofault() instead of kmem_alloc_pageable(). The difference between these is that an errant memory access to the zone will be detected sooner with kmem_alloc_nofault().
The following changes serve to eliminate the following lock-order reversal reported by witness:
1st 0xc1a3c084 vm object (vm object) @ vm/swap_pager.c:1311 2nd 0xc07acb00 swap_pager swhash (swap_pager swhash) @ vm/swap_pager.c:1797 3rd 0xc1804bdc vm object (vm object) @ vm/uma_core.c:931
There is no potential deadlock in this case. However, witness is unable to recognize this because vm objects used by UMA have the same type as ordinary vm objects. To remedy this, we make the following changes:
- Add a mutex type argument to VM_OBJECT_LOCK_INIT(). - Use the mutex type argument to assign distinct types to special vm objects such as the kernel object, kmem object, and UMA objects. - Define a static swap zone object for use by UMA. (Only static objects are assigned a special mutex type.)
|
132517 |
21-Jul-2004 |
green |
Fix a race in vm_page_sleep_if_busy(). Due to vm_object locking being incomplete, it currently has to know how to drop and pick back up the vm_object's mutex if it has to sleep and drop the page queue mutex. The problem with this is that if the page is busy, while we are sleeping, the page can be freed and the object can disappear. When trying to lock m->object, we'd get a stale or NULL pointer and crash.
The object is now cached, but this makes the assumption that the object is referenced in some manner and will not itself disappear while it is unlocked. Since this only happens if the object is locked, I had to remove an assumption earlier in contigmalloc() that reversed the order of locking the object and doing vm_page_sleep_if_busy(), not the normal order.
|
132483 |
21-Jul-2004 |
peter |
Semi-gratuitous change. Move two refcount operations to their own lines rather than be buried inside an if (expression). And now that the if expression is the same in both exit paths, use the same ordering.
|
132475 |
21-Jul-2004 |
peter |
Move the initialization and teardown of pmaps to the vmspace zone's init and fini handlers. Our vm system removes all userland mappings at exit prior to calling pmap_release. It just so happens that we might as well reuse the pmap for the next process since the userland slate has already been wiped clean.
However. There is a functional benefit to this as well. For platforms that share userland and kernel context in the same pmap, it means that the kernel portion of a pmap remains valid after the vmspace has been freed (process exit) and while it is in uma's cache. This is significant for i386 SMP systems with kernel context borrowing because it avoids a LOT of IPIs from the pmap_lazyfix() cleanup in the usual case.
Tested on: amd64, i386, sparc64, alpha Glanced at by: alc
|
132420 |
19-Jul-2004 |
green |
Remove extraneous locks on the VM free page queue mutex; it is not meant to be recursed upon, and could cause a deadlock inside the new contigmalloc (vm.old_contigmalloc=0) code.
Submitted by: alc
|
132414 |
19-Jul-2004 |
alc |
- Eliminate the pte object from the pmap. Instead, page table pages are allocated as "no object" pages. Similar changes were made to the amd64 and i386 pmap last year. The primary reason being that maintaining a pte object leads to lock order violations. A secondary reason being that the pte object is redundant, i.e., the page table itself can be used to lookup page table pages. (Historical note: The pte object predates our ability to allocate "no object" pages. Thus, the pte object was a necessary evil.) - Unconditionally check the vm object lock's status in vm_page_remove(). Previously, this assertion could not be made on Alpha due to its use of a pte object.
|
132407 |
19-Jul-2004 |
green |
Since breakage of malloc(9)/uma_zalloc(9) is totally non-optional in GENERIC/for WITNESS users, make sure the sysctl to disable the behavior is read-only and always enabled.
|
132379 |
19-Jul-2004 |
green |
Reimplement contigmalloc(9) with an algorithm which stands a greatly improved chance of working despite pressure from running programs. Instead of trying to throw a bunch of pages out to swap and hope for the best, only a range that can potentially fulfill contigmalloc(9)'s request will have its contents paged out (potentially, not forcibly) at a time.
The new contigmalloc operation still operates in three passes, but it could potentially be tuned to more or less. The first pass only looks at pages in the cache and free pages, so they would be thrown out without having to block. If this is not enough, the subsequent passes page out any unwired memory. To combat memory pressure refragmenting the section of memory being laundered, each page is removed from the system's free memory queue once it has been freed so that blocking later doesn't cause the memory laundered so far to get reallocated.
The page-out operations are now blocking, as it would make little sense to try to push out a page, then get its status immediately afterward to remove it from the available free pages queue, if it's unlikely to have been freed. Another change is that if KVA allocation fails, the allocated memory segment will be freed and not leaked.
There is a sysctl/tunable, defaulting to on, which causes the old contigmalloc() algorithm to be used. Nonetheless, I have been using vm.old_contigmalloc=0 for over a month. It is safe to switch at run-time to see the difference it makes.
A new interface has been used which does not require mapping the allocated pages into KVA: vm_page.h functions vm_page_alloc_contig() and vm_page_release_contig(). These are what vm.old_contigmalloc=0 uses internally, so the sysctl/tunable does not affect their operation.
When using the contigmalloc(9) and contigfree(9) interfaces, memory is now tracked with malloc(9) stats. Several functions have been exported from kern_malloc.c to allow other subsystems to use these statistics, as well. This invalidates the BUGS section of the contigmalloc(9) manpage.
|
132336 |
18-Jul-2004 |
alc |
Remove the GIANT_REQUIRED preceding pmap_remove() in vm_pageout_map_deactivate_pages().
|
132220 |
15-Jul-2004 |
alc |
Push down the acquisition and release of the page queues lock into pmap_protect() and pmap_remove(). In general, they require the lock in order to modify a page's pv list or flags. In some cases, however, pmap_protect() can avoid acquiring the lock.
|
132040 |
12-Jul-2004 |
alc |
Remove an unused and unimplemented sysctl. (For the record, it was marked as unimplemented in revision 1.129 nearly six years ago.)
|
131937 |
10-Jul-2004 |
alc |
Increase the scope of the page queues lock in vm_page_alloc() to cover a diagnostic check that accesses the cache queue count.
|
131719 |
06-Jul-2004 |
alc |
Micro-optimize vmspace for 64-bit architectures: Colocate vm_refcnt and vm_exitingcnt so that alignment does not result in wasted space.
|
131665 |
06-Jul-2004 |
bms |
Properly brucify a string by outdenting it.
|
131573 |
04-Jul-2004 |
bmilekic |
Introduce debug.nosleepwithlocks sysctl, 0 by default. If set to 1 and WITNESS is not built, then force all M_WAITOK allocations to M_NOWAIT behavior (transparently). This is to be used temporarily if weird deadlocks are reported because we still have code paths that perform M_WAITOK allocations with lock(s) held, which can lead to deadlock. If WITNESS is compiled, then the sysctl is ignored and we ask witness to tell us whether we have locks held, converting to M_NOWAIT behavior only if it tells us that we do.
Note this removes the previous mbuf.h inclusion as well (only needed by last revision), and cleans up unneeded [artificial] comparisons to just the mbuf zones. The problem described above has nothing to do with previous mbuf wait behavior; it is a general problem.
|
131572 |
04-Jul-2004 |
green |
Reextend the M_WAITOK-disabling-hack to all three of the mbuf-related zones, and do it by direct comparison of uma_zone_t instead of strcmp.
The mbuf subsystem used to provide M_TRYWAIT/M_DONTWAIT semantics, but this is mostly no longer the case. M_WAITOK has taken over the spot M_TRYWAIT used to have, and for mbuf things, still may return NULL if the code path is incorrectly holding a mutex going into mbuf allocation functions.
The M_WAITOK/M_NOWAIT semantics are absolute; though it may deadlock the system to try to malloc or uma_zalloc something with a mutex held and M_WAITOK specified, it is absolutely required to not return NULL and will result in instability and/or security breaches otherwise. There is still room to add the WITNESS_WARN() to all cases so that we are notified of the possibility of deadlocks, but it cannot change the value of the "badness" variable and allow allocation to actually fail except for the specialized cases which used to be M_TRYWAIT.
|
131528 |
03-Jul-2004 |
green |
Limit mbuma damage. Suddenly ALL allocations with M_WAITOK are subject to failing -- that is, allocations via malloc(M_WAITOK) that are required to never fail -- if WITNESS is not defined. While everyone should be running WITNESS, in any case, zone "Mbuf" allocations are really the only ones that should be screwed with by this hack.
This hack is crashing people, and would continue to do so with or without WITNESS. Things shouldn't be allocating with M_WAITOK with locks held, but it's not okay just to always remove M_WAITOK when !WITNESS.
Reported by: Bernd Walter <ticso@cicely5.cicely.de>
|
131481 |
02-Jul-2004 |
jhb |
Implement preemption of kernel threads natively in the scheduler rather than as one-off hacks in various other parts of the kernel: - Add a function maybe_preempt() that is called from sched_add() to determine if a thread about to be added to a run queue should be preempted to directly. If it is not safe to preempt or if the new thread does not have a high enough priority, then the function returns false and sched_add() adds the thread to the run queue. If the thread should be preempted to but the current thread is in a nested critical section, then the flag TDF_OWEPREEMPT is set and the thread is added to the run queue. Otherwise, mi_switch() is called immediately and the thread is never added to the run queue since it is switched to directly. When exiting an outermost critical section, if TDF_OWEPREEMPT is set, then clear it and call mi_switch() to perform the deferred preemption. - Remove explicit preemption from ithread_schedule() as calling setrunqueue() now does all the correct work. This also removes the do_switch argument from ithread_schedule(). - Do not use the manual preemption code in mtx_unlock if the architecture supports native preemption. - Don't call mi_switch() in a loop during shutdown to give ithreads a chance to run if the architecture supports native preemption since the ithreads will just preempt DELAY(). - Don't call mi_switch() from the page zeroing idle thread for architectures that support native preemption as it is unnecessary. - Native preemption is enabled on the same archs that supported ithread preemption, namely alpha, i386, and amd64.
This change should largely be a NOP for the default case as committed except that we will do fewer context switches in a few cases and will avoid the run queues completely when preempting.
Approved by: scottl (with his re@ hat)
|
131473 |
02-Jul-2004 |
jhb |
- Change mi_switch() and sched_switch() to accept an optional thread to switch to. If a non-NULL thread pointer is passed in, then the CPU will switch to that thread directly rather than calling choosethread() to pick a thread to switch to. - Make sched_switch() aware of idle threads and know to do TD_SET_CAN_RUN() instead of sticking them on the run queue rather than requiring all callers of mi_switch() to know to do this if they can be called from an idlethread. - Move constants for arguments to mi_switch() and thread_single() out of the middle of the function prototypes and up above into their own section.
|
131434 |
02-Jul-2004 |
jhb |
- Don't use a variable to point to the user area that we only use once. Just use p2->p_uarea directly instead. - Remove an old and mostly bogus assertion regarding p2->p_sigacts. - Use RANGEOF macro ala fork1() to clean up bzero/bcopy of p_stats.
|
131256 |
28-Jun-2004 |
tegge |
Initialize result->backing_object_offset before linking result onto the list of vm objects shadowing source in vm_object_shadow(). This closes a race where vm_object_collapse() could be called with a partially uninitialized object argument causing symptoms that looked like hardware problems, e.g. signal 6, 10, 11 or a /bin/sh busy-waiting for a nonexistent child process.
|
131252 |
28-Jun-2004 |
gallatin |
Use MIN() macro rather than ulmin() inline, and fix stray tab that snuck in with my last commit.
Submitted by: green
|
131251 |
28-Jun-2004 |
gallatin |
Fix alpha - the use of min() on longs was losing the high bits and returning wrong answers, leading to strange values in vm2->vm_{s,t,d}size.
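The truncation bug described above is easy to reproduce in userland. The sketch below is illustrative (the helper names are ours, not the kernel's): an int-sized min() returns the wrong answer for 64-bit operands, while a type-generic MIN() macro does not.

```c
#include <assert.h>
#include <stdint.h>

/* An int-sized minimum, standing in for the inline that mishandled
 * longs on alpha. Name is illustrative. */
static unsigned int umin_int(unsigned int a, unsigned int b)
{
	return (a < b ? a : b);
}

/* The type-generic macro that preserves the full width. */
#define MIN(a, b) (((a) < (b)) ? (a) : (b))

/* Funneling 64-bit values through the int-sized helper loses the
 * high bits: 0x100000001 truncates to 1 and 0x100000000 to 0, so
 * the "minimum" comes back as 0 instead of 0x100000000. */
static int64_t bad_min(int64_t a, int64_t b)
{
	return (int64_t)umin_int((unsigned int)a, (unsigned int)b);
}
```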
|
131163 |
27-Jun-2004 |
das |
Update a stale comment. The heuristic to swap processes out based on the number of pages already paged out was broken in rev 1.10 and removed in rev 1.11.
|
131152 |
26-Jun-2004 |
alc |
Remove an unused field from the vmspace structure.
|
131073 |
24-Jun-2004 |
green |
Correct the tracking of various bits of the process's vmspace and vm_map when not propagated on fork (due to minherit(2)). Consistency checks otherwise fail when the vm_map is freed and it appears to have not been emptied completely, causing an INVARIANTS panic in vm_map_zdtor().
PR: kern/68017 Submitted by: Mark W. Krentel <krentel@dreamscape.com> Reviewed by: alc
|
131027 |
24-Jun-2004 |
alc |
Call vm_pageout_page_stats() with the page queues lock held.
|
131023 |
24-Jun-2004 |
alc |
Remove spl calls.
|
130995 |
23-Jun-2004 |
bmilekic |
Make uma_mtx MTX_RECURSE. Here's why:
The general UMA lock is a recursion-allowed lock because there is a code path where, while we're still configured to use startup_alloc() for backend page allocations, we may end up in uma_reclaim() which calls zone_foreach(zone_drain), which grabs uma_mtx, only to later call into startup_alloc() because while freeing we needed to allocate a bucket. Since startup_alloc() also takes uma_mtx, we need to be able to recurse on it.
This exact explanation also added as comment above mtx_init().
Trace showing recursion reported by: Peter Holm <peter-at-holm.cc>
|
130979 |
23-Jun-2004 |
bms |
In swap_pager_getpages(), bp->b_dev can be NULL, particularly for the case of NFS mounted swap, so do not try to dereference it.
While we're here, brucify the printf() call which happens when we time out on acquisition of vm_page_queue_mtx.
PR: kern/67898 Submitted by: bde (style)
|
130710 |
19-Jun-2004 |
alc |
Remove spl() calls. Update comments to reflect the removal of spl() calls. Remove '\n' from panic() format strings. Remove some blank lines.
|
130640 |
17-Jun-2004 |
phk |
Second half of the dev_t cleanup.
The big lines are: NODEV -> NULL NOUDEV -> NODEV udev_t -> dev_t udev2dev() -> findcdev()
Various minor adjustments including handling of userland access to kernel space struct cdev etc.
|
130626 |
17-Jun-2004 |
alc |
Do not preset PG_BUSY on VM_ALLOC_NOOBJ pages. Such pages are not accessible through an object. Thus, PG_BUSY serves no purpose.
|
130585 |
16-Jun-2004 |
phk |
Do the dreaded s/dev_t/struct cdev */ Bump __FreeBSD_version accordingly.
|
130551 |
16-Jun-2004 |
julian |
Nice is a property of a process as a whole. I mistakenly moved it to the ksegroup when breaking up the process structure. Put it back in the proc structure.
|
130502 |
15-Jun-2004 |
green |
Make contigmalloc() more reliable:
1. Remove a race whereby contigmalloc() would deadlock against the running processes in the system if they kept reinstantiating the memory on the active and inactive page queues that it was trying to flush out. The process doing the contigmalloc() would sit in "swwrt" forever and the swap pager would be going at full force, but never get anywhere. Instead of doing it until the queues are empty, launder for as many iterations as there are pages in the queue. 2. Do all laundering to swap synchronously; previously, the vnode laundering was synchronous and the swap laundering not. 3. Increase the number of launder-or-allocate passes to three, from two, while failing without bothering to do all the laundering on the third pass if allocation was not possible. This effectively gives exactly two chances to launder enough contiguous memory, helpful with high memory churn where a lot of memory from one pass to the next (and during a single laundering loop) becomes dirtied again.
I can now reliably hot-plug hardware requiring a 256KB contigmalloc() without having the kldload/cbb ithread sit around failing to make progress, while running a busy X session. Previously, it took killing X to get contigmalloc() to get further (that is, quiescing the system), and even then contigmalloc() returned failure.
|
130344 |
11-Jun-2004 |
phk |
Deorbit COMPAT_SUNOS.
We inherited this from the sparc32 port of BSD4.4-Lite1. We have neither a sparc32 port nor a SunOS4.x compatibility desire these days.
|
130283 |
09-Jun-2004 |
bmilekic |
Backout previous change, I think Julian has a better solution which does not require type-stable refcnts here.
|
130278 |
09-Jun-2004 |
bmilekic |
Make the slabrefzone, the zone from which we allocate slabs with internal reference counters, UMA_ZONE_NOFREE. This way, those slabs (with their ref counts) will be effectively type-stable, and using a trick like this on the refcount is no longer dangerous:
	MEXT_REM_REF(m);
	if (atomic_cmpset_int(m->m_ext.ref_cnt, 0, 1)) {
		if (m->m_ext.ext_type == EXT_PACKET) {
			uma_zfree(zone_pack, m);
			return;
		} else if (m->m_ext.ext_type == EXT_CLUSTER) {
			uma_zfree(zone_clust, m->m_ext.ext_buf);
			m->m_ext.ext_buf = NULL;
		} else {
			(*(m->m_ext.ext_free))(m->m_ext.ext_buf,
			    m->m_ext.ext_args);
			if (m->m_ext.ext_type != EXT_EXTREF)
				free(m->m_ext.ref_cnt, M_MBUF);
		}
	}
	uma_zfree(zone_mbuf, m);
Previously, a second thread hitting the above cmpset might actually read the refcnt AFTER it has already been freed. A very rare occurrence. Now we'll know that it won't be freed, though.
Spotted by: julian, pjd
|
130201 |
07-Jun-2004 |
netchild |
Remove references to L1 in the comments, according to Alan they are historical leftovers.
Approved by: alc
|
130137 |
05-Jun-2004 |
alc |
Update stale comments regarding page coloring.
|
130049 |
04-Jun-2004 |
alc |
Move the definitions of SWAPBLK_NONE and SWAPBLK_MASK from vm_page.h to blist.h, enabling the removal of numerous #includes from subr_blist.c. (subr_blist.c and swap_pager.c are the only users of these definitions.)
|
129913 |
01-Jun-2004 |
bmilekic |
Fix a comment above uma_zsecond_create(), describing its arguments. It doesn't take 'align' and 'flags' but 'master' instead, which is a reference to the Master Zone, containing the backing Keg.
Pointed out by: Tim Robbins (tjr)
|
129906 |
31-May-2004 |
bmilekic |
Bring in mbuma to replace mballoc.
mbuma is an Mbuf & Cluster allocator built on top of a number of extensions to the UMA framework, all included herein.
Extensions to UMA worth noting: - Better layering between slab <-> zone caches; introduce Keg structure which splits off slab cache away from the zone structure and allows multiple zones to be stacked on top of a single Keg (single type of slab cache); perhaps we should look into defining a subset API on top of the Keg for special use by malloc(9), for example. - UMA_ZONE_REFCNT zones can now be added, and reference counters automagically allocated for them within the end of the associated slab structures. uma_find_refcnt() does a kextract to fetch the slab struct reference from the underlying page, and lookup the corresponding refcnt.
mbuma things worth noting: - integrates mbuf & cluster allocations with extended UMA and provides caches for commonly-allocated items; defines several zones (two primary, one secondary) and two kegs. - change up certain code paths that always used to do: m_get() + m_clget() to instead just use m_getcl() and try to take advantage of the newly defined secondary Packet zone. - netstat(1) and systat(1) quickly hacked up to do basic stat reporting but additional stats work needs to be done once some other details within UMA have been taken care of and it becomes clearer to how stats will work within the modified framework.
From the user perspective, one implication is that the NMBCLUSTERS compile-time option is no longer used. The maximum number of clusters is still capped off according to maxusers, but it can be made unlimited by setting the kern.ipc.nmbclusters boot-time tunable to zero. Work should be done to write an appropriate sysctl handler allowing dynamic tuning of kern.ipc.nmbclusters at runtime.
Additional things worth noting/known issues (READ): - One report of 'ips' (ServeRAID) driver acting really slow in conjunction with mbuma. Need more data. Latest report is that ips is equally sucking with and without mbuma. - Giant leak in NFS code sometimes occurs, can't reproduce but currently analyzing; brueffer is able to reproduce but THIS IS NOT an mbuma-specific problem and currently occurs even WITHOUT mbuma. - Issues in network locking: there is at least one code path in the rip code where one or more locks are acquired and we end up in m_prepend() with M_WAITOK, which causes WITNESS to whine from within UMA. Current temporary solution: force all UMA allocations to be M_NOWAIT from within UMA for now to avoid deadlocks unless WITNESS is defined and we can determine with certainty that we're not holding any locks when we're M_WAITOK. - I've seen at least one weird socketbuffer empty-but- mbuf-still-attached panic. I don't believe this to be related to mbuma but please keep your eyes open, turn on debugging, and capture crash dumps.
This change removes more code than it adds.
A paper is available detailing the change and considering various performance issues, it was presented at BSDCan2004: http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf Please read the paper for Future Work and implementation details, as well as credits.
Testing and Debugging: rwatson, brueffer, Ketrien I. Saihr-Kesenchedra, ... Reviewed by: Lots of people (for different parts)
|
129883 |
30-May-2004 |
alc |
Remove a stale comment: PG_DIRTY and PG_FILLED were removed in revisions 1.17 and 1.12 respectively.
|
129857 |
30-May-2004 |
hmp |
Correct typo: vm_page_list_find() has been called vm_pageq_find() for quite a long time, i.e., since the cleanup of the VM Page-queues code done two years ago.
Reviewed by: Alan Cox <alc at freebsd.org>, Matthew Dillon <dillon at backplane.com>
|
129729 |
25-May-2004 |
des |
MFS: vm_map.c rev 1.187.2.27 through 1.187.2.29, fix MS_INVALIDATE semantics but provide a sysctl knob for reverting to old ones.
|
129728 |
25-May-2004 |
des |
Back out previous commit; it went to the wrong file.
|
129725 |
25-May-2004 |
des |
MFS: rev 1.187.2.27 through 1.187.2.29, fix MS_INVALIDATE semantics but provide a sysctl knob for reverting to old ones.
|
129701 |
25-May-2004 |
alc |
Correct two error cases in vm_map_unwire():
1. Contrary to the Single Unix Specification our implementation of munlock(2) when performed on an unwired virtual address range has returned an error. Correct this. Note, however, that the behavior of "system" unwiring is unchanged, only "user" unwiring is changed. If "system" unwiring is performed on an unwired virtual address range, an error is still returned.
2. Performing an errant "system" unwiring on a virtual address range that was "user" (i.e., mlock(2)) but not "system" wired would incorrectly undo the "user" wiring instead of returning an error. Correct this.
Discussed with: green@ Reviewed by: tegge@
|
129571 |
22-May-2004 |
alc |
To date, unwiring a fictitious page has produced a panic. The reason being that PHYS_TO_VM_PAGE() returns the wrong vm_page for fictitious pages but unwiring uses PHYS_TO_VM_PAGE(). The resulting panic reported an unexpected wired count. Rather than attempting to fix PHYS_TO_VM_PAGE(), this fix takes advantage of the properties of fictitious pages. Specifically, fictitious pages will never be completely unwired. Therefore, we can keep a fictitious page's wired count forever set to one and thereby avoid the use of PHYS_TO_VM_PAGE() when we know that we're working with a fictitious page, just not which one.
In collaboration with: green@, tegge@ PR: kern/29915
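The invariant the fix relies on can be sketched as a tiny helper (names and the flag value are ours, not the kernel's): a fictitious page's wire count is pinned at one, so unwiring it is a no-op and never needs PHYS_TO_VM_PAGE().

```c
#include <assert.h>

#define PG_FICTITIOUS_SKETCH 0x01	/* illustrative flag value */

/* Returns the wire count after one unwire operation. Fictitious
 * pages stay pinned at one forever; ordinary pages decrement. */
static int unwire_count(int flags, int wire_count)
{
	if (flags & PG_FICTITIOUS_SKETCH)
		return 1;
	return (wire_count > 0 ? wire_count - 1 : 0);
}
```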
|
129145 |
12-May-2004 |
alc |
Restructure vm_page_select_cache() so that adding assertions is easy.
Some of the conditions that caused vm_page_select_cache() to deactivate a page were wrong. For example, deactivating an unmanaged or wired page is a nop. Thus, if vm_page_select_cache() had ever encountered an unmanaged or wired page, it would have looped forever. Now, we assert that the page is neither unmanaged nor wired.
|
129143 |
12-May-2004 |
alc |
Cache queue pages are not mapped. Thus, the pmap_remove_all() by vm_pageout_scan()'s loop for freeing cache queue pages is unnecessary.
|
129110 |
11-May-2004 |
tjr |
To handle orphaned character device vnodes properly in mmap(), check that v_mount is non-null before dereferencing it. If it's null, behave as if MNT_NOEXEC was not set on the mount that originally contained it.
|
129057 |
09-May-2004 |
alc |
Cache queue pages are not mapped. Thus, the pmap_remove_all() by vm_page_alloc() is unnecessary.
|
129028 |
07-May-2004 |
green |
In r1.190, vslock() and vsunlock() were bogusly made to do a "user wire" and a "system unwire." Make this a "system wire" and "system unwire."
Reviewed by: alc
|
129018 |
07-May-2004 |
green |
Properly remove MAP_FUTUREWIRE when a vm_map_entry gets torn down. Previously, mlockall(2) usage would leak MAP_FUTUREWIRE of the process's vmspace::vm_map and subsequent processes would wire all of their memory. Coupled with a wired-page leak in vm_fault_unwire(), this would run the system out of free pages and cause programs to randomly SIGBUS when faulting in new pages.
(Note that this is not the fix for the latter part; pages are still leaked when a wired area is unmapped in some cases.)
Reviewed by: alc PR: kern/62930
|
128992 |
06-May-2004 |
alc |
Make vm_page's PG_ZERO flag immutable between the time of the page's allocation and deallocation. This flag's principal use is shortly after allocation. For such cases, clearing the flag is pointless. The only unusual use of PG_ZERO is in vfs_bio_clrbuf(). However, allocbuf() never requests a prezeroed page. So, vfs_bio_clrbuf() never sees a prezeroed page.
Reviewed by: tegge@
|
128633 |
25-Apr-2004 |
alc |
Zero the physical page only if it is invalid and not prezeroed.
|
128620 |
24-Apr-2004 |
alc |
Add a VM_OBJECT_LOCK_ASSERT() call. Remove splvm() and splx() calls. Move a comment.
|
128614 |
24-Apr-2004 |
alc |
Update the comment describing vm_page_grab() to reflect the previous revision and correct some of its style errors.
|
128613 |
24-Apr-2004 |
alc |
Push down the responsibility for zeroing a physical page from the caller to vm_page_grab(). Although this gives VM_ALLOC_ZERO a different meaning for vm_page_grab() than for vm_page_alloc(), I feel such change is necessary to accomplish other goals. Specifically, I want to make the PG_ZERO flag immutable between the time it is allocated by vm_page_alloc() and freed by vm_page_free() or vm_page_free_zero() to avoid locking overheads. Once we gave up on the ability to automatically recognize a zeroed page upon entry to vm_page_free(), the ability to mutate the PG_ZERO flag became useless. Instead, I would like to say that "Once a page becomes valid, its PG_ZERO flag must be ignored."
|
128596 |
24-Apr-2004 |
alc |
In cases where a file was resident in memory mmap(..., PROT_NONE, ...) would actually map the file with read access enabled. According to http://www.opengroup.org/onlinepubs/007904975/functions/mmap.html this is an error. Similarly, an madvise(..., MADV_WILLNEED) would enable read access on a virtual address range that was PROT_NONE.
The solution implemented herein is (1) to pass a vm_prot_t to vm_map_pmap_enter() describing the allowed access and (2) to make vm_map_pmap_enter() responsible for understanding the limitations of pmap_enter_quick().
Submitted by: "Mark W. Krentel" <krentel@dreamscape.com> PR: kern/64573
|
128570 |
23-Apr-2004 |
alc |
Push down Giant into vm_pager_get_pages(). The only get pages methods that require Giant are in the device and vnode pagers.
|
128097 |
10-Apr-2004 |
alc |
- pmap_kenter_temporary() is unused by machine-independent code. Therefore, move its declaration to the machine-dependent header file on those machines that use it. In principle, only i386 should have it. Alpha and AMD64 should use their direct virtual-to-physical mapping. - Remove pmap_kenter_temporary() from ia64. It is unused. Approved by: marcel@
|
128038 |
08-Apr-2004 |
alc |
The demise of vm_pager_map_page() in revision 1.93 of vm/vm_pager.c permits the reduction of the pager map's size by 8M bytes. In other words, eight megabytes of largely wasted KVA are returned to the kernel map for use elsewhere.
|
127961 |
06-Apr-2004 |
imp |
Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999.
Approved by: core
|
127926 |
06-Apr-2004 |
alc |
Eliminate vm_pager_map_page() and vm_pager_unmap_page() and their uses. Use sf_buf_alloc() and sf_buf_free() instead.
|
127879 |
05-Apr-2004 |
kan |
Delay permission checks for VCHR vnodes until after vnode is locked in vm_mmap_vnode function, where we can safely check for a special /dev/zero case. Rev. 1.180 has reordered checks and introduced a regression.
Submitted by: alc Was broken by: kan
|
127869 |
05-Apr-2004 |
alc |
Remove unused arguments from pmap_init().
|
127868 |
04-Apr-2004 |
alc |
Eliminate unused arguments from vm_page_startup().
|
127327 |
23-Mar-2004 |
tjr |
Do not copy vm_exitingcnt to the new vmspace in vmspace_exec(). Copying it led to impossibly high values in the new vmspace, causing it to never drop to 0 and be freed.
|
127187 |
18-Mar-2004 |
guido |
When mmap-ing a file from a noexec mount, be sure not to grant the right to mmap it PROT_EXEC. This also depends on the architecture, as some architectures (e.g. i386) do not distinguish between read and exec pages.
Inspired by: http://linux.bkbits.net:8080/linux-2.4/cset@1.1267.1.85 Reviewed by: alc
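A minimal sketch of the policy, assuming the fix strips execute permission from the mapping's maximum protection rather than failing the mmap(); flag values and the helper name are illustrative, not FreeBSD's.

```c
#include <assert.h>

#define PROT_READ_SK	0x01
#define PROT_WRITE_SK	0x02
#define PROT_EXEC_SK	0x04
#define MNT_NOEXEC_SK	0x10

/* Cap the protection a mapping may ever gain: a file on a noexec
 * mount never yields PROT_EXEC. */
static int cap_maxprot(int maxprot, int mnt_flags)
{
	if (mnt_flags & MNT_NOEXEC_SK)
		maxprot &= ~PROT_EXEC_SK;
	return maxprot;
}
```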
|
127013 |
15-Mar-2004 |
truckman |
Make overflow/wraparound checking more robust and unbreak len=0 in vslock(), mlock(), and munlock().
Reviewed by: bde
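A sketch of the kind of wraparound check being described, not the committed code: reject ranges whose end wraps past the top of the address space while still accepting len == 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Validate [addr, addr + len): the sum wrapping around is the only
 * way end can come out below addr, so that comparison catches
 * overflow, and a zero-length range passes trivially. */
static int range_ok(uintptr_t addr, size_t len)
{
	uintptr_t end = addr + len;

	if (end < addr)
		return 0;	/* addr + len wrapped around */
	return 1;
}
```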
|
127008 |
15-Mar-2004 |
truckman |
Style(9) changes.
Pointed out by: bde
|
127007 |
15-Mar-2004 |
truckman |
Revert to the original vslock() and vsunlock() API with the following exceptions: Retain the recently added vslock() error return.
The type of the len argument should be size_t, not u_int.
Suggested by: bde
|
127006 |
15-Mar-2004 |
truckman |
Remove redundant suser() check.
|
126911 |
13-Mar-2004 |
alc |
Remove GIANT_REQUIRED from contigfree().
|
126865 |
12-Mar-2004 |
peter |
Part 2 of rev 1.68. Update comment to match reality now that vm_endcopy exists and we no longer copy to the end of the struct.
Forgotten by: alfred and green
|
126793 |
10-Mar-2004 |
alc |
- Make the acquisition of Giant in vm_fault_unwire() conditional on the pmap. For the kernel pmap, Giant is not required. In general, for other pmaps, Giant is required by i386's pmap_pte() implementation. Specifically, the use of PMAP2/PADDR2 is synchronized by Giant. Note: In principle, updates to the kernel pmap's wired count could be lost without Giant. However, in practice, we never use the kernel pmap's wired count. This will be resolved when pmap locking appears. - With the above change, cpu_thread_clean() and uma_large_free() need not acquire Giant. (The first case is simply the revival of i386/i386/vm_machdep.c's revision 1.226 by peter.)
|
126739 |
08-Mar-2004 |
alc |
Implement a work around for the deadlock avoidance case in vm_object_deallocate() so that it doesn't spin forever either.
Submitted by: bde
|
126728 |
07-Mar-2004 |
alc |
Retire pmap_pinit2(). Alpha was the last platform that used it. However, ever since alpha/alpha/pmap.c revision 1.81 introduced the list allpmaps, there has been no reason for having this function on Alpha. Briefly, when pmap_growkernel() relied upon the list of all processes to find and update the various pmaps to reflect a growth in the kernel's valid address space, pmap_pinit2() served to avoid a race between pmap initialization and pmap_growkernel(). Specifically, pmap_pinit2() was responsible for initializing the kernel portions of the pmap and pmap_pinit2() was called after the process structure contained a pointer to the new pmap for use by pmap_growkernel(). Thus, an update to the kernel's address space might be applied to the new pmap unnecessarily, but an update would never be lost.
|
126714 |
07-Mar-2004 |
rwatson |
Mark uma_callout as CALLOUT_MPSAFE, as uma_timeout can run MPSAFE.
Reviewed by: jeff
|
126668 |
05-Mar-2004 |
truckman |
Undo the merger of mlock()/vslock() and munlock()/vsunlock() and the introduction of kern_mlock() and kern_munlock() in src/sys/kern/kern_sysctl.c 1.150 src/sys/vm/vm_extern.h 1.69 src/sys/vm/vm_glue.c 1.190 src/sys/vm/vm_mmap.c 1.179 because different resource limits are appropriate for transient and "permanent" page wiring requests.
Retain the kern_mlock() and kern_munlock() API in the revived vslock() and vsunlock() functions.
Combine the best parts of each of the original sets of implementations with further code cleanup. Make the mlock() and vslock() implementations as similar as possible.
Retain the RLIMIT_MEMLOCK check in mlock(). Move the most stringent test, which can return EAGAIN, last so that requests that have no hope of ever being satisfied will not be retried unnecessarily.
Disable the test that can return EAGAIN in the vslock() implementation because it will cause the sysctl code to wedge.
Tested by: Cy Schubert <Cy.Schubert AT komquats.com>
|
126632 |
05-Mar-2004 |
alc |
In the last revision, I introduced a physical contiguity check that is both unnecessary and wrong. While it is necessary to verify that the page is still free after dropping and reacquiring the free page queue lock, the physical contiguity of the page can not change, making this check unnecessary. This check was wrong in that it could cause an out-of-bounds array access.
Tested by: rwatson
|
126588 |
04-Mar-2004 |
bde |
Record exactly where this file was copied from. It wasn't repo-copied so this is not very obvious.
Fixed some style bugs (mainly missing parentheses around return values).
|
126585 |
04-Mar-2004 |
bde |
Minor style fixes. In vm_daemon(), don't fetch the rss limit long before it is needed.
|
126571 |
04-Mar-2004 |
alc |
Remove some long unused definitions.
|
126479 |
02-Mar-2004 |
alc |
Modify contigmalloc1() so that the free page queues lock is not held when vm_page_free() is called. The problem with holding this lock is that it is a spin lock and vm_page_free() may attempt the acquisition of a different default-type lock.
|
126424 |
01-Mar-2004 |
kan |
Pick up a do {} while(0) cleanup by phk that was discarded accidentally in the previous revision.
Submitted by: alc
|
126332 |
27-Feb-2004 |
kan |
Move the code dealing with vnode out of several functions into a single helper function vm_mmap_vnode.
Discussed with: jeffr,alc (a while ago)
|
126253 |
26-Feb-2004 |
truckman |
Split the mlock() kernel code into two parts, mlock(), which unpacks the syscall arguments and does the suser() permission check, and kern_mlock(), which does the resource limit checking and calls vm_map_wire(). Split munlock() in a similar way.
Enable the RLIMIT_MEMLOCK checking code in kern_mlock().
Replace calls to vslock() and vsunlock() in the sysctl code with calls to kern_mlock() and kern_munlock() so that the sysctl code will obey the wired memory limits.
Nuke the vslock() and vsunlock() implementations, which are no longer used.
Add a member to struct sysctl_req to track the amount of memory that is wired to handle the request.
Modify sysctl_wire_old_buffer() to return an error if its call to kern_mlock() fails. Only wire the minimum of the length specified in the sysctl request and the length specified in its argument list. Sysctl handlers that use sysctl_wire_old_buffer() should specify reasonable estimates for the amount of data they want to return so that only the minimum amount of memory is wired no matter what length has been specified by the request.
Modify the callers of sysctl_wire_old_buffer() to look for the error return.
Modify sysctl_old_user to obey the wired buffer length and clean up its implementation.
Reviewed by: bms
|
126135 |
23-Feb-2004 |
alc |
- Substitute bdone() and bwait() from vfs_bio.c for swap_pager_putpages()'s buffer completion code. Note: the only difference between swp_pager_sync_iodone() and bdone(), aside from the locking in the latter, was the unnecessary clearing of B_ASYNC. - Remove an unnecessary pmap_page_protect() from swp_pager_async_iodone().
Reviewed by: tegge
|
126108 |
22-Feb-2004 |
alc |
Correct a long-standing race condition in vm_object_page_remove() that could result in a dirty page being unintentionally freed.
Reviewed by: tegge MFC after: 7 days
|
126088 |
21-Feb-2004 |
alc |
Eliminate the second, unnecessary call to pmap_page_protect() near the end of vm_pageout_flush(). Instead, assert that the page is still write protected.
Discussed with: tegge
|
125990 |
19-Feb-2004 |
alc |
- Correct a long-standing race condition in vm_page_try_to_free() that could result in a dirty page being unintentionally freed. - Simplify the dirty page check in vm_page_dontneed().
Reviewed by: tegge MFC after: 7 days
|
125889 |
16-Feb-2004 |
des |
Back out previous commit due to objections.
|
125882 |
16-Feb-2004 |
des |
Don't panic if we fail to satisfy an M_WAITOK request; return 0 instead. The calling code will either handle that gracefully or cause a page fault.
|
125861 |
16-Feb-2004 |
alc |
Correct a long-standing race condition in vm_contig_launder() that could result in a panic "vm_page_cache: caching a dirty page, ...": Access to the page must be restricted or removed before calling vm_page_cache(). This race condition is identical in nature to that which was addressed by vm_pageout.c's revision 1.251 and vm_page.c's revision 1.275.
MFC after: 7 days
|
125838 |
15-Feb-2004 |
alc |
Correct a long-standing race condition in vm_fault() that could result in a panic "vm_page_cache: caching a dirty page, ...": Access to the page must be restricted or removed before calling vm_page_cache(). This race condition is identical in nature to that which was addressed by vm_pageout.c's revision 1.251 and vm_page.c's revision 1.275.
Reviewed by: tegge MFC after: 7 days
|
125798 |
14-Feb-2004 |
alc |
- Correct a long-standing race condition in vm_page_try_to_cache() that could result in a panic "vm_page_cache: caching a dirty page, ...": Access to the page must be restricted or removed before calling vm_page_cache(). This race condition is identical in nature to that which was addressed by vm_pageout.c's revision 1.251. - Simplify the code surrounding the fix to this same race condition in vm_pageout.c's revision 1.251. There should be no behavioral change. Reviewed by: tegge
MFC after: 7 days
|
125755 |
12-Feb-2004 |
phk |
Remove the absolute count g_access_abs() function since experience has shown that it is not useful.
Rename the relative count g_access_rel() function to g_access(), only the name has changed.
Change all g_access_rel() calls in our CVS tree to call g_access() instead.
Add an #ifndef BURN_BRIDGES #define of g_access_rel() for source code compatibility.
|
125748 |
12-Feb-2004 |
alc |
Further reduce the use of Giant in vm_map_delete(): Perform pmap_remove() on system maps, besides the kmem_map, without Giant.
In collaboration with: tegge
|
125662 |
10-Feb-2004 |
alc |
Correct a long-standing race condition in the inactive queue scan. (See the added comment for low-level details.) The effect of this race condition is a panic "vm_page_cache: caching a dirty page, ..."
Reviewed by: tegge MFC after: 7 days
|
125558 |
07-Feb-2004 |
alc |
swp_pager_async_iodone() no longer requires Giant. Modify bufdone() and swapgeom_done() to perform swp_pager_async_iodone() without Giant.
Reviewed by: tegge
|
125470 |
05-Feb-2004 |
alc |
- Locking for the per-process resource limits structure has eliminated the need for Giant in vm_map_growstack(). - Use the proc * that is passed to vm_map_growstack() rather than curthread->td_proc.
|
125454 |
04-Feb-2004 |
jhb |
Locking for the per-process resource limits structure. - struct plimit includes a mutex to protect a reference count. The plimit structure is treated similarly to struct ucred in that it is always copy-on-write, so having a reference to a structure is sufficient to read from it without needing a further lock. - The proc lock protects the p_limit pointer and must be held while reading limits from a process to keep the limit structure from changing out from under you while reading from it. - Various global limits that are ints are not protected by a lock since int writes are atomic on all the archs we support and thus a lock wouldn't buy us anything. - All accesses to individual resource limits from a process are abstracted behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return either an rlimit, or the current or max individual limit of the specified resource from a process. - dosetrlimit() was renamed to kern_setrlimit() to match the existing style of other similar syscall helper functions. - The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit() (it didn't use the stackgap when it should have) but uses lim_rlimit() and kern_setrlimit() instead. - The svr4 compat no longer uses the stackgap for resource limits calls, but uses lim_rlimit() and kern_setrlimit() instead. - The ibcs2 compat no longer uses the stackgap for resource limits. It also no longer uses the stackgap for accessing sysctls for the ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result, ibcs2_sysconf() no longer needs Giant. - The p_rlimit macro no longer exists.
Submitted by: mtm (mostly, I only did a few cleanups and catchups) Tested on: i386 Compiled on: alpha, amd64
|
125362 |
02-Feb-2004 |
jhb |
Drop the reference count on the old vmspace after fully switching the current thread to the new vmspace.
Suggested by: dillon
|
125322 |
02-Feb-2004 |
phk |
Check error return from g_clone_bio(). (netchild@)
Add XXX comment about why this is still not optimal. (phk@)
Submitted by: netchild@
|
125314 |
02-Feb-2004 |
jeff |
- Use a separate startup function for the zeroidle kthread. Use this to set P_NOLOAD prior to running the thread.
|
125294 |
01-Feb-2004 |
jeff |
- Fix a problem where we did not drain the cache of buckets in the zone when uma_reclaim() was called. This was introduced when the zone working-set algorithm was removed in favor of using the per cpu caches as the working set.
|
125246 |
30-Jan-2004 |
des |
Mechanical whitespace cleanup.
|
125193 |
29-Jan-2004 |
bde |
Fixed breakage of scheduling in rev.1.29 of subr_4bsd.c. The "scheduler" here has very little to do with scheduling. It is actually the swapper, and it really must be the last SYSINIT'ed item like its comment says, since proc0 metamorphoses into swapper by calling scheduler() last in mi_startup(), and scheduler() never returns. Rev.1.29 of subr_4bsd.c broke this by adding another SI_ORDER_FIRST item (kproc_start() for schedcpu_thread()) onto the SI_SUB_RUN_SCHEDULER list. The sorting of SYSINITs with identical orders (at all levels) is apparently nondeterministic, so this resulted in scheduler() sometimes being called second-last and schedcpu_thread() not being called at all.
This quick fix just changes the code to almost match the comment (SI_ORDER_FIRST -> SI_ORDER_ANY). "LAST" is misspelled "ANY", and there is no way to ensure that there is only one very last SYSINIT. A more complete fix would remove the SYSINIT obfuscation.
|
124944 |
25-Jan-2004 |
jeff |
- Add a flags parameter to mi_switch. The value of flags may be SW_VOL or SW_INVOL. Assert that one of these is set in mi_switch() and properly adjust the rusage statistics. This is to simplify the large number of users of this interface which were previously all required to adjust the proper counter prior to calling mi_switch(). This also facilitates more switch and locking optimizations. - Change all callers of mi_switch() to pass the appropriate parameter and remove direct references to the process statistics.
|
124933 |
24-Jan-2004 |
alc |
1. Statically initialize swap_pager_full and swap_pager_almost_full to the full state. (When swap is added their state will change appropriately.) 2. Set swap_pager_full and swap_pager_almost_full to the full state when the last swap device is removed. Combined, these changes eliminate nonsense messages from the kernel on swapless machines.
Item 2 submitted by: Divacky Roman <xdivac02@stud.fit.vutbr.cz> Prodding by: phk
|
124649 |
18-Jan-2004 |
alc |
Increase UMA_BOOT_PAGES because of changes to pv entry initialization in revision 1.457 of i386/i386/pmap.c.
|
124646 |
18-Jan-2004 |
alc |
Don't acquire Giant in vm_object_deallocate() unless the object is vnode- backed.
|
124513 |
14-Jan-2004 |
alc |
Remove vm_page_alloc_contig(). It's now unused.
|
124366 |
11-Jan-2004 |
alc |
Remove long dead code, specifically, code related to munmapfd(). (See also vm/vm_mmap.c revision 1.173.)
|
124353 |
10-Jan-2004 |
alc |
- Unmanage pages allocated by contigmalloc1(). (There is no point in having PV entries for these pages.) - Remove splvm() and splx() calls.
|
124321 |
10-Jan-2004 |
alc |
Unmanage pages allocated by kmem_alloc(). (There is no point in having PV entries for these pages.)
|
124261 |
08-Jan-2004 |
alc |
- Enable recursive acquisition of the mutex synchronizing access to the free pages queue. This is presently needed by contigmalloc1(). - Move a sanity check against attempted double allocation of two pages to the same vm object offset from vm_page_alloc() to vm_page_insert(). This provides better protection because double allocation could occur through a direct call to vm_page_insert(), such as that by vm_page_rename(). - Modify contigmalloc1() to hold the mutex synchronizing access to the free pages queue while it scans vm_page_array in search of free pages. - Correct a potential leak of pages by contigmalloc1() that I introduced in revision 1.20: We must convert all cache queue pages to free pages before we begin removing free pages from the free queue. Otherwise, if we have to restart the scan because we are unable to acquire the vm object lock that is necessary to convert a cache queue page to a free page, we leak those free pages already removed from the free queue.
|
124195 |
06-Jan-2004 |
alc |
Don't bother clearing PG_ZERO in contigmalloc1(), kmem_alloc(), or kmem_malloc(). It serves no purpose.
|
124133 |
04-Jan-2004 |
alc |
Simplify the various pager allocation routines by computing the desired object size once and assigning that value to a local variable.
|
124117 |
04-Jan-2004 |
alc |
Eliminate the acquisition and release of Giant from vnode_pager_alloc(). The vm object and vnode locking should suffice.
Discussed with: jeff
|
124110 |
03-Jan-2004 |
alc |
Reduce the scope of Giant in swap_pager_alloc().
|
124084 |
02-Jan-2004 |
alc |
Revision 1.74 of vm_meter.c ("Avoid lock-order reversal") makes the release and subsequent reacquisition of the same vm object lock in vm_object_collapse() unnecessary.
|
124083 |
02-Jan-2004 |
alc |
Avoid lock-order reversal between the vm object list mutex and the vm object mutex.
|
124048 |
01-Jan-2004 |
alc |
- Increase the scope of the kmem_object's lock in kmem_malloc(). Add a comment explaining why a further increase is not possible.
|
124028 |
31-Dec-2003 |
alc |
In vm_page_lookup() check the root of the vm object's splay tree for the desired page before calling vm_page_splay().
|
124012 |
31-Dec-2003 |
alc |
Simplify vm_page_grab(): Don't bother with the generation check. If the vm object hasn't changed, the desired page will be at or near the root of the vm object's splay tree, making vm_page_lookup() cheap. (The only lock required for vm_page_lookup() is already held.) If, however, the vm object has changed and retry was requested, eliminating the generation check also eliminates a pointless acquisition and release of the page queues lock.
|
124008 |
30-Dec-2003 |
alc |
- Modify vm_object_split() to expect a locked vm object on entry and return with a locked vm object on exit. Remove GIANT_REQUIRED. - Eliminate some unnecessary local variables from vm_object_split().
|
123948 |
29-Dec-2003 |
alc |
Remove swap_pager_un_object_list; it is unused.
|
123914 |
28-Dec-2003 |
alc |
Remove GIANT_REQUIRED from kmem_suballoc().
|
123879 |
26-Dec-2003 |
alc |
- Reduce Giant's scope in vm_fault(). - Use vm_object_reference_locked() instead of vm_object_reference() in vm_fault().
|
123878 |
26-Dec-2003 |
alc |
Minor correction to revision 1.258: Use the proc pointer that is passed to vm_map_growstack() in the RLIMIT_VMEM check rather than curthread.
|
123711 |
22-Dec-2003 |
alc |
- Create an unmapped guard page to trap access to vm_page_array[-1]. This guard page would have trapped the problems with the MFC of the PAE support to RELENG_4 at an earlier point in the sequence of events.
Submitted by: tegge
|
123710 |
22-Dec-2003 |
alc |
- Significantly reduce the number of preallocated pv entries in pmap_init(). Such a large preallocation is unnecessary and wastes nearly eight megabytes of kernel virtual address space per gigabyte of managed physical memory. - Increase UMA_BOOT_PAGES by two. This enables the removal of pmap_pv_allocf(). (Note: this function was only used during initialization, specifically, after pmap_init() but before pmap_init2(). During pmap_init2(), a new allocator is installed.)
|
123697 |
21-Dec-2003 |
alc |
- Correct an error in mincore(2) that has existed since its introduction: mincore(2) should check that the page is valid, not just allocated. Otherwise, it can return a false positive for a page that is not yet resident because it is being read from disk.
|
123280 |
08-Dec-2003 |
kan |
Remove trailing whitespace.
|
123276 |
08-Dec-2003 |
alc |
Addendum to revision 1.174: In the case where vm_pager_allocate() is called to create a vnode-backed object, the vnode lock must be held by the caller.
Reported by: truckman Discussed with: kan
|
123168 |
06-Dec-2003 |
alc |
Fix a deadlock between vm_fault() and vm_mmap(): The expected lock ordering between vm_map and vnode locks is that vm_map locks are acquired first. In revision 1.150 mmap(2) was changed to pass a locked vnode into vm_mmap(). This creates a lock-order reversal when vm_mmap() calls one of the vm_map routines that acquires a vm_map lock. The solution implemented herein is to release the vnode lock in mmap() before calling vm_mmap() and reacquire this lock if necessary in vm_mmap().
Approved by: re (scottl) Reviewed by: jeff, kan, rwatson
|
123126 |
03-Dec-2003 |
jhb |
Fix all users of mp_maxid to use the same semantics, namely:
1) mp_maxid is a valid FreeBSD CPU ID in the range 0 .. MAXCPU - 1. 2) For all active CPUs in the system, PCPU_GET(cpuid) <= mp_maxid.
Approved by: re (scottl) Tested on: i386, amd64, alpha
|
123073 |
30-Nov-2003 |
jeff |
- Unbreak UP. mp_maxid is not defined on uni-processor machines, although I believe it and the other MP variables should be. For now, just define it here and wait for jhb to clean it up later.
Approved by: re (rwatson)
|
123057 |
30-Nov-2003 |
jeff |
- Replace the local maxcpu with mp_maxid. Previously, if mp_maxid was equal to MAXCPU, we would overrun the pcpu_mtx array because maxcpu was calculated incorrectly. - Add some more debugging code so that memory leaks at the time of uma_zdestroy() are more easily diagnosed.
Approved by: re (rwatson)
|
122902 |
19-Nov-2003 |
alc |
- Avoid a lock-order reversal between Giant and a system map mutex that occurs when kmem_malloc() fails to allocate a sufficient number of vm pages. Specifically, we avoid the lock-order reversal by not grabbing Giant around pmap_remove() if the map is the kmem_map.
Approved by: re (jhb) Reported by: Eugene <eugene3@web.de>
|
122748 |
15-Nov-2003 |
tjr |
In vnode_pager_input_smlfs(), call VOP_STRATEGY instead of VOP_SPECSTRATEGY on non-VCHR vnodes. This fixes a panic when reading data from files on a filesystem with a small (less than a page) block size.
PR: 59271 Reviewed by: alc
|
122680 |
14-Nov-2003 |
alc |
- Remove use of Giant from uma_zone_set_obj().
|
122651 |
14-Nov-2003 |
alc |
- Remove long dead code.
|
122646 |
14-Nov-2003 |
alc |
Changes to msync(2) - Return EBUSY if the region was wired by mlock(2) and MS_INVALIDATE is specified to msync(2). This is required by the Open Group Base Specifications Issue 6. - vm_map_sync() doesn't return KERN_FAILURE. Thus, msync(2) can't possibly return EIO. - The second major loop in vm_map_sync() handles sub maps. Thus, failing on sub maps in the first major loop isn't necessary.
|
122384 |
10-Nov-2003 |
alc |
- The Open Group Base Specifications Issue 6 specifies that munmap(2) must return EINVAL if size is zero. Submitted by: tegge - In order to avoid a race condition in multithreaded applications, the check and removal operations by munmap(2) must be in the same critical section. To accommodate this, vm_map_check_protection() is modified to require its caller to obtain at least a read lock on the map.
|
122383 |
10-Nov-2003 |
mini |
NFC: Update stale comments.
Reviewed by: alc
|
122367 |
09-Nov-2003 |
alc |
- Remove Giant from msync(2). Giant is still acquired by the lower layers if we drop into the pmap or vnode layers. - Migrate the handling of zero-length msync(2)s into vm_map_sync() so that multithread applications can't change the map between implementing the zero-length hack in msync(2) and reacquiring the map lock in vm_map_sync().
Reviewed by: tegge
|
122349 |
09-Nov-2003 |
alc |
- Rename vm_map_clean() to vm_map_sync(). This better reflects the fact that msync(2) is its only caller. - Migrate the parts of the old vm_map_clean() that examined the internals of a vm object to a new function vm_object_sync() that is implemented in vm_object.c. At the same, introduce the necessary vm object locking so that vm_map_sync() and vm_object_sync() can be called without Giant.
Reviewed by: tegge
|
122095 |
05-Nov-2003 |
alc |
- Move the implementation of OBJ_ONEMAPPING from vm_map_delete() to vm_map_entry_delete() so that all of the vm object manipulation is performed in one place.
|
122034 |
04-Nov-2003 |
marcel |
Update avail_ssize for rstacks after growing them.
|
121962 |
03-Nov-2003 |
des |
Whitespace cleanup.
|
121919 |
03-Nov-2003 |
alc |
- Increase the scope of the source object lock in vm_map_copy_entry().
|
121913 |
02-Nov-2003 |
alc |
- Increase the scope of two vm object locks in vm_object_split().
|
121907 |
02-Nov-2003 |
alc |
- Introduce and use vm_object_reference_locked(). Unlike vm_object_reference(), this function must not be used to reanimate dead vm objects. This restriction simplifies locking.
Reviewed by: tegge
|
121866 |
01-Nov-2003 |
alc |
- Increase the scope of two vm object locks in vm_object_collapse(). - Remove the acquisition and release of Giant from vm_object_coalesce().
|
121854 |
01-Nov-2003 |
alc |
- Modify swap_pager_copy() and its callers such that the source and destination objects are locked on entry and exit. Add comments to the callers noting that the locks can be released by swap_pager_copy(). - Remove several instances of GIANT_REQUIRED.
|
121844 |
01-Nov-2003 |
alc |
- Additional vm object locking in vm_object_split() - New vm object locking assertions in vm_page_insert() and vm_object_set_writeable_dirty()
|
121821 |
31-Oct-2003 |
alc |
- Revert a part of revision 1.73: Make vm_object_set_flag() an inline function. This function is so trivial that inlining reduces the size of the kernel.
|
121815 |
31-Oct-2003 |
alc |
- Take advantage of the swap pager locking: Eliminate the use of Giant from vm_object_madvise(). - Remove excessive blank lines from vm_object_madvise().
|
121786 |
31-Oct-2003 |
marcel |
Fix two bugs introduced with the rstack functionality and specific to the rstack functionality: 1. Fix a KASSERT that tests for the address to be above the upward growable stack. Typically for rstack, the faulting address can be identical to the record end of the upward growable entry, and very likely is on ia64. The KASSERT tested for greater than, not greater than or equal, so whenever the register stack had to be grown the assertion fired. 2. When we grow the upward growable stack entry and adjust the underlying object, don't forget to adjust the size of the VM map. Not doing so would trigger an assert in vm_map_zdtor().
Pointy hat: marcel (for not testing with INVARIANTS).
|
121782 |
31-Oct-2003 |
alc |
- Synchronize access to the swdevt's sw_flags with sw_dev_mtx. - Remove several instances of GIANT_REQUIRED.
|
121727 |
30-Oct-2003 |
alc |
- Synchronize access to the swdevt's sw_blist with sw_dev_mtx. - Remove several instances of GIANT_REQUIRED.
|
121725 |
30-Oct-2003 |
alc |
- Synchronize access to swdevhd using sw_dev_mtx. - Use swp_sizecheck() rather than assignment to swap_pager_full in swaponsomething().
|
121649 |
29-Oct-2003 |
alc |
- Synchronize updates to nswapdev using sw_dev_mtx.
|
121646 |
29-Oct-2003 |
alc |
- Avoid a race in swaponsomething(): Calculate the new swdevt's first and end swblk and insert this new swdevt into the list of swap devices in the same critical section.
|
121601 |
27-Oct-2003 |
alc |
- Complete the synchronization of accesses to the swblock hash table.
|
121583 |
26-Oct-2003 |
alc |
- Introduce and use a mutex synchronizing access to the swblock hash table.
|
121562 |
26-Oct-2003 |
alc |
- Simplify vm_object_collapse()'s collapse case, reducing the number of lock acquires and releases performed. - Move an assertion from vm_object_collapse() to vm_object_zdtor() because it applies to all cases of object destruction.
|
121517 |
25-Oct-2003 |
alc |
- Add some of the required vm object locking, including assertions where the vm object lock is required and already held.
|
121511 |
25-Oct-2003 |
alc |
- Align a comment within struct vm_page. - Annotate the vm_page's valid field as synchronized by the containing vm object's lock.
|
121495 |
25-Oct-2003 |
alc |
- Call vnode_pager_input_old() with the vm object locked.
|
121455 |
24-Oct-2003 |
alc |
- Push down Giant from vm_pageout() to vm_pageout_scan(), freeing vm_pageout_page_stats() from Giant. - Modify vm_pager_put_pages() and vm_pager_page_unswapped() to expect the vm object to be locked on entry. (All of the pager routines now expect this.)
|
121351 |
22-Oct-2003 |
alc |
- Retire vm_pageout_page_free(). Instead, use vm_page_select_cache() from vm_pageout_scan(). Rationale: I don't like leaving a busy page in the cache queue with neither the vm object nor the vm page queues lock held. - Assert that the page is active in vm_pageout_page_stats().
|
121321 |
22-Oct-2003 |
alc |
- Assert that every page found in the active queue is an active page.
|
121313 |
21-Oct-2003 |
alc |
- Assert that the containing vm object is locked in vm_page_set_validclean(). (This function reads and modifies the vm page's valid field, which is synchronized by the lock on the containing vm object.)
|
121288 |
20-Oct-2003 |
alc |
- Remove some long unused code.
|
121267 |
20-Oct-2003 |
alc |
- Remove comments referring to functions that no longer exist.
|
121264 |
20-Oct-2003 |
alc |
- Hold the vm object's lock around calls to vm_page_set_validclean().
|
121230 |
19-Oct-2003 |
alc |
- Synchronize access to a vm page's valid field using the containing vm object's lock. - Reduce the scope of the vm page queues lock in two places.
|
121227 |
18-Oct-2003 |
alc |
- Synchronize access to the page's valid field in vnode_pager_generic_getpages() using the containing object's lock.
|
121226 |
18-Oct-2003 |
alc |
- Increase the object lock's scope in vm_contig_launder() so that access to the object's type field and the call to vm_pageout_flush() are synchronized. - The above change allows for the eliminaton of the last parameter to vm_pageout_flush(). - Synchronize access to the page's valid field in vm_pageout_flush() using the containing object's lock.
|
121221 |
18-Oct-2003 |
alc |
Corrections to revision 1.305 - Specifying VM_MAP_WIRE_HOLESOK should not assume that the start address is the beginning of the map. Instead, move to the first entry after the start address. - The implementation of VM_MAP_WIRE_HOLESOK was incomplete. This caused the failure of mlockall(2) in some circumstances.
|
121205 |
18-Oct-2003 |
phk |
DuH!
bp->b_iooffset (the spot on the disk), not bp->b_offset (the offset in the file)
|
121199 |
18-Oct-2003 |
phk |
Initialize bp->b_offset before calling VOP_[SPEC]STRATEGY(). Remove stale comment about B_PHYS.
|
121150 |
17-Oct-2003 |
alc |
- Synchronize access to a vm page's valid field using the containing vm object's lock. - Release the vm object and vm page queues locks around vput().
|
121108 |
15-Oct-2003 |
alc |
- vm_fault_copy_entry() should not assume that the source object contains every page. If the source entry was read-only, one or more wired pages could be in backing objects. - vm_fault_copy_entry() should not set the PG_WRITEABLE flag on the page unless the destination entry is, in fact, writeable.
|
120905 |
08-Oct-2003 |
alc |
Lock the destination object in vm_fault_copy_entry().
|
120903 |
08-Oct-2003 |
alc |
Retire vm_page_copy(). Its reason for being ended when peter@ modified pmap_copy_page() et al. to accept a vm_page_t rather than a physical address. Also, this change will facilitate locking access to the vm page's valid field.
|
120837 |
06-Oct-2003 |
bms |
Only the super-user should be able to wire pages via the mlock() family of system calls at this time. Remove various #ifdef's to enforce this.
|
120831 |
06-Oct-2003 |
bms |
Move pmap_resident_count() from the MD pmap.h to the MI pmap.h. Add a definition of pmap_wired_count(). Add a definition of vmspace_wired_count().
Reviewed by: truckman Discussed with: peter
|
120824 |
05-Oct-2003 |
alc |
The addition of a locking assertion to vm_page_zero_invalid() has revealed a long-time bug: vm_pager_get_pages() assumes that m[reqpage] contains a valid page upon return from pgo_getpages(). In the case of the device pager this page has been freed and replaced by a fake page. The fake page is properly inserted into the vm object but m[reqpage] is left pointing to a freed page. For now, update m[reqpage] to point to the fake page.
Submitted by: tegge
|
120811 |
05-Oct-2003 |
bms |
Revert previous commit. Come back vslock(), all is forgiven.
Pointy hat to: bms
|
120806 |
05-Oct-2003 |
bms |
Retire vslock() and vsunlock() with extreme prejudice.
Discussed with: pete
|
120790 |
05-Oct-2003 |
alc |
Assert that the containing vm object's lock is held in vm_page_set_invalid().
|
120766 |
04-Oct-2003 |
alc |
Assert that the containing vm object's lock is held in vm_page_zero_invalid().
|
120764 |
04-Oct-2003 |
alc |
Synchronize access to a vm page's valid field using the containing vm object's lock.
|
120762 |
04-Oct-2003 |
alc |
- Extend the scope the vm object lock to cover calls to vm_page_is_valid(). - Assert that the lock on the containing vm object is held in vm_page_is_valid().
|
120761 |
04-Oct-2003 |
alc |
Synchronize access to a vm page's valid field using the containing vm object's lock.
|
120739 |
04-Oct-2003 |
jeff |
- Use the UMA_ZONE_VM flag on the fakepg and object zones to prevent vm recursion and LORs. This may be necessary for other zones created in the vm but this needs to be verified.
|
120722 |
03-Oct-2003 |
alc |
Migrate pmap_prefault() into the machine-independent virtual memory layer.
A small helper function pmap_is_prefaultable() is added. This function encapsulates the few lines of pmap_prefault() that actually vary from machine to machine. Note: pmap_is_prefaultable() and pmap_mincore() have much in common. Going forward, it's worth considering their merger.
|
120538 |
28-Sep-2003 |
alc |
In vm_page_remove(), assert that the vm object is locked, unless an Alpha. (The Alpha still requires updates to its pmap.)
|
120531 |
27-Sep-2003 |
marcel |
Part 2 of implementing rstacks: add the ability to create rstacks and use the ability on ia64 to map the register stack. The orientation of the stack (i.e. its grow direction) is passed to vm_map_stack() in the overloaded cow argument. Since the grow direction is represented by bits, it is possible and allowed to create bi-directional stacks. This is not an advertised feature, more of a side-effect.
Fix a bug in vm_map_growstack() that's specific to rstacks and which we could only find by having the ability to create rstacks: when the mapped stack ends at the faulting address, we have not actually mapped the faulting address. We need to include or cover the faulting address.
Note that at this time mmap(2) has not been extended to allow the creation of rstacks by processes. If such a need arises, this can be done.
Tested on: alpha, i386, ia64, sparc64
|
120526 |
27-Sep-2003 |
phk |
Provide a bit more help with "memory overwritten after free" style bugs.
|
120422 |
25-Sep-2003 |
peter |
Add sysentvec->sv_fixlimits() hook so that we can catch cases on 64 bit systems where the data/stack/etc limits are too big for a 32 bit process.
Move the 5 or so identical instances of ELF_RTLD_ADDR() into imgact_elf.c.
Supply an ia32_fixlimits function. Export the clip/default values to sysctl under the compat.ia32 hierarchy.
Have mmap(0, ...) respect the current p->p_limits[RLIMIT_DATA].rlim_max value rather than the sysctl tweakable variable. This allows mmap to place mappings at sensible locations when limits have been reduced.
Have the imgact_elf.c ld-elf.so.1 placement algorithm use the same method as mmap(0, ...) now does.
Note that we cannot remove all references to the sysctl tweakable maxdsiz etc variables because /etc/login.conf specifies a datasize of 'unlimited'. And that causes exec etc to fail since it can no longer find space to mmap things.
|
120389 |
23-Sep-2003 |
silby |
Adjust the kmapentzone limit so that it takes into account the size of maxproc and maxfiles, as procs, pipes, and other structures cause allocations from kmapentzone.
Submitted by: tegge
|
120371 |
23-Sep-2003 |
alc |
Change the handling of the kernel and kmem objects in vm_map_delete(): In order to use "unmanaged" pages in the kmem object, vm_map_delete() must unconditionally perform pmap_remove(). Otherwise, sparc64 has problems.
Tested by: jake
|
120326 |
22-Sep-2003 |
alc |
Initialize the page's pindex field even for VM_ALLOC_NOOBJ allocations. (This field is useful for implementing sanity checks even if the page does not belong to an object.)
|
120311 |
21-Sep-2003 |
jeff |
- Fix MD_SMALL_ALLOC on architectures that support it. Define a new alloc function, startup_alloc(), that is used for single page allocations prior to the VM starting up. If it is used after the VM starts up, it replaces the zone's allocf pointer with either page_alloc() or uma_small_alloc() where appropriate.
Pointy hat to: me Tested by: phk/amd64, me/x86
|
120305 |
20-Sep-2003 |
peter |
Bad Jeffr! No cookie!
Temporarily disable the UMA_MD_SMALL_ALLOC stuff since recent commits break sparc64, amd64, ia64 and alpha. It appears only i386 and maybe powerpc were not broken.
|
120262 |
19-Sep-2003 |
jeff |
- Remove the working-set algorithm. Instead, use the per cpu buckets as the working set cache. This has several advantages. Firstly, we never touch the per cpu queues now in the timeout handler. This removes one more reason for having per cpu locks. Secondly, it reduces the size of the zone by 8 bytes, bringing it under 200 bytes for a single proc x86 box. This tidies up other logic as well. - The 'destroy' flag no longer needs to be passed to zone_drain() since it always frees everything in the zone's slabs. - cache_drain() is now only called from zone_dtor() and so it destroys by default. It also does not need the destroy parameter now.
|
120255 |
19-Sep-2003 |
jeff |
- Remove the cache colorization code. We can't use it due to all of the broken consumers of the malloc interface who assume that the allocated address will be an even multiple of the size. - Remove disabled time delay code on uma_reclaim(). The comment there said it all. It was not an effective strategy and it should not be left in #if 0'd for all eternity.
|
120249 |
19-Sep-2003 |
jeff |
- There are an endless stream of style(9) errors in this file. Fix a few. Also catch some spelling errors.
|
120229 |
19-Sep-2003 |
jeff |
- Don't inspect the zone in page_alloc(). It may be NULL. - Don't cache more items than the zone would like in uma_zalloc_bucket().
|
120224 |
19-Sep-2003 |
jeff |
- Move the logic for dealing with the uma_boot_pages cache into the page_alloc() function from the slab_zalloc() function. This allows us to unconditionally call uz_allocf(). - In page_alloc() clean up the boot_pages logic some. Previously memory from this cache that was not used by the time the system started was left in the cache and never used. Typically this wasn't more than a few pages, but now we will use this cache so long as memory is available.
|
120223 |
19-Sep-2003 |
jeff |
- Fix the silly flag situation in UMA. Remove redundant ZFLAG/ZONE flags by accepting the user supplied flags directly. Previously this was not done so that flags for the same field would not be defined in two different files. Add comments in each header instructing future developers on how not to shoot their feet. - Fix a test for !OFFPAGE which should have been a test for HASH. This would have caused a panic if we had ever destructed a malloc zone. This also opens up the possibility that other zones could use the vsetobj() method rather than a hash.
|
120221 |
19-Sep-2003 |
jeff |
- Don't abuse M_DEVBUF, define a tag for UMA hashes.
|
120219 |
19-Sep-2003 |
jeff |
- Eliminate a pair of unnecessary variables.
|
120218 |
19-Sep-2003 |
jeff |
- Initialize a pool of bucket zones so that we waste less space on zones that don't cache as many items. - Introduce the bucket_alloc(), bucket_free() functions to wrap bucket allocation. These functions select the appropriate bucket zone to allocate from or free to. - Rename ub_ptr to ub_cnt to reflect a change in its use. ub_cnt now reflects the count of free items in the bucket. This gets rid of many unnatural subtractions by 1 throughout the code. - Add ub_entries which reflects the number of entries possibly held in a bucket.
|
120217 |
19-Sep-2003 |
alc |
Merge vm_pageout_free_page_calc() into vm_pageout(), eliminating some unneeded code.
|
120183 |
18-Sep-2003 |
alc |
Add vm object locking to vnode_pager_lock(). (This triggers the movement of a VM_OBJECT_LOCK() in vm_fault().)
|
120152 |
17-Sep-2003 |
alc |
Remove GIANT_REQUIRED from vm_object_shadow().
|
120150 |
17-Sep-2003 |
alc |
When calling vget() on a vnode-backed vm object, acquire the vnode interlock before releasing the vm object's lock.
|
120086 |
15-Sep-2003 |
alc |
Eliminate the use of Giant from vm_object_reference().
|
120050 |
14-Sep-2003 |
alc |
Call vm_page_unmanage() on pages belonging to the kmem_object. This eliminates the unnecessary overhead of managing "PV" entries for these pages.
|
120035 |
13-Sep-2003 |
alc |
There is no need for an atomic increment on the vm object's generation count in _vm_object_allocate(). (Access to the generation count is governed by the vm object's lock.) Note: the introduction of the atomic increment in revision 1.238 appears to be an accident. The purpose of that commit was to fix an Alpha-specific bug in UMA's debugging code.
|
119999 |
12-Sep-2003 |
alc |
Add a new parameter to pmap_extract_and_hold() that is needed to eliminate Giant from vmapbuf().
Idea from: tegge
|
119869 |
08-Sep-2003 |
alc |
Introduce a new pmap function, pmap_extract_and_hold(). This function atomically extracts and holds the physical page that is associated with the given pmap and virtual address. Such a function is needed to make the memory mapping optimizations used by, for example, pipes and raw disk I/O MP-safe.
Reviewed by: tegge
|
119858 |
07-Sep-2003 |
alc |
Revise the locking in mincore(2).
|
119663 |
02-Sep-2003 |
phk |
Don't open with exclusive bit, swapon(8) wants to trash our swapdev.
Add XXX comment with a rating of this concept.
|
119658 |
01-Sep-2003 |
eivind |
Change clean_map from a global to an auto variable
|
119596 |
31-Aug-2003 |
alc |
- Add vm object locking to the part of vm_pageout_scan() that launders dirty pages. - Remove some unused variables.
|
119595 |
30-Aug-2003 |
marcel |
Introduce MAP_ENTRY_GROWS_DOWN and MAP_ENTRY_GROWS_UP to allow for growable (stack) entries that not only grow down, but also grow up. Have vm_map_growstack() take these flags into account when growing an entry.
This is the first step in adding support for upward growable stacks. It is a required feature on ia64 to support the register stack (or rstack as I like to call it -- it also means reverse stack). We do not currently create rstacks, so the upward growing is not exercised and the change should be a functional no-op.
Reviewed by: alc
|
119591 |
30-Aug-2003 |
phk |
Add a close() method to a swapdev.
Add a GEOM based backend.
Remove the device/VOP_SPECSTRATEGY() based backend.
|
119590 |
30-Aug-2003 |
phk |
Protect the swapdevice tailq with a mutex.
Store the udev_t we will report to userland in the swdevt.
|
119575 |
30-Aug-2003 |
phk |
Continue the objectification of the swapdev backends:
Remove the vnode and dev_t fields and replace them with a void *.
Introduce separate strategy functions for devices and regular (NFS) vnodes.
For devices we don't need the vnode v_numoutput stuff.
Add a generic swaponsomething() function to add a swapdevice and split the remainder of swaponvp() into swaponvp() and swapondev() which calls this backend.
|
119574 |
30-Aug-2003 |
phk |
Make the strategy function a method of the individual swapdev.
|
119573 |
30-Aug-2003 |
phk |
Consistently use modern function definitions
|
119544 |
29-Aug-2003 |
marcel |
In vnode_pager_generic_putpages(), change the printf format specifier to long and explicitly cast field dirty of struct vm_page to unsigned long. When PAGE_SIZE is 32K, this field is actually unsigned long.
|
119543 |
28-Aug-2003 |
alc |
Recent pmap changes permit the use of a more precise locking assertion in vm_page_lookup().
|
119468 |
25-Aug-2003 |
marcel |
Assert that u_long is at least 64 bits if PAGE_SIZE is 32K.
Suggested by: phk
|
119373 |
23-Aug-2003 |
alc |
Held pages, just like wired pages, should not be added to the cache queues.
Submitted by: tegge
|
119370 |
23-Aug-2003 |
alc |
Hold the page queues lock when performing vm_page_clear_dirty() and vm_page_set_invalid().
|
119357 |
23-Aug-2003 |
alc |
To implement the sequential access optimization, vm_fault() may need to reacquire the "first" object's lock while a backing object's lock is held. Since this is a lock-order reversal, vm_fault() uses trylock to acquire the first object's lock, skipping the sequential access optimization in the unlikely event that the trylock fails.
|
119356 |
23-Aug-2003 |
marcel |
Also define VM_PAGE_BITS_ALL for 16K and 32K pages. Make the constant unsigned for all page sizes and unsigned long for 32K pages.
|
119354 |
23-Aug-2003 |
marcel |
Add support for 16K and 32K page sizes. The valid and dirty maps in struct vm_page are defined as u_int for 16K pages and u_long for 32K pages, with the implied assumption that long will at least be 64 bits wide on platforms where we support 32K pages.
|
119247 |
21-Aug-2003 |
alc |
Assert that the vm object's lock is held on entry to vm_page_grab(); remove code from this function that was needed when vm object locking was incomplete.
|
119186 |
20-Aug-2003 |
alc |
Assert that the vm object lock is held in vm_page_alloc().
|
119182 |
20-Aug-2003 |
bmilekic |
In sysctl_vm_zone, do not calculate per-cpu cache stats on UMA_ZFLAG_INTERNAL zones at all. Apparently, Wilko's alpha was crashing while entering multi-user because, I think, we were calculating the garbage cachefree for pcpu caches that essentially don't exist for at least the 'zones' zone and it so happened that we were reading from an unmapped location.
Confirmed to fix crash: wilko Helped debug: wilko, gallatin
|
119092 |
18-Aug-2003 |
phk |
Replace a homegrown bdone()/bwait() implementation by the real thing
|
119059 |
18-Aug-2003 |
alc |
Three unrelated changes to vm_proc_new(): (1) add vm object locking on the U pages object; (2) reorganize such that the U pages object is created and filled in one block; and (3) remove an unnecessary clearing of PG_ZERO.
|
119045 |
17-Aug-2003 |
phk |
Use NULL for 3rd argument of VOP_BMAP() rather than custom cast. Eliminate unused variable.
|
119004 |
16-Aug-2003 |
marcel |
In vm_thread_swap{in|out}(), remove the alpha specific conditional compilation and replace it with a call to cpu_thread_swap{in|out}(). This allows us to add similar code on ia64 without cluttering the code even more.
|
118946 |
15-Aug-2003 |
phk |
Eliminate unnecessary udev_t variable: we can derive it from the dev_t when we need it.
|
118945 |
15-Aug-2003 |
phk |
Make swaponvp() static to the swap_pager.
|
118931 |
15-Aug-2003 |
alc |
Extend the scope of the page queues lock in vm_pageout_scan() to cover the traversal of the PQ_INACTIVE queue.
|
118878 |
13-Aug-2003 |
alc |
Remove GIANT_REQUIRED from vmspace_alloc().
|
118852 |
13-Aug-2003 |
alc |
Reduce the size of the vm map (and by inclusion the vm space) on 64-bit architectures by moving a field within the structure.
|
118848 |
12-Aug-2003 |
imp |
Expand inline the relevant parts of src/COPYRIGHT for Matt Dillon's copyrighted files.
Approved by: Matt Dillon
|
118838 |
12-Aug-2003 |
alc |
Reduce the size of the vm object on 64-bit architectures by moving a field within the structure.
|
118795 |
11-Aug-2003 |
bmilekic |
- When deciding whether to init the zone with small_init or large_init, compare the zone element size (+1 for the byte of linkage) against UMA_SLAB_SIZE - sizeof(struct uma_slab), and not just UMA_SLAB_SIZE. Add a KASSERT in zone_small_init to make sure that the computed ipers (items per slab) for the zone is not zero, despite the addition of the check, just to be sure (this part submitted by: silby)
- UMA_ZONE_VM used to imply BUCKETCACHE. Now it implies CACHEONLY instead. CACHEONLY is like BUCKETCACHE in the case of bucket allocations, but in addition to that also ensures that we don't setup the zone with OFFPAGE slab headers allocated from the slabzone. This means that we're not allowed to have a UMA_ZONE_VM zone initialized for large items (zone_large_init) because it would require the slab headers to be allocated from slabzone, and hence kmem_map. Some of the zones init'd with UMA_ZONE_VM are so init'd before kmem_map is suballoc'd from kernel_map, which is why this change is necessary.
|
118771 |
11-Aug-2003 |
bms |
Add the mlockall() and munlockall() system calls. - All those diffs to syscalls.master for each architecture *are* necessary. This needed clarification; the stub code generation for mlockall() was disabled, which would prevent applications from linking to this API (suggested by mux) - Giant has been quashed. It is no longer held by the code, as the required locking has been pushed down within vm_map.c. - Callers must specify VM_MAP_WIRE_HOLESOK or VM_MAP_WIRE_NOHOLES to express their intention explicitly. - Inspected at the vmstat, top and vm pager sysctl stats level. Paging-in activity is occurring correctly, using a test harness. - The RES size for a process may appear to be greater than its SIZE. This is believed to be due to mappings of the same shared library page being wired twice. Further exploration is needed. - Believed to back out of allocations and locks correctly (tested with WITNESS, MUTEX_PROFILING, INVARIANTS and DIAGNOSTIC).
PR: kern/43426, standards/54223 Reviewed by: jake, alc Approved by: jake (mentor) MFC after: 2 weeks
|
118764 |
11-Aug-2003 |
silby |
More pipe changes:
From alc: Move pageable pipe memory to a separate kernel submap to avoid awkward vm map interlocking issues. (Bad explanation provided by me.)
From me: Rework pipespace accounting code to handle this new layout, and adjust our default values to account for the fact that we now have a solid limit on allocations.
Also, remove the "maxpipes" limit, as it no longer has a purpose. (The limit on kva usage solves the problem of having too many pipes.)
|
118544 |
06-Aug-2003 |
phk |
Make the first two pages magic to protect the BSD labels rather than only one.
|
118537 |
06-Aug-2003 |
phk |
Remove an unused variable.
|
118536 |
06-Aug-2003 |
phk |
Staticize swap_pager_putpages()
Eliminate a lot of checks to make sure requests are not cross-device, which is unnecessary with the new layout. We know a sequential request cannot possibly be cross-device because there is a reserved page between the devices.
Remove a couple of comments which no longer are relevant.
|
118535 |
06-Aug-2003 |
phk |
Access the swap_pagers' ->putpages() through swappagerops instead of directly, this is a cleaner way to do it.
|
118528 |
06-Aug-2003 |
phk |
Add XXX: comment to vm_pager_unswapped().
|
118527 |
06-Aug-2003 |
phk |
Explicitly set B_PAGING
|
118521 |
06-Aug-2003 |
phk |
Rip out the totally bogus vnode swapdev_vp with extreme prejudice.
Don't mark buffers with B_KEEPGIANT, we don't drop giant in strategy at this point in time.
|
118468 |
05-Aug-2003 |
phk |
Use sparse struct initialization for struct pagerops.
Mark our buffers B_KEEPGIANT before sending them downstream.
Remove swap_pager_strategy implementation.
|
118466 |
05-Aug-2003 |
phk |
Use sparse struct initializations for struct pagerops.
This makes grepping for which pagers implement which methods easier.
|
118418 |
04-Aug-2003 |
phk |
Put an uncovered page between the swap devices; that way we can be sure not to get any cross-device I/O requests. (The unallocated first page protecting BSD labels already gave us this, but that hack may go away at some point in time).
Remove the check for cross-device I/O requests in swap_pager_strategy.
Move the repeated statistics updating into flushchainbuf().
|
118413 |
04-Aug-2003 |
alc |
Use kmem_alloc_nofault() instead of kmem_alloc_pageable() to allocate swapbkva. Swapbkva mappings are explicitly managed using pmap_qenter(), not on-demand by vm_fault(), making kmem_alloc_nofault() more appropriate.
Submitted by: tegge
|
118398 |
03-Aug-2003 |
phk |
Name swap_pager_find_dev() more correctly swp_pager_find_dev().
Use ->bio_children to count child buffers, rather than abuse the bio_caller1 pointer.
Expand the relevant bits of waitchainbuf() inline, this clarifies the code a little bit.
|
118392 |
03-Aug-2003 |
phk |
I accidentally hit undo before committing, fix the resulting off-by-one.
|
118390 |
03-Aug-2003 |
phk |
Change the layout policy of the swap_pager from a hardcoded width striping to a per device round-robin algorithm.
Because of the policy of not attempting to retain previous swap allocation on page-out, this means that a newly added swap device almost instantly takes its 1/N share of the I/O load but it takes somewhat longer for it to assume its 1/N share of the pages if there is plenty of space on the other devices.
Change the 8G total swapspace limitation to 8G per device instead by using a per device blist rather than one global blist. This reduces the memory footprint by 75% (typically a couple hundred kilobytes) for the common case with one swapdevice but NSWAPDEV=4.
Remove the compile time constant limit of number of swap devices, there is no limit now. Instead of a fixed size array, store the per swapdev structure in a TAILQ.
Total swap space is still addressed by a 32 bit page number and therefore the upper limit is now 2^42 bytes = 16TB (for i386).
We still do not allocate the first page of each device in order to give some amount of protection to any bsdlabel at the start of the device.
A new device is appended after the existing devices in the swap space, no attempt is made to fill in holes left behind by swapoff (this can trivially be changed should it ever become a problem).
The sysctl vm.nswapdev now reflects the number of currently configured swap devices.
Rename vm_swap_size to swap_pager_avail for consistency with other exported names.
Change argument type for vm_proc_swapin_all() and swap_pager_isswapped() to be a struct swdevt pointer rather than an index.
Not changed: we are still using blists to manage the free space, but since the swapspace is no longer fragmented by the striping different resource managers might fare better.
|
118384 |
03-Aug-2003 |
phk |
Move extern declaration of the various pagerops from vm_pager.c to vm_pager.h where the various pagers will also see them.
|
118380 |
03-Aug-2003 |
alc |
Revise obj_alloc(). Most notably, use the object's lock to prevent two concurrent invocations from acquiring the same address(es). Also, in case of an incomplete allocation, free any allocated pages.
In collaboration with: tegge
|
118369 |
02-Aug-2003 |
bmilekic |
When INVARIANTS is on and we're in uma_zfree_arg(), we need to make sure that uma_dbg_free() is called if we're about to call uma_zfree_internal() but we're asking it to skip the dtor and uma_dbg_free() call itself. So, if we're about to call uma_zfree_internal() from uma_zfree_arg() and skip == 1, call uma_dbg_free() ourselves.
|
118317 |
01-Aug-2003 |
alc |
Update the comment at the head of kmem_alloc_nofault() to describe its purpose and use.
|
118315 |
01-Aug-2003 |
bmilekic |
Only free the pcpu cache buckets if they are non-NULL.
Crashed this person's machine: harti Pointy-hat to: me
|
118286 |
31-Jul-2003 |
phk |
Remove unused stuff.
Move used stuff to swap_pager.c where it belongs.
This file no longer exports anything to userland.
|
118234 |
31-Jul-2003 |
peter |
Add #include "opt_kstack_pages.h" and "opt_kstack_max_pages.h" to remain in sync with the backend machdep code. When cpu_thread_init() does not have the same idea of KSTACK_PAGES as the thing that created the kstack, all hell breaks loose.
Bad alc! no cookie! :-)
|
118221 |
30-Jul-2003 |
bmilekic |
Plug a race and a leak in UMA.
1) The race has to do with zone destruction. From the zone destructor we would lock the zone, set the working set size to 0, then unlock the zone, drain it, and then free the structure. Within the window following the working-set-size set to 0 and unlocking of the zone and the point where in zone_drain we re-acquire the zone lock, the uma timer routine could have fired off and changed the working set size to something non-zero, thereby potentially preventing us from completely freeing slabs before destroying the zone (and thus leaking them).
2) The leak has to do with zone destruction as well. When destroying a zone we would take care to free all the buckets cached in the zone, but although we would drain the pcpu cache buckets, we would not free them. This resulted in leaking a couple of bucket structures (512 bytes each) per cpu on SMP during zone destruction.
While I'm here, also silence GCC warnings by turning uma_slab_alloc() from inline to real function. It's too big to be an inline.
Reviewed by: JeffR
|
118212 |
30-Jul-2003 |
bmilekic |
When generating the zone stats make sure to handle the master zone ("UMA Zone") carefully, because it does not have pcpu caches allocated at all. In the UP case, we did not catch this because one pcpu cache is always allocated with the zone, but for the MP case, we were getting bogus stats for this zone.
Tested by: Lukas Ertl <le@univie.ac.at>
|
118201 |
30-Jul-2003 |
phk |
Remove the disabling of buckets workaround.
Thanks to: jeffr
|
118190 |
30-Jul-2003 |
jeff |
- Get rid of the ill-conceived uz_cachefree member of uma_zone. - In sysctl_vm_zone use the per cpu locks to read the current cache statistics; this makes them more accurate while under heavy load.
Submitted by: tegge
|
118189 |
30-Jul-2003 |
jeff |
- Check to see if we need a slab prior to allocating one. Failure to do so not only wastes memory but it can also cause a leak in zones that will be destroyed later. The problem is that the slab allocation code places newly created slabs on the partially allocated list because it assumes that the caller will actually allocate some memory from it. Failure to do so places an otherwise free slab on the partial slab list where we won't find it later in zone_drain().
Continuously prodded to fix by: phk (Thanks)
|
118187 |
29-Jul-2003 |
phk |
Temporary workaround: Always disable buckets, there is a bug there somewhere.
JeffR will look at this as soon as he has time.
OK'ed by: jeffr
|
118104 |
28-Jul-2003 |
alc |
None of the "alloc" functions used by UMA assume that Giant is held any longer. (If they still need it, e.g., contigmalloc(), they acquire it themselves.) Therefore, we need not acquire Giant in slab_zalloc().
|
118096 |
27-Jul-2003 |
alc |
Remove GIANT_REQUIRED from kmem_alloc().
|
118076 |
27-Jul-2003 |
mux |
Use pmap_zero_page() to zero pages instead of bzero() because they haven't been vm_map_wire()'d yet.
|
118074 |
27-Jul-2003 |
alc |
Allow vm_object_reference() on kernel_object without Giant.
|
118071 |
26-Jul-2003 |
alc |
Acquire Giant rather than asserting it is held in contigmalloc(). This is a prerequisite to removing further uses of Giant from UMA.
|
118047 |
26-Jul-2003 |
phk |
Add a "int fd" argument to VOP_OPEN() which in the future will contain the filedescriptor number on opens from userland.
The index is used rather than a "struct file *" since it conveys a bit more information, which may be useful, in particular, to fdescfs and /dev/fd/*
For now pass -1 all over the place.
|
118040 |
26-Jul-2003 |
alc |
Gulp ... call kmem_malloc() without Giant.
|
118029 |
25-Jul-2003 |
mux |
Add support for the M_ZERO flag to contigmalloc().
Reviewed by: jeff
|
117903 |
22-Jul-2003 |
phk |
Remove all but one of the inlines here, this reduces the code size by 2032 bytes and has no measurable impact on performance.
|
117876 |
22-Jul-2003 |
phk |
Don't inline very large functions.
Gcc has silently not been doing this for a long time.
|
117866 |
22-Jul-2003 |
peter |
swp_pager_hash() was called before it was instantiated inline. This made gcc (quite rightly) unhappy. Move it earlier.
|
117747 |
18-Jul-2003 |
phk |
Fix a printf format warning I introduced. Use the macro max number of swap devices rather than cache the constant in a variable. Avoid a (now) pointless variable.
|
117736 |
18-Jul-2003 |
harti |
When INVARIANTS is defined make sure that uma_zalloc_arg (and hence uma_zalloc) is called with exactly one of either M_WAITOK or M_NOWAIT and that it is called with neither M_TRYWAIT nor M_DONTWAIT. Print a warning if anything is wrong. Default to M_WAITOK if no flag is given. This is the same test as in malloc(9).
|
117725 |
18-Jul-2003 |
phk |
If a proposed swap device exceeds the 8G artificial limit which our radix-tree code imposes, truncate the device instead of rejecting it.
|
117724 |
18-Jul-2003 |
phk |
Move the implementation of the vmspace_swap_count() (used only in the "toss the largest process" emergency handling) from vm_map.c to swap_pager.c.
The quantity calculated depends strongly on the internals of the swap_pager and by moving it, we no longer need to expose the internal metrics of the swap_pager to the world.
|
117723 |
18-Jul-2003 |
phk |
Add a new function swap_pager_status() which reports the total size of the paging space and how much of it is in use (in pages).
Use this interface from the Linuxolator instead of groping around in the internals of the swap_pager.
|
117722 |
18-Jul-2003 |
phk |
Merge swap_pager.c and vm_swap.c into swap_pager.c, the separation is not natural and needlessly exposes a lot of dirty laundry.
Move private interfaces between the two from swap_pager.h to swap_pager.c and staticize as much as possible.
No functional change.
|
117702 |
17-Jul-2003 |
phk |
Make sure that SWP_NPAGES always has the same value in all source files, so that SWAP_META_PAGES does not vary either.
swap_pager.c ended up with a value of 16, everybody else 8. Go with the 16 for now.
This should only have any effect in the "kill processes because we are out of swap" scenario, where it will make some sort of estimate of something more precise.
|
117519 |
13-Jul-2003 |
robert |
Avoid an unnecessary calculation: there is no need to subtract `firstaddr' from `v' if we know that the former equals zero.
|
117303 |
07-Jul-2003 |
alc |
- Complete the vm object locking in vm_pageout_object_deactivate_pages(). - Change vm_pageout_object_deactivate_pages()'s first parameter from a vm_map_t to a pmap_t. - Change vm_pageout_object_deactivate_pages()'s and vm_pageout_map_deactivate_pages()'s last parameter from a vm_pindex_t to a long. Since the number of pages in an address space doesn't require 64 bits on an i386, vm_pindex_t is overkill.
|
117262 |
05-Jul-2003 |
alc |
Lock a vm object when freeing a page from it.
|
117224 |
04-Jul-2003 |
phk |
Remove unnecessary cast.
|
117206 |
03-Jul-2003 |
alc |
Background: pmap_object_init_pt() premaps the pages of a object in order to avoid the overhead of later page faults. In general, it implements two cases: one for vnode-backed objects and one for device-backed objects. Only the device-backed case is really machine-dependent, belonging in the pmap.
This commit moves the vnode-backed case into the (relatively) new function vm_map_pmap_enter(). On amd64 and i386, this commit only amounts to code rearrangement. On alpha and ia64, the new machine independent (MI) implementation of the vnode case is smaller and more efficient than their pmap-based implementations. (The MI implementation takes advantage of the fact that objects in -CURRENT are ordered collections of pages.) On sparc64, pmap_object_init_pt() hadn't (yet) been implemented.
|
117143 |
02-Jul-2003 |
mux |
Fix a few style(9) nits.
|
117094 |
01-Jul-2003 |
alc |
Modify vm_page_alloc() and vm_page_select_cache() to allow the page that is returned by vm_page_select_cache() to belong to the object that is already locked by the caller to vm_page_alloc().
|
117093 |
01-Jul-2003 |
alc |
Check the address provided to vm_map_stack() against the vm map's maximum, returning an error if the address is too high.
|
117047 |
29-Jun-2003 |
alc |
Introduce vm_map_pmap_enter(). Presently, this is a stub calling the MD pmap_object_init_pt().
|
117045 |
29-Jun-2003 |
alc |
- Export pmap_enter_quick() to the MI VM. This will permit the implementation of a largely MI pmap_object_init_pt() for vnode-backed objects. pmap_enter_quick() is implemented via pmap_enter() on sparc64 and powerpc. - Correct a mismatch between pmap_object_init_pt()'s prototype and its various implementations. (I plan to keep pmap_object_init_pt() as the MD hook for device-backed objects on i386 and amd64.) - Correct an error in ia64's pmap_enter_quick() and adjust its interface to match the other versions. Discussed with: marcel
|
117038 |
29-Jun-2003 |
alc |
Add vm object locking to vm_pageout_map_deactivate_pages().
|
117004 |
28-Jun-2003 |
alc |
Remove GIANT_REQUIRED from kmem_malloc().
|
117001 |
28-Jun-2003 |
alc |
- Add vm object locking to vm_pageout_clean().
|
116959 |
28-Jun-2003 |
alc |
- Use an int rather than a vm_pindex_t to represent the desired page color in vm_page_alloc(). (This also has small performance benefits.) - Eliminate vm_page_select_free(); vm_page_alloc() might as well call vm_pageq_find() directly.
|
116923 |
27-Jun-2003 |
alc |
Simple read-modify-write operations on a vm object's flags, ref_count, and shadow_count can now rely on its mutex for synchronization. Remove one use of Giant from vm_map_insert().
|
116885 |
26-Jun-2003 |
alc |
vm_page_select_cache() enforces a number of conditions on the returned page. Add the ability to lock the containing object to those conditions.
|
116860 |
26-Jun-2003 |
alc |
Modify vm_pageq_requeue() to handle a PQ_NONE page without dereferencing a NULL pointer; remove some now unused code.
|
116837 |
25-Jun-2003 |
bmilekic |
Move the pcpu lock out of the uma_cache and instead have a single set of pcpu locks. This makes uma_zone somewhat smaller (by (LOCKNAME_LEN * sizeof(char) + sizeof(struct mtx) * maxcpu) bytes, to be exact).
No Objections from jeff.
|
116829 |
25-Jun-2003 |
bmilekic |
Make sure that the zone destructor doesn't get called twice in certain free paths.
|
116799 |
25-Jun-2003 |
alc |
Remove a GIANT_REQUIRED on the kernel object that we no longer need.
|
116798 |
25-Jun-2003 |
alc |
Maintain the lock on a vm object when calling vm_page_grab().
|
116793 |
24-Jun-2003 |
alc |
Assert that the vm object is locked on entry to dev_pager_getpages().
|
116710 |
23-Jun-2003 |
alc |
Assert that the vm object is locked on entry to vm_pager_get_pages().
|
116695 |
22-Jun-2003 |
alc |
Maintain a lock on the vm object of interest throughout vm_fault(), releasing the lock only if we are about to sleep (e.g., vm_pager_get_pages() or vm_pager_has_pages()). If we sleep, we have marked the vm object with the paging-in-progress flag.
|
116678 |
22-Jun-2003 |
phk |
Add a f_vnode field to struct file.
Several of the subtypes have an associated vnode which is used for stuff like the f*() functions.
By giving the vnode a separate field, a number of checks for the specific subtype can be replaced simply with a check for f_vnode != NULL, and we can later free f_data up to subtype specific use.
At this point in time, f_data still points to the vnode, so any code I might have overlooked will still work.
|
116667 |
22-Jun-2003 |
alc |
As vm_fault() descends the chain of backing objects, set paging-in- progress on the next object before clearing it on the current object.
|
116662 |
22-Jun-2003 |
alc |
Complete the vm object locking in vm_object_backing_scan(); specifically, deal with the case where we need to sleep on a busy page with two vm object locks held.
|
116658 |
22-Jun-2003 |
alc |
Make some style and white-space changes to the copy-on-write path through vm_fault(); remove a pointless assignment statement from that path.
|
116653 |
21-Jun-2003 |
phk |
Use a do {...} while (0); and a couple of breaks to reduce the level of indentation a bit.
|
116650 |
21-Jun-2003 |
alc |
Lock one of the vm objects involved in an optimized copy-on-write fault.
|
116645 |
21-Jun-2003 |
alc |
- Increase the scope of the vm object lock in vm_object_collapse(). - Assert that the vm object and its backing vm object are both locked in vm_object_qcollapse().
|
116629 |
20-Jun-2003 |
alc |
Make swap_pager_haspages() static; remove unused function prototypes.
|
116605 |
20-Jun-2003 |
phk |
Initialize b_saveaddr when we hand out pbufs
|
116596 |
20-Jun-2003 |
alc |
The so-called "optimized copy-on-write fault" case should not require the vm map lock. What's really needed is vm object locking, which is (for the moment) provided by Giant.
Reviewed by: tegge
|
116554 |
19-Jun-2003 |
alc |
Assert that the vm object is locked in vm_page_try_to_free().
|
116552 |
19-Jun-2003 |
alc |
Fix a vm object reference leak in the page-based copy-on-write mechanism used by the zero-copy sockets implementation.
Reviewed by: gallatin
|
116512 |
18-Jun-2003 |
alc |
Lock the vm object when freeing a vm page.
|
116437 |
16-Jun-2003 |
phk |
This file was ignored by CVS in my last commit for some reason:
Remove pointless initialization of b_spc field, which now no longer exists.
|
116412 |
15-Jun-2003 |
phk |
Add the same KASSERT to all VOP_STRATEGY and VOP_SPECSTRATEGY implementations to check that the buffer points to the correct vnode.
|
116387 |
15-Jun-2003 |
alc |
Remove an unnecessary forward declaration.
|
116359 |
15-Jun-2003 |
alc |
Use #ifdef __alpha__, not __alpha.
|
116355 |
14-Jun-2003 |
alc |
Migrate the thread stack management functions from the machine-dependent to the machine-independent parts of the VM. At the same time, this introduces vm object locking for the non-i386 platforms.
Two details:
1. KSTACK_GUARD has been removed in favor of KSTACK_GUARD_PAGES. The different machine-dependent implementations used various combinations of KSTACK_GUARD and KSTACK_GUARD_PAGES. To disable guard pages, set KSTACK_GUARD_PAGES to 0.
2. Remove the (unnecessary) clearing of PG_ZERO in vm_thread_new. In 5.x (but not 4.x), PG_ZERO can only be set if VM_ALLOC_ZERO is passed to vm_page_alloc() or vm_page_grab().
|
116328 |
14-Jun-2003 |
alc |
Move the *_new_altkstack() and *_dispose_altkstack() functions out of the various pmap implementations into the machine-independent vm. They were all identical.
|
116280 |
13-Jun-2003 |
alc |
Extend the scope of the vm object lock in swp_pager_async_iodone() to cover a vm_page_free().
|
116279 |
13-Jun-2003 |
alc |
Add vm object locking to various pagers' "get pages" methods, i386 stack management functions, and a u area management function.
|
116226 |
11-Jun-2003 |
obrien |
Use __FBSDID().
|
116188 |
11-Jun-2003 |
peter |
GC unused cpu_wait() function
|
116167 |
10-Jun-2003 |
alc |
- Finish vm object and page locking in vnode_pager_setsize(). - Make some small style changes to vnode_pager_setsize(); most notably, move two comments to a more logical place.
|
116131 |
09-Jun-2003 |
phk |
Revert last commit, I have no idea what happened.
|
116117 |
09-Jun-2003 |
phk |
A white-space nit I noticed.
|
116080 |
09-Jun-2003 |
alc |
Hold the vm object's lock when performing vm_page_lookup().
|
116079 |
09-Jun-2003 |
alc |
Don't use vm_object_set_flag() to initialize the vm object's flags.
|
116067 |
08-Jun-2003 |
alc |
- Properly handle the paging_in_progress case on two vm objects in vm_object_deallocate(). - Remove vm_object_pip_sleep().
|
115997 |
07-Jun-2003 |
alc |
Lock the kernel object in kmem_alloc().
|
115996 |
07-Jun-2003 |
alc |
Teach vm_page_grab() how to handle the vm object's lock.
|
115987 |
07-Jun-2003 |
alc |
Assert that the vm object is locked on entry to swap_pager_freespace().
|
115931 |
07-Jun-2003 |
alc |
Pass the vm object to vm_object_collapse() with its lock held.
|
115883 |
05-Jun-2003 |
phk |
Fix NFS file swapping, I broke it 3 months ago it seems.
|
115879 |
05-Jun-2003 |
alc |
- Extend the scope of the backing object's lock in vm_object_collapse().
|
115856 |
04-Jun-2003 |
alc |
- Add further vm object locking to vm_object_deallocate(), specifically, for accessing a vm object's shadows.
|
115853 |
04-Jun-2003 |
alc |
- Add VM_OBJECT_TRYLOCK().
|
115818 |
04-Jun-2003 |
alc |
- Add vm object locking to vm_object_deallocate(). (Still more changes are required.) - Remove special-case macros for kmem object locking. They are no longer used.
|
115782 |
03-Jun-2003 |
alc |
Add vm object locking to vm_object_coalesce().
|
115655 |
01-Jun-2003 |
alc |
Change kernel_object and kmem_object to (&kernel_object_store) and (&kmem_object_store), respectively. This allows the address of these objects to be resolved at link-time rather than run-time.
|
115523 |
31-May-2003 |
phk |
Prepend _ to internal union members to avoid ambiguity.
Found by: FlexeLint
|
115522 |
31-May-2003 |
phk |
Remove unused variables
Found by: FlexeLint
|
115516 |
31-May-2003 |
alc |
Add vm object locking to vm_object_madvise().
|
115146 |
19-May-2003 |
das |
If we seem to be out of VM, don't allow the pagedaemon to kill processes in the first pass. Among other things, this will give us a chance to launder vnode-backed pages before concluding that we need more swap. This is particularly useful for systems that have no swap.
While here, update a comment and remove some long-unused code.
Reported by: Lucky Green <shamrock@cypherpunks.to> Suggested by: dillon Approved by: re (rwatson)
|
115127 |
18-May-2003 |
alc |
Reduce the size of a vm object by converting its shadow list from a TAILQ to a LIST.
Approved by: re (rwatson)
|
114983 |
13-May-2003 |
jhb |
- Merge struct procsig with struct sigacts. - Move struct sigacts out of the u-area and malloc() it using the M_SUBPROC malloc bucket. - Add a small sigacts_*() API for managing sigacts structures: sigacts_alloc(), sigacts_free(), sigacts_copy(), sigacts_share(), and sigacts_shared(). - Remove the p_sigignore, p_sigacts, and p_sigcatch macros. - Add a mutex to struct sigacts that protects all the members of the struct. - Add sigacts locking. - Remove Giant from nosys(), kill(), killpg(), and kern_sigaction() now that sigacts is locked. - Several in-kernel functions such as psignal(), tdsignal(), trapsignal(), and thread_stopped() are now MP safe.
Reviewed by: arch@ Approved by: re (rwatson)
|
114850 |
09-May-2003 |
alc |
Give the kmem object's mutex a unique name, instead of "vm object", to avoid false reports of lock-order reversal with a system map mutex.
Approved by: re (jhb)
|
114774 |
06-May-2003 |
alc |
Lock the vm_object when performing vm_pager_deallocate().
|
114669 |
04-May-2003 |
alc |
Extend the scope of the vm_object lock in vm_object_terminate().
|
114649 |
04-May-2003 |
alc |
Avoid a lock-order reversal and implement vm_object locking in vm_pageout_page_free().
|
114599 |
03-May-2003 |
alc |
Lock the vm_object on entry to vm_object_vndeallocate().
|
114570 |
03-May-2003 |
alc |
- Revert kern/vfs_subr.c revision 1.444. The vm_object's size isn't trustworthy for vnode-backed objects. - Restore the old behavior of vm_object_page_remove() when the end of the given range is zero. Add a comment to vm_object_page_remove() regarding this behavior.
Reported by: iedowse
|
114564 |
03-May-2003 |
alc |
Move a declaration to its proper place.
|
114489 |
02-May-2003 |
alc |
Lock the vm_object when updating its shadow list.
|
114487 |
02-May-2003 |
alc |
Simplify the removal of a shadow object in vm_object_collapse().
|
114387 |
01-May-2003 |
alc |
Extend the scope of the vm_object locking in vm_object_split().
|
114372 |
01-May-2003 |
alc |
- Update the vm_object locking in vm_object_reference(). - Convert some dead code in vm_object_reference() into a comment.
|
114317 |
30-Apr-2003 |
alc |
Increase the scope of the vm_object lock in vm_map_delete().
|
114273 |
30-Apr-2003 |
alc |
Eliminate an unused parameter from vm_pageout_object_deactivate_pages().
|
114263 |
30-Apr-2003 |
alc |
Add vm_object locking to vmspace_swap_count().
|
114245 |
29-Apr-2003 |
alc |
Remove unused declarations and definitions.
|
114216 |
29-Apr-2003 |
kan |
Deprecate machine/limits.h in favor of new sys/limits.h. Change all in-tree consumers to include <sys/limits.h>
Discussed on: standards@ Partially submitted by: Craig Rodrigues <rodrigc@attbi.com>
|
114166 |
28-Apr-2003 |
alc |
- Lock the vm_object when performing swap_pager_isswapped(). - Assert that the vm_object is locked in swap_pager_isswapped().
|
114149 |
28-Apr-2003 |
alc |
uma_zone_set_obj() must perform VM_OBJECT_LOCK_INIT() if the caller provides storage for the vm_object.
|
114145 |
28-Apr-2003 |
alc |
- Define VM_OBJECT_LOCK_INIT(). - Avoid repeatedly mtx_init()ing and mtx_destroy()ing the vm_object's lock using UMA's uminit callback, in this case, vm_object_zinit().
|
114128 |
27-Apr-2003 |
alc |
- Tell witness that holding two or more vm_object locks is okay. - In vm_object_deallocate(), lock the child when removing the parent from the child's shadow list.
|
114112 |
27-Apr-2003 |
alc |
Various changes to vm_object_shadow(): (1) update the vm_object locking, (2) remove a pointless assertion, and (3) make a trivial change to a comment.
|
114091 |
26-Apr-2003 |
alc |
Various changes to vm_object_page_remove(): - Eliminate an odd, special-case feature: if start == end == 0 then all pages are removed. Only one caller used this feature and that caller can trivially pass the object's size. - Assert that the vm_object is locked on entry; don't bother testing for a NULL vm_object. - Style: Fix lines that are longer than 80 characters.
|
114080 |
26-Apr-2003 |
alc |
- Lock the vm_object on entry to vm_object_terminate().
|
114074 |
26-Apr-2003 |
alc |
- Convert vm_object_pip_wait() from using tsleep() to msleep(). - Make vm_object_pip_sleep() static. - Lock the vm_object when performing vm_object_pip_wait().
|
114053 |
26-Apr-2003 |
alc |
- Extend the scope of two existing vm_object locks to cover swap_pager_freespace().
|
114052 |
26-Apr-2003 |
alc |
Remove an XXX comment. It is no longer a problem.
|
114030 |
25-Apr-2003 |
jhb |
- Don't bother using the proc lock to test just P_SYSTEM as that is set in fork1() and never changes. - The proc lock is enough to cover reading p_state, so push down sched_lock into the PRS_NORMAL case of the switch on p_state.
|
114019 |
25-Apr-2003 |
alc |
- Lock the vm_object when iterating over its list of resident pages.
|
114003 |
25-Apr-2003 |
alc |
- Relax the Giant required in vm_page_remove(). - Remove the Giant required from vm_page_free_toq(). (Any locking errors will be caught by vm_page_remove().)
This remedies a panic that occurred when kmem_malloc(NOWAIT) performed without Giant failed to allocate the necessary pages.
Reported by: phk
|
113956 |
24-Apr-2003 |
alc |
- Move swap_pager_isswapped()'s prototype to a more logical place.
|
113955 |
24-Apr-2003 |
alc |
- Acquire the vm_object's lock when performing vm_object_page_clean(). - Add a parameter to vm_pageout_flush() that tells vm_pageout_flush() whether its caller has locked the vm_object. (This is a temporary measure to bootstrap vm_object locking.)
|
113918 |
23-Apr-2003 |
jhb |
Fix compiling in the NO_SWAPPING case.
Submitted by: bde (partially)
|
113869 |
22-Apr-2003 |
jhb |
Lock the proc to check p_flag and several other related tests in vm_daemon(). We don't need to hold sched_lock as long now as a result.
|
113868 |
22-Apr-2003 |
jhb |
Prefer the proc lock to sched_lock when testing PS_INMEM now that it is safe to do so.
|
113867 |
22-Apr-2003 |
jhb |
- Always call faultin() in _PHOLD() if PS_INMEM is clear. This closes a race where a thread could assume that a process was swapped in by PHOLD() when it actually wasn't fully swapped in yet. - In faultin(), always msleep() if PS_SWAPPINGIN is set instead of doing this check after bumping p_lock in the PS_INMEM == 0 case. Also, sched_lock is only needed for setting and clearing swapping PS_* flags and the swap thread inhibitor. - Don't set and clear the thread swap inhibitor in the same loops as the pmap_swapin/out_thread() since we have to do it under sched_lock. Instead, mimic the treatment of the PS_INMEM flag and use separate loops to set the inhibitors when clearing PS_INMEM and clear the inhibitors when setting PS_INMEM. - swapout() now returns with the proc lock held as it holds the lock while adjusting the swapping-related PS_* flags so that the proc lock can be used to test those flags. - Only use the proc lock to check the swapping-related PS_* flags in several places. - faultin() no longer requires sched_lock to be held by callers. - Rename PS_SWAPPING to PS_SWAPPINGOUT to be less ambiguous now that we have PS_SWAPPINGIN.
|
113856 |
22-Apr-2003 |
alc |
Revision 1.246 should have also included
- Weaken the assertion in vm_page_insert() to require Giant only if the vm_object isn't locked.
Reported by: "Ilmar S. Habibulin" <ilmar@watson.org>
|
113842 |
22-Apr-2003 |
alc |
Remove unused declarations.
|
113841 |
22-Apr-2003 |
alc |
Revision 1.52 of vm/uma_core.c has led to UMA's obj_alloc() being called without Giant; and obj_alloc() in turn calls vm_page_alloc() without Giant. This causes an assertion failure in vm_page_alloc(). Fortunately, obj_alloc() is now MPSAFE. So, we need only clean up some assertions.
- Weaken the assertion in vm_page_lookup() to require Giant only if the vm_object isn't locked. - Remove an assertion from vm_page_alloc() that duplicates a check performed in vm_page_lookup().
In collaboration with: gallatin, jake, jeff
|
113838 |
22-Apr-2003 |
alc |
Add VM_OBJECT_LOCKED().
|
113791 |
21-Apr-2003 |
alc |
- Assert that the vm_object is locked in vm_object_clear_flag(), vm_object_pip_add() and vm_object_pip_wakeup(). - Remove GIANT_REQUIRED from vm_object_pip_subtract(). - Lock the vm_object when performing vm_object_page_remove().
|
113775 |
20-Apr-2003 |
alc |
- Lock the vm_object when performing either vm_object_clear_flag() or vm_object_pip_wakeup().
|
113768 |
20-Apr-2003 |
alc |
- Update the vm_object locking in vm_map_insert().
|
113765 |
20-Apr-2003 |
alc |
- Lock the vm_object when performing vm_object_pip_wakeup(). - Merge two identical cases in a switch statement.
|
113761 |
20-Apr-2003 |
alc |
- Lock the vm_object when performing vm_object_pip_wakeup().
|
113744 |
20-Apr-2003 |
alc |
- Lock the vm_object when performing vm_object_pip_add(). - Remove an unnecessary variable.
|
113740 |
20-Apr-2003 |
alc |
Update vm_object locking in vm_map_delete().
|
113739 |
20-Apr-2003 |
alc |
- Lock the vm_object when performing vm_object_pip_add().
|
113722 |
19-Apr-2003 |
alc |
- Lock the vm_object when performing vm_object_pip_subtract(). - Assert that the vm_object lock is held in vm_object_pip_subtract().
|
113721 |
19-Apr-2003 |
alc |
- Lock the vm_object when performing vm_object_pip_wakeupn(). - Assert that the vm_object lock is held in vm_object_pip_wakeupn(). - Add a new macro VM_OBJECT_LOCK_ASSERT().
|
113701 |
19-Apr-2003 |
alc |
o Update locking around vm_object_page_remove() in vm_map_clean() to use the new macros. o Remove unnecessary increment and decrement of the vm_object's reference count in vm_map_clean().
|
113699 |
19-Apr-2003 |
alc |
Lock the vm_object in obj_alloc().
|
113671 |
18-Apr-2003 |
alc |
Update locking around vm_object_page_remove() to use the new macros.
|
113665 |
18-Apr-2003 |
gallatin |
Don't grab Giant in slab_zalloc() if M_NOWAIT is specified. This should allow the use of INTR_MPSAFE network drivers.
Tested by: njl Glanced at by: jeff
|
113639 |
17-Apr-2003 |
jhb |
suser() does not need the proc lock, just the setting of P_PROTECTED in p_flag needs the lock.
|
113603 |
17-Apr-2003 |
trhodes |
Add some tunable descriptions.
Submitted by: hmp Discussed with: bde
|
113600 |
17-Apr-2003 |
trhodes |
Pre-content whitespace commit.
Discussed with: bde
|
113489 |
15-Apr-2003 |
alc |
Update locking on the kmem_object to use the new macros.
|
113458 |
14-Apr-2003 |
alc |
Update locking on the kernel_object to use the new macros.
|
113457 |
13-Apr-2003 |
alc |
Lock some manipulations of the vm object's flags.
|
113449 |
13-Apr-2003 |
alc |
Lock some manipulations of the vm object's flags.
|
113448 |
13-Apr-2003 |
alc |
Lock some manipulations of the vm object's flags.
|
113445 |
13-Apr-2003 |
alc |
Add new macros for locking and unlocking a vm object.
|
113419 |
13-Apr-2003 |
alc |
Permit vm_object_pip_add() and vm_object_pip_wakeup() on the kmem_object without Giant held.
|
113418 |
13-Apr-2003 |
alc |
Eliminate unnecessary gotos from kmem_malloc().
|
113343 |
10-Apr-2003 |
jhb |
- Kill the pv_flags member of the alpha mdpage since it stopped being used in rev 1.61 of pmap.c. - Now that pmap_page_is_free() is empty and since it is just a hack for the Alpha pmap, remove it.
|
113138 |
05-Apr-2003 |
alc |
Remove GIANT_REQUIRED from getpbuf(). Reviewed by: tegge
Reduce pbuf_mtx's scope in relpbuf(). Submitted by: tegge
|
113070 |
04-Apr-2003 |
des |
Rename a static variable to avoid future conflicts.
|
112881 |
31-Mar-2003 |
wes |
Add a facility allowing processes to inform the VM subsystem they are critical and should not be killed when pageout is looking for more memory pages in all the wrong places.
Reviewed by: arch@ Sponsored by: St. Bernard Software
|
112835 |
30-Mar-2003 |
mux |
The object type can't be OBJT_PHYS in vm_mmap().
Reviewed by: peter
|
112683 |
26-Mar-2003 |
tegge |
Obtain Giant before calling kmem_alloc without M_NOWAIT and before calling kmem_free if Giant isn't already held.
|
112569 |
25-Mar-2003 |
jake |
- Add vm_paddr_t, a physical address type. This is required for systems where physical addresses are larger than virtual addresses, such as i386s with PAE. - Use this to represent physical addresses in the MI vm system and in the i386 pmap code. This also changes the paddr parameter to d_mmap_t. - Fix printf formats to handle physical addresses >4G in the i386 memory detection code, and due to kvtop returning vm_paddr_t instead of u_long.
Note that this is a name change only; vm_paddr_t is still the same as vm_offset_t on all currently supported platforms.
Sponsored by: DARPA, Network Associates Laboratories Discussed with: re, phk (cdevsw change)
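A userland illustration of why the distinct type and the printf fixes go together; the typedefs and widths below model i386/PAE and are assumptions, not the actual kernel definitions:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * On i386 with PAE, virtual addresses stay 32-bit but physical addresses
 * can exceed 4G, so the two need distinct types.
 */
typedef uint32_t vm_offset_t;	/* virtual address  */
typedef uint64_t vm_paddr_t;	/* physical address */

/*
 * Format a physical address through a type wide enough for >4G values;
 * printing it with a 32-bit format would silently truncate it.
 */
static int
format_paddr(vm_paddr_t pa, char *buf, size_t len)
{
	return (snprintf(buf, len, "0x%jx", (uintmax_t)pa));
}
```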
|
112390 |
19-Mar-2003 |
mux |
Remove an empty comment.
|
112367 |
18-Mar-2003 |
phk |
Including <sys/stdint.h> is (almost?) universally done only to be able to use %j in printfs, so put a nested include in <sys/systm.h> where the printf prototype lives and save everybody else the trouble.
|
112329 |
17-Mar-2003 |
jake |
Subtract the memory that backs the vm_page structures from phys_avail after mapping it. This makes it possible to determine if a physical page has a backing vm_page or not.
|
112312 |
16-Mar-2003 |
jake |
Made the prototypes for pmap_kenter and pmap_kremove MD. These functions are machine dependent because they are not required to update the tlb when mappings are added or removed, and doing so is machine dependent. In addition, an implementation may require that pages mapped with pmap_kenter have a backing vm_page_t, which is not necessarily true of all physical pages, and so may choose to pass the vm_page_t to pmap_kenter instead of the physical address in order to make this requirement clear.
|
112167 |
12-Mar-2003 |
das |
- When the VM daemon is out of swap space and looking for a process to kill, don't block on a map lock while holding the process lock. Instead, skip processes whose map locks are held and find something else to kill. - Add vm_map_trylock_read() to support the above.
Reviewed by: alc, mike (mentor)
|
111977 |
08-Mar-2003 |
ken |
Zero copy send and receive fixes:
- On receive, vm_map_lookup() needs to trigger the creation of a shadow object. To make that happen, call vm_map_lookup() with PROT_WRITE instead of PROT_READ in vm_pgmoveco().
- On send, a shadow object will be created by the vm_map_lookup() in vm_fault(), but vm_page_cowfault() will delete the original page from the backing object rather than simply letting the legacy COW mechanism take over. In other words, the new page should be added to the shadow object rather than replacing the old page in the backing object. (i.e. vm_page_cowfault() should not be called in this case.) We accomplish this by making sure fs.object == fs.first_object before calling vm_page_cowfault() in vm_fault().
Submitted by: gallatin, alc Tested by: ken
|
111937 |
06-Mar-2003 |
alc |
Remove ENABLE_VFS_IOOPT. It is a long unfinished work-in-progress.
Discussed on: arch@
|
111936 |
05-Mar-2003 |
rwatson |
Provide a mac_check_system_swapoff() entry point, which permits MAC modules to authorize disabling of swap against a particular vnode.
Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
|
111883 |
04-Mar-2003 |
jhb |
Replace calls to WITNESS_SLEEP() and witness_list() with equivalent calls to WITNESS_WARN().
|
111732 |
02-Mar-2003 |
phk |
NO_GEOM cleanup:
Use VOP_IOCTL(DIOCGMEDIASIZE) to check the size of a potential swap device instead of the cdevsw->d_psize() method.
|
111712 |
01-Mar-2003 |
alc |
Teach vm_page_sleep_if_busy() to release the vm_object lock before sleeping.
|
111467 |
25-Feb-2003 |
alc |
Fuse two #ifdefs with identical conditions.
|
111463 |
25-Feb-2003 |
jeff |
- Add an interlock argument to BUF_LOCK and BUF_TIMELOCK. - Remove the buftimelock mutex and acquire the buf's interlock to protect these fields instead. - Hold the vnode interlock while locking bufs on the clean/dirty queues. This reduces some cases from one BUF_LOCK with a LK_NOWAIT and another BUF_LOCK with a LK_TIMEFAIL to a single lock.
Reviewed by: arch, mckusick
|
111462 |
25-Feb-2003 |
mux |
Cleanup of the d_mmap_t interface.
- Get rid of the useless atop() / pmap_phys_address() detour. The device mmap handlers must now give back the physical address without atop()'ing it. - Don't borrow the physical address of the mapping in the returned int. Now we properly pass a vm_offset_t * and expect it to be filled by the mmap handler when the mapping was successful. The mmap handler must now return 0 when successful, any other value is considered as an error. Previously, returning -1 was the only way to fail. This change thus accidentally fixes some devices which were bogusly returning errno constants which would have been considered as addresses by the device pager. - Garbage collect the poorly named pmap_phys_address() now that it's no longer used. - Convert all the d_mmap_t consumers to the new API.
I'm still not sure whether we need a __FreeBSD_version bump for this, since we didn't guarantee API/ABI stability until 5.1-RELEASE.
Discussed with: alc, phk, jake Reviewed by: peter Compile-tested on: LINT (i386), GENERIC (alpha and sparc64) Runtime-tested on: i386
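A sketch of the new calling convention described above, in userland C. The device name, base address, and error value are hypothetical; the typedefs are simplified stand-ins:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t vm_offset_t;

#define DEV_SIZE 0x1000		/* hypothetical device aperture size */
#define DEV_BASE 0xf0000000UL	/* hypothetical physical base address */

/*
 * New-style handler: fill *paddr with the physical address (no atop())
 * and return 0 on success; any nonzero return is an error.  This replaces
 * the old style of returning the address itself, with -1 as the only
 * possible failure.
 */
static int
dummy_mmap(vm_offset_t offset, vm_offset_t *paddr, int nprot)
{
	(void)nprot;
	if (offset >= DEV_SIZE)
		return (22);		/* e.g. EINVAL: past the device */
	*paddr = DEV_BASE + offset;
	return (0);
}
```

Because status and address travel separately, an errno-style return can no longer be mistaken for a physical address by the device pager.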
|
111434 |
24-Feb-2003 |
alc |
In vm_page_dirty(), assert that the page is not in the free queue(s).
|
111119 |
19-Feb-2003 |
imp |
Back out M_* changes, per decision of the TRB.
Approved by: trb
|
110983 |
16-Feb-2003 |
alc |
Remove GIANT_REQUIRED from vm_pageq_remove().
|
110958 |
15-Feb-2003 |
alc |
Remove the acquisition and release of Giant around pmap_growkernel(). It's unnecessary for two reasons: (1) Giant is at present already held in such cases and (2) our various implementations of pmap_growkernel() look to be MP safe. (For example, for sparc64 the proof of (2) is trivial.)
|
110957 |
15-Feb-2003 |
alc |
Move kernel_vm_end's declaration to pmap.h; add a comment regarding the synchronization of access to kernel_vm_end.
|
110597 |
09-Feb-2003 |
alc |
Add a comment describing how pagedaemon_wakeup() should be used and synchronized.
Suggested by: tegge
|
110313 |
04-Feb-2003 |
phk |
Change a printf to also tell how many items were left in the zone.
|
110225 |
02-Feb-2003 |
alc |
- It's more accurate to say that vm_paging_needed() returns TRUE than a positive number. - In pagedaemon_wakeup(), set vm_pages_needed to 1 rather than incrementing it to accomplish the same.
|
110218 |
02-Feb-2003 |
alc |
- Convert vm_pageout()'s tsleep()s to msleep()s with the page queue lock.
|
110207 |
01-Feb-2003 |
alc |
- Remove (some) unnecessary explicit initializations to zero. - Style changes to vm_pageout(): declarations and white-space.
|
110204 |
01-Feb-2003 |
alc |
- Convert the tsleep()s in vm_wait() and vm_waitpfault() to msleep()s with the page queue lock. - Assert that the page queue lock is held in vm_page_free_wakeup().
|
109912 |
27-Jan-2003 |
alc |
Simplify vm_object_page_remove(): The object's memq is now ordered. The two cases that existed before for performance optimization purposes can be reduced to one.
|
109820 |
25-Jan-2003 |
alc |
Add MTX_DUPOK to the initialization of system map locks.
|
109630 |
21-Jan-2003 |
alfred |
use 'void *' instead of 'caddr_t' for useracc, kernacc, vslock and vsunlock.
|
109623 |
21-Jan-2003 |
alfred |
Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0. Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
|
109587 |
20-Jan-2003 |
dillon |
Fix swapping to a file, it was broken when SPECSTRATEGY was introduced.
|
109572 |
20-Jan-2003 |
dillon |
Close the remaining user address mapping races for physical I/O, CAM, and AIO. Still TODO: streamline useracc() checks.
Reviewed by: alc, tegge MFC after: 7 days
|
109554 |
20-Jan-2003 |
alc |
- Hold the page queues lock around vm_page_hold(). - Assert that the page queues lock rather than Giant is held in vm_page_hold().
|
109548 |
20-Jan-2003 |
jeff |
- M_WAITOK is 0 and not a real flag. Test for this properly.
Submitted by: tmm Pointy hat to: jeff
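The pitfall being fixed can be shown with hypothetical flag values mirroring the situation (M_WAITOK defined as 0):

```c
#include <assert.h>

/* Hypothetical values: M_WAITOK is 0, so a bitwise AND can't detect it. */
#define M_WAITOK 0x0000
#define M_NOWAIT 0x0001

/* Broken test: (flags & M_WAITOK) is always 0, never true. */
static int
may_sleep_broken(int flags)
{
	return ((flags & M_WAITOK) != 0);
}

/* Correct test: a caller may sleep exactly when M_NOWAIT is absent. */
static int
may_sleep(int flags)
{
	return ((flags & M_NOWAIT) == 0);
}
```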
|
109496 |
18-Jan-2003 |
obrien |
Rev 1.16 renamed VM_METER to VM_TOTAL. This is breaking 3rd-party apps. So add a VM_METER compat define.
Submitted by: Andy Fawcett <andy@athame.co.uk>
|
109342 |
16-Jan-2003 |
dillon |
Merge all the various copies of vm_fault_quick() into a single portable copy.
|
109223 |
14-Jan-2003 |
alc |
- Update vm_pageout_deficit using atomic operations. It's a simple counter outside the scope of existing locks. - Eliminate a redundant clearing of vm_pageout_deficit.
|
109216 |
14-Jan-2003 |
alc |
Make vm_pageout_page_free() static.
|
109205 |
13-Jan-2003 |
dillon |
It is possible for an active aio to prevent shared memory from being dereferenced when a process exits due to the vmspace ref-count being bumped. Change shmexit() and shmexit_myhook() to take a vmspace instead of a process and call it in vmspace_dofree(). This way if it is missed in exit1()'s early-resource-free it will still be caught when the zombie is reaped.
Also fix a potential race in shmexit_myhook() by NULLing out vmspace->vm_shm prior to calling shm_delete_mapping() and free().
MFC after: 7 days
|
109198 |
13-Jan-2003 |
phk |
We can get past here on a normal vnode as well, so use VOP_STRATEGY if so.
|
109153 |
13-Jan-2003 |
dillon |
Bow to the whining masses and change a union back into void *. Retain removal of unnecessary casts and throw in some minor cleanups to see if anyone complains, just for the hell of it.
|
109151 |
12-Jan-2003 |
alc |
Make vm_page_alloc() return PG_ZERO only if VM_ALLOC_ZERO is specified. The objective being to eliminate some cases of page queues locking. (See, for example, vm/vm_fault.c revision 1.160.)
Reviewed by: tegge
(Also, pointed out by tegge that I changed vm_fault.c before changing vm_page.c. Oops.)
|
109131 |
12-Jan-2003 |
alc |
vm_fault_copy_entry() needn't clear PG_ZERO because it didn't pass VM_ALLOC_ZERO to vm_page_alloc().
|
109123 |
12-Jan-2003 |
dillon |
Change struct file f_data to un_data, a union of the correct struct pointer types, and remove a huge number of casts from code using it.
Change struct xfile xf_data to xun_data (ABI is still compatible).
If we need to add a #define for f_data and xf_data we can, but I don't think it will be necessary. There are no operational changes in this commit.
|
109114 |
11-Jan-2003 |
alc |
In vm_page_alloc(), fuse two if statements that are conditioned on the same expression.
|
109097 |
11-Jan-2003 |
dillon |
Make 'sysctl vm.vmtotal' work properly using updated patch from Hiten. (the patch in the PR was stale).
PR: kern/5689 Submitted by: Hiten Pandya <hiten@unixdaemons.com>
|
108963 |
08-Jan-2003 |
alc |
In vm_page_alloc(), honor VM_ALLOC_ZERO for system and interrupt class requests when the number of free pages is below the reserved threshold. Previously, VM_ALLOC_ZERO was only honored when the number of free pages was above the reserved threshold. Honoring it in all cases generally makes sense, does no harm, and simplifies the code.
|
108723 |
05-Jan-2003 |
phk |
Convert VOP_STRATEGY to VOP_SPECSTRATEGY in the generic getpages and the pager input for small filesystems.
|
108693 |
05-Jan-2003 |
alc |
Use atomic add and subtract to update the global wired page count, cnt.v_wire_count.
|
108686 |
04-Jan-2003 |
phk |
Temporarily introduce a new VOP_SPECSTRATEGY operation while I try to sort out disk-io from file-io in the vm/buffer/filesystem space.
The intent is to sort VOP_STRATEGY calls into those which operate on "real" vnodes and those which operate on VCHR vnodes. For the latter kind, the call will be changed to VOP_SPECSTRATEGY, possibly conditionally for those places where dual-use happens.
Add a default VOP_SPECSTRATEGY method which will call the normal VOP_STRATEGY. First time it is called it will print debugging information. This will only happen if a normal vnode is passed to VOP_SPECSTRATEGY by mistake.
Add a real VOP_SPECSTRATEGY in specfs, which does what VOP_STRATEGY does on a VCHR vnode today.
Add a new VOP_STRATEGY method in specfs to catch instances where the conversion to VOP_SPECSTRATEGY has not yet happened. Handle the request just like we always did, but first time called print debugging information.
Apart from up to two instances of console messages per boot, this amounts to a glorified no-op commit.
If you get any of the messages on your console I would very much like a copy of them mailed to phk@freebsd.org
|
108677 |
04-Jan-2003 |
alc |
Allow kmem_malloc() without Giant if M_NOWAIT is specified.
|
108676 |
04-Jan-2003 |
alc |
Use vm_object_lock() and vm_object_unlock() in vm_object_deallocate(). (This procedure needs further work, but this change is sufficient for locking the kmem_object.)
|
108675 |
04-Jan-2003 |
alc |
Refine the assertions in vm_page_alloc().
|
108610 |
03-Jan-2003 |
alc |
Refine the assertion in vm_object_clear_flag() to allow operation on the kmem_object without Giant. In that case, assert that the kmem_object's mutex is held.
|
108609 |
03-Jan-2003 |
phk |
Revert use of dmmax_mask, I had overlooked a '~'.
Spotted by: bde
|
108602 |
03-Jan-2003 |
phk |
Make struct swblock kernel only, to make vm/swap_pager.h userland includable. Move struct swdevt from sys/conf.h to the more appropriate vm/swap_pager.h. Adjust #include use in libkvm and pstat(8) to match.
|
108600 |
03-Jan-2003 |
phk |
Avoid extern decls in .c files by putting them in the vm/swap_pager.h include file where they belong. Share the dmmax_mask variable.
|
108599 |
03-Jan-2003 |
phk |
Use correct _VM_SWAP_PAGER_H_ to check for multiple inclusion.
|
108595 |
03-Jan-2003 |
phk |
Retire sys/dmap.h by including the two lines of it which matters directly in vm/vm_swap.c.
|
108594 |
03-Jan-2003 |
alc |
Lock the vm object when performing vm_object_clear_flag().
|
108589 |
03-Jan-2003 |
phk |
Convert calls to BUF_STRATEGY to VOP_STRATEGY calls. This is a no-op since all BUF_STRATEGY did in the first place was call VOP_STRATEGY.
|
108585 |
03-Jan-2003 |
alc |
Add vm map and vm object locking to vmtotal().
|
108551 |
02-Jan-2003 |
alc |
Lock the vm object when performing vm_object_clear_flag().
|
108534 |
01-Jan-2003 |
alc |
Update the assertions in vm_page_insert() and vm_page_lookup() to reflect locking of the kmem_object.
|
108533 |
01-Jan-2003 |
schweikh |
Correct typos, mostly s/ a / an / where appropriate. Some whitespace cleanup, especially in troff files.
|
108518 |
01-Jan-2003 |
alc |
Add a needed #include.
Reported by: ia64 tinderbox
|
108515 |
31-Dec-2002 |
alc |
Implement a variant locking scheme for vm maps: Access to system maps is now synchronized by a mutex, whereas access to user maps is still synchronized by a lockmgr()-based lock. Why? No single type of lock, including sx locks, meets the requirements of both types of vm map. Sometimes we sleep while holding the lock on a user map. Thus, a mutex isn't appropriate. On the other hand, both lockmgr()-based and sx locks release Giant when a thread/process blocks during contention for a lock. This could lead to a race condition in a legacy driver (that relies on Giant for synchronization) if it attempts to kmem_malloc() and fails to immediately obtain the lock. Fortunately, we never sleep while holding a system map lock.
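A toy userland model of the dispatch this commit introduces; the struct members and lock-kind enum are stand-ins that merely record which primitive would have been used:

```c
#include <assert.h>

enum lock_kind { LK_NONE, LK_MUTEX, LK_SLEEPABLE };

/* Hypothetical miniature of a vm map with the variant-lock flag. */
struct vm_map {
	int		system_map;	/* set for kernel/kmem maps */
	enum lock_kind	locked_with;
};

/*
 * System maps take the (non-sleepable) mutex; user maps take the
 * sleepable lockmgr-style lock.  The assignments stand in for the
 * real mtx_lock()/lockmgr() calls.
 */
static void
vm_map_lock(struct vm_map *map)
{
	if (map->system_map)
		map->locked_with = LK_MUTEX;
	else
		map->locked_with = LK_SLEEPABLE;
}
```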
|
108426 |
30-Dec-2002 |
alc |
- Mark the kernel_map as a system map immediately after its creation. - Correct a cast.
|
108418 |
30-Dec-2002 |
alc |
- Increment the vm_map's timestamp if _vm_map_trylock() succeeds. - Introduce map_sleep_mtx and use it to replace Giant in vm_map_unlock_and_wait() and vm_map_wakeup(). (Original version by: tegge.)
|
108413 |
29-Dec-2002 |
alc |
- Remove vm_object_init2(). It is unused. - Add a mtx_destroy() to vm_object_collapse(). (This allows a bzero() to migrate from _vm_object_allocate() to vm_object_zinit(), where it will be performed less often.)
|
108384 |
29-Dec-2002 |
alc |
Reduce the number of times that we acquire and release the page queues lock by making vm_page_rename()'s caller, rather than vm_page_rename(), responsible for acquiring it.
|
108370 |
28-Dec-2002 |
alc |
Assert that the page queues lock rather than Giant is held in vm_page_flag_clear().
|
108361 |
28-Dec-2002 |
dillon |
vm_pager_put_pages() takes VM_PAGER_* flags, not OBJPC_* flags. It just so happens that OBJPC_SYNC has the same value as VM_PAGER_PUT_SYNC so no harm done. But fix it :-)
No operational changes.
MFC after: 1 day
|
108358 |
28-Dec-2002 |
dillon |
Allow the VM object flushing code to cluster. When the filesystem syncer comes along and flushes a file which has been mmap()'d SHARED/RW, with dirty pages, it was flushing the underlying VM object asynchronously, resulting in thousands of 8K writes. With this change the VM Object flushing code will cluster dirty pages in 64K blocks.
Note that until the low memory deadlock issue is reviewed, it is not safe to allow the pageout daemon to use this feature. Forced pageouts still use fs block size'd ops for the moment.
MFC after: 3 days
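The clustering this commit describes can be illustrated with a toy model: walk a dirty-page bitmap and emit runs of contiguous dirty pages, capping each run at 64K so one I/O covers up to eight 8K pages instead of one. The bitmap representation and the function name are illustrative, not the kernel's actual data structures.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SIZE   8192
#define CLUSTER_MAX (65536 / PAGE_SIZE)   /* 64K cluster = 8 x 8K pages */

/* Scan a dirty-page bitmap and record runs of contiguous dirty pages,
 * each capped at CLUSTER_MAX.  Returns the number of runs found;
 * starts[]/lens[] receive each run's first page and length. */
static int
cluster_dirty(const bool *dirty, int npages, int starts[], int lens[])
{
    int nruns = 0;

    for (int i = 0; i < npages; ) {
        if (!dirty[i]) {
            i++;
            continue;
        }
        starts[nruns] = i;
        int len = 0;
        while (i < npages && dirty[i] && len < CLUSTER_MAX) {
            i++;
            len++;
        }
        lens[nruns++] = len;
    }
    return nruns;
}
```

With this scheme a 10-page dirty run becomes two writes (8 pages + 2 pages) rather than ten single-page writes.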
|
108351 |
28-Dec-2002 |
alc |
Two changes to kmem_malloc(): - Use VM_ALLOC_WIRED. - Perform vm_page_wakeup() after pmap_enter(), like we do everywhere else.
|
108334 |
27-Dec-2002 |
alc |
- Change vm_object_page_collect_flush() to assert rather than acquire the page queues lock. - Acquire the page queues lock in vm_object_page_clean().
|
108306 |
27-Dec-2002 |
alc |
Increase the scope of the page queues lock in phys_pager_getpages().
|
108262 |
24-Dec-2002 |
alc |
- Hold the page queues lock around calls to vm_page_flag_clear().
|
108251 |
24-Dec-2002 |
alc |
- Hold the page queues lock around vm_page_wakeup().
|
108233 |
23-Dec-2002 |
alc |
- Hold the kernel_object's lock around vm_page_insert(..., kernel_object, ...).
|
108197 |
23-Dec-2002 |
alc |
Eliminate some dead code. (Any possible use for this code died with vm/vm_page.c revision 1.220.)
Submitted by: bde
|
108171 |
22-Dec-2002 |
dillon |
The UP -current was not properly counting the per-cpu VM stats in the sysctl code. This makes 'systat -vm 1's syscall count work again.
Submitted by: Michal Mertl <mime@traveller.cz> Note: also slated for 5.0
|
108138 |
20-Dec-2002 |
alc |
Increase the scope of the kmem_object locking in kmem_malloc().
|
108117 |
20-Dec-2002 |
alc |
Add a mutex to struct vm_object. Initialize and destroy that mutex at appropriate times. For the moment, the mutex is only used on the kmem_object.
|
108101 |
19-Dec-2002 |
alc |
Remove the hash_rand field from struct vm_object. As of revision 1.215 of vm/vm_page.c, it is unused.
|
108081 |
19-Dec-2002 |
alc |
- Remove vm_page_sleep_busy(). The transition to vm_page_sleep_if_busy(), which incorporates page queue and field locking, is complete. - Assert that the page queue lock rather than Giant is held in vm_page_flag_set().
|
108068 |
19-Dec-2002 |
alc |
- Hold the page queues lock when performing vm_page_busy() or vm_page_flag_set(). - Replace vm_page_sleep_busy() with proper page queues locking and vm_page_sleep_if_busy().
|
108012 |
18-Dec-2002 |
alc |
- Hold the page queues lock when performing vm_page_busy(). - Replace vm_page_sleep_busy() with proper page queues locking and vm_page_sleep_if_busy().
|
108011 |
18-Dec-2002 |
alc |
Hold the page queues lock when performing vm_page_flag_set().
|
107989 |
17-Dec-2002 |
alc |
Hold the page queues lock when performing vm_page_flag_set().
|
107948 |
16-Dec-2002 |
dillon |
Change the way ELF coredumps are handled. Instead of unconditionally skipping read-only pages, which can result in valuable non-text-related data not getting dumped, the ELF loader and the dynamic loader now mark read-only text pages NOCORE and the coredump code only checks (primarily) for complete inaccessibility of the page or NOCORE being set.
Certain applications which map large amounts of read-only data will produce much larger cores. A new sysctl has been added, debug.elf_legacy_coredump, which will revert to the old behavior.
This commit represents collaborative work by all parties involved. The PR contains a program demonstrating the problem.
PR: kern/45994 Submitted by: "Peter Edwards" <pmedwards@eircom.net>, Archie Cobbs <archie@dellroad.org> Reviewed by: jdp, dillon MFC after: 7 days
|
107918 |
15-Dec-2002 |
alc |
Perform vm_object_lock() and vm_object_unlock() on kmem_object around vm_page_lookup() and vm_page_free().
|
107913 |
15-Dec-2002 |
dillon |
This is David Schultz's swapoff code which I am finally able to commit. This should be considered highly experimental for the moment.
Submitted by: David Schultz <dschultz@uclink.Berkeley.EDU> MFC after: 3 weeks
|
107912 |
15-Dec-2002 |
dillon |
Fix a refcount race with the vmspace structure. In order to prevent resource starvation we clean-up as much of the vmspace structure as we can when the last process using it exits. The rest of the structure is cleaned up when it is reaped. But since exit1() decrements the ref count it is possible for a double-free to occur if someone else, such as the process swapout code, references and then dereferences the structure. Additionally, the final cleanup of the structure should not occur until the last process referencing it is reaped.
This commit solves the problem by introducing a secondary reference count, called 'vm_exitingcnt'. The normal reference count is decremented on exit and vm_exitingcnt is incremented. vm_exitingcnt is decremented when the process is reaped. When both vm_exitingcnt and vm_refcnt are 0, the structure is freed for real.
MFC after: 3 weeks
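The two-counter lifecycle described in this entry can be sketched as follows. This is a toy model: the real struct vmspace carries the map, pmap, and statistics, everything runs under locking, and 'freed' merely stands in for the actual free.

```c
#include <assert.h>

struct vmspace {
    int vm_refcnt;      /* live references */
    int vm_exitingcnt;  /* exited-but-not-yet-reaped processes */
    int freed;          /* stand-in for the final free */
};

static void
vmspace_maybe_free(struct vmspace *vm)
{
    /* The structure is freed for real only when both counts are 0. */
    if (vm->vm_refcnt == 0 && vm->vm_exitingcnt == 0)
        vm->freed = 1;
}

/* exit1(): drop the normal reference, gain an exiting reference. */
static void
vmspace_exit(struct vmspace *vm)
{
    vm->vm_exitingcnt++;
    if (--vm->vm_refcnt == 0) {
        /* early cleanup of most of the vmspace would happen here */
    }
    vmspace_maybe_free(vm);
}

/* wait()/reaping: drop the exiting reference. */
static void
vmspace_reap(struct vmspace *vm)
{
    vm->vm_exitingcnt--;
    vmspace_maybe_free(vm);
}
```

Because the final free waits for the last reap, a racing reference taken by, say, the swapout code can no longer trigger a double-free.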
|
107893 |
15-Dec-2002 |
alc |
As per the comments, vm_object_page_remove() now expects its caller to lock the object (i.e., acquire Giant).
|
107892 |
15-Dec-2002 |
alc |
Perform vm_object_lock() and vm_object_unlock() around vm_object_page_remove().
|
107891 |
15-Dec-2002 |
alc |
Perform vm_object_lock() and vm_object_unlock() around vm_object_page_remove().
|
107887 |
15-Dec-2002 |
alc |
Assert that the page queues lock is held in vm_page_unhold(), vm_page_remove(), and vm_page_free_toq().
|
107464 |
01-Dec-2002 |
alc |
Hold the page queues lock when calling pmap_protect(); it updates fields of the vm_page structure. Make the style of the pmap_protect() calls consistent.
Approved by: re (blanket)
|
107436 |
01-Dec-2002 |
alc |
Hold the page queues lock when calling pmap_protect(); it updates fields of the vm_page structure. Nearby, remove an unnecessary semicolon and return statement.
Approved by: re (blanket)
|
107433 |
01-Dec-2002 |
alc |
Increase the scope of the page queue lock in vm_pageout_scan().
Approved by: re (blanket)
|
107370 |
28-Nov-2002 |
alc |
Lock page field accesses in mincore().
Approved by: re (blanket)
|
107347 |
27-Nov-2002 |
alc |
Hold the page queues lock when performing pmap_clear_modify().
Approved by: re (blanket)
|
107304 |
27-Nov-2002 |
alc |
Hold the page queues lock while performing pmap_page_protect().
Approved by: re (blanket)
|
107250 |
25-Nov-2002 |
alc |
Acquire and release the page queues lock around calls to pmap_protect() because it updates flags within the vm page.
Approved by: re (blanket)
|
107200 |
24-Nov-2002 |
alc |
Extend the scope of the page queues/fields locking in vm_freeze_copyopts() to cover pmap_remove_all().
Approved by: re
|
107189 |
23-Nov-2002 |
alc |
Hold the page queues/flags lock when calling vm_page_set_validclean().
Approved by: re
|
107185 |
23-Nov-2002 |
alc |
Assert that the page queues lock rather than Giant is held in vm_pageout_page_free().
Approved by: re
|
107182 |
23-Nov-2002 |
alc |
Add page queue and flag locking in vnode_pager_setsize().
Approved by: re
|
107136 |
21-Nov-2002 |
jeff |
- Add an event that is triggered when the system is low on memory. This is intended to be used by significant memory consumers so that they may drain some of their caches.
Inspired by: phk Approved by: re Tested on: x86, alpha
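The low-memory event above can be sketched as a simple callback registry. The names (vm_lowmem_register, vm_lowmem_fire) and the fixed array are hypothetical; the kernel uses its generic event-handler facility rather than anything this crude.

```c
#include <assert.h>

#define MAXHANDLERS 8

typedef void (*vm_lowmem_fn)(void *arg);

static struct {
    vm_lowmem_fn fn;
    void *arg;
} lowmem_handlers[MAXHANDLERS];
static int nhandlers;

/* A significant memory consumer registers a drain callback. */
static void
vm_lowmem_register(vm_lowmem_fn fn, void *arg)
{
    if (nhandlers < MAXHANDLERS) {
        lowmem_handlers[nhandlers].fn = fn;
        lowmem_handlers[nhandlers].arg = arg;
        nhandlers++;
    }
}

/* Invoked from the pageout path when free pages run low. */
static void
vm_lowmem_fire(void)
{
    for (int i = 0; i < nhandlers; i++)
        lowmem_handlers[i].fn(lowmem_handlers[i].arg);
}

/* Example consumer: a cache that empties itself under pressure. */
static void
drain_cache(void *arg)
{
    *(int *)arg = 0;
}
```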
|
107048 |
18-Nov-2002 |
jeff |
- Wakeup the correct address when a zone is no longer full.
Spotted by: jake
|
107039 |
18-Nov-2002 |
alc |
Remove vm_page_protect(). Instead, use pmap_page_protect() directly.
|
106992 |
16-Nov-2002 |
jeff |
- Don't forget the flags value when using boot pages.
Reported by: grehan
|
106981 |
16-Nov-2002 |
alc |
Now that pmap_remove_all() is exported by our pmap implementations use it directly.
|
106871 |
13-Nov-2002 |
alc |
Remove dead code that hasn't been needed since the demise of share maps in various revisions of vm/vm_map.c between 1.148 and 1.153.
|
106838 |
13-Nov-2002 |
alc |
Move pmap_collect() out of the machine-dependent code, rename it to reflect its new location, and add page queue and flag locking.
Notes: (1) alpha, i386, and ia64 had identical implementations of pmap_collect() in terms of machine-independent interfaces; (2) sparc64 doesn't require it; (3) powerpc had it as a TODO.
|
106778 |
11-Nov-2002 |
cognet |
Remove extra #include<sys/vmmeter.h>.
|
106773 |
11-Nov-2002 |
mjacob |
atomic_set_8 isn't MI. Instead, follow Jake's suggestions about ZONE_LOCK.
|
106753 |
11-Nov-2002 |
alc |
- Clear the page's PG_WRITEABLE flag in the i386's pmap_changebit() if we're removing write access from the page's PTEs. - Export pmap_remove_all() on alpha, i386, and ia64. (It's already exported on sparc64.)
|
106733 |
10-Nov-2002 |
mjacob |
Use atomic_set_8 on the us_freelist maps as they are not otherwise protected. Furthermore, in some RISC architectures with no normal byte operations, the surrounding 3 bytes are also affected by the read-modify-write that has to occur.
|
106720 |
10-Nov-2002 |
alc |
When prot is VM_PROT_NONE, call pmap_page_protect() directly rather than indirectly through vm_page_protect(). The one remaining page flag that is updated by vm_page_protect() is already being updated by our various pmap implementations.
Note: A later commit will similarly change the VM_PROT_READ case and eliminate vm_page_protect().
|
106708 |
09-Nov-2002 |
alc |
Fix an error case in vm_map_wire(): unwiring of an entry during cleanup after a user wire error fails when the entry is already system wired.
Reported by: tegge
|
106691 |
09-Nov-2002 |
alc |
In vm_page_remove(), avoid calling vm_page_splay() if the object's memq is empty.
|
106605 |
07-Nov-2002 |
tmm |
Move the definitions of the hw.physmem, hw.usermem and hw.availpages sysctls to MI code; this reduces code duplication and makes all of them available on sparc64, and the latter two on powerpc. The semantics of the i386 and pc98 hw.availpages are slightly changed: previously, holes between ranges of available pages would be included, while they are excluded now. The new behaviour should be more correct and brings i386 in line with the other architectures.
Move physmem to vm/vm_init.c, where this variable is used in MI code.
|
106603 |
07-Nov-2002 |
mux |
Better printf() formats.
|
106602 |
07-Nov-2002 |
mux |
Some more printf() format fixes.
|
106600 |
07-Nov-2002 |
mux |
Correctly print vm_offset_t types.
|
106422 |
04-Nov-2002 |
alc |
Export the function vm_page_splay().
|
106387 |
03-Nov-2002 |
alc |
- Remove the memory allocation for the object/offset hash table because it's no longer used. (See revision 1.215.) - Fix a harmless bug: the number of vm_page structures allocated wasn't properly adjusted when uma_bootstrap() was introduced. Consequently, we were allocating 30 unused vm_page structures. - Wrap a long line.
|
106359 |
02-Nov-2002 |
alc |
Remove the vm page buckets mutex. As of revision 1.215 of vm/vm_page.c, it is unused.
|
106277 |
01-Nov-2002 |
jeff |
- Add support for machine-dependent page allocation routines. MD code may define UMA_MD_SMALL_ALLOC to make use of this feature.
Reviewed by: peter, jake
|
106276 |
01-Nov-2002 |
jeff |
- Add a new flag to vm_page_alloc, VM_ALLOC_NOOBJ. This tells vm_page_alloc not to insert this page into an object. The pindex is still used for colorization. - Rework vm_page_select_* to accept a color instead of an object and pindex to work with VM_PAGE_NOOBJ. - Document other VM_ALLOC_ flags.
Reviewed by: peter, jake
|
106023 |
27-Oct-2002 |
rwatson |
Merge from MAC tree: rename mac_check_vnode_swapon() to mac_check_system_swapon(), to reflect the fact that the primary object of this change is the running kernel as a whole, rather than just the vnode. We'll drop additional checks of this class into the same check namespace, including reboot(), sysctl(), et al.
Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
|
105853 |
24-Oct-2002 |
jeff |
- Now that uma_zalloc_internal is not the fast path don't be so fussy about extra function calls. Refactor uma_zalloc_internal into separate functions for finding the most appropriate slab, filling buckets, allocating single items, and pulling items off of slabs. This makes the code significantly cleaner. - This also fixes the "Returning an empty bucket." panic that a few people have seen.
Tested On: alpha, x86
|
105848 |
24-Oct-2002 |
jeff |
- Move the destructor calls so that they are not called with the zone lock held. This avoids a lock order reversal when destroying zones. Unfortunately, this also means that the free checks are not done before the destructor is called.
Reported by: phk
|
105718 |
22-Oct-2002 |
rwatson |
Invoke mac_check_vnode_mmap() during mmap operations on vnodes, permitting policies to restrict access to memory mapping based on the credential requesting the mapping, the target vnode, the requested rights, or other policy considerations.
Approved by: re Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
|
105717 |
22-Oct-2002 |
rwatson |
Introduce MAC_CHECK_VNODE_SWAPON, which permits MAC policies to perform authorization checks during swapon() events; policies might choose to enforce protections based on the credential requesting the swap configuration, the target of the swap operation, or other factors such as internal policy state.
Approved by: re Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
|
105695 |
22-Oct-2002 |
jhb |
- Check that a process isn't a new process (p_state == PRS_NEW) before trying to acquire its proc lock since the proc lock may not have been constructed yet. - Split up the one big comment at the top of the loop and put the pieces in the right order above the various checks.
Reported by: kris (1)
|
105689 |
22-Oct-2002 |
sheldonh |
Fix typo in comments (misspelled "necessary").
|
105549 |
20-Oct-2002 |
alc |
o Reinline vm_page_undirty(), reducing the kernel size. (This reverts a part of vm_page.h revision 1.87 and vm_page.c revision 1.167.)
|
105466 |
19-Oct-2002 |
alc |
Complete the page queues locking needed for the page-based copy- on-write (COW) mechanism. (This mechanism is used by the zero-copy TCP/IP implementation.) - Extend the scope of the page queues lock in vm_fault() to cover vm_page_cowfault(). - Modify vm_page_cowfault() to release the page queues lock if it sleeps.
|
105407 |
18-Oct-2002 |
dillon |
Replace the vm_page hash table with a per-vmobject splay tree. There should be no major change in performance from this change at this time but this will allow other work to progress: Giant lock removal around VM system in favor of per-object mutexes, ranged fsyncs, more optimal COMMIT rpc's for NFS, partial filesystem syncs by the syncer, more optimal object flushing, etc. Note that the buffer cache is already using a similar splay tree mechanism.
Note that a good chunk of the old hash table code is still in the tree. Alan or I will remove it prior to the release if the new code does not introduce unsolvable bugs, else we can revert more easily.
Submitted by: alc (this is Alan's code) Approved by: re
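The per-object splay tree can be sketched with the classic Sleator/Tarjan top-down splay on integer keys. This is a toy stand-in: 'key' plays the role of a vm_page's pindex, whereas the kernel's version operates on struct vm_page nodes linked through the object.

```c
#include <assert.h>
#include <stddef.h>

typedef struct node {
    int key;                    /* stands in for a page's pindex */
    struct node *left, *right;
} node;

/* Top-down splay: move the node with 'key' (or a neighbor, on a
 * miss) to the root and return the new root. */
static node *
splay(node *t, int key)
{
    node N, *l, *r, *y;

    if (t == NULL)
        return NULL;
    N.left = N.right = NULL;
    l = r = &N;
    for (;;) {
        if (key < t->key) {
            if (t->left == NULL)
                break;
            if (key < t->left->key) {       /* rotate right */
                y = t->left;
                t->left = y->right;
                y->right = t;
                t = y;
                if (t->left == NULL)
                    break;
            }
            r->left = t;                    /* link right */
            r = t;
            t = t->left;
        } else if (key > t->key) {
            if (t->right == NULL)
                break;
            if (key > t->right->key) {      /* rotate left */
                y = t->right;
                t->right = y->left;
                y->left = t;
                t = y;
                if (t->right == NULL)
                    break;
            }
            l->right = t;                   /* link left */
            l = t;
            t = t->right;
        } else
            break;
    }
    l->right = t->left;                     /* reassemble */
    r->left = t->right;
    t->left = N.right;
    t->right = N.left;
    return t;
}

/* Insert 'n' (key already set) and return the new root. */
static node *
splay_insert(node *root, node *n)
{
    if (root == NULL) {
        n->left = n->right = NULL;
        return n;
    }
    root = splay(root, n->key);
    if (n->key < root->key) {
        n->left = root->left;
        n->right = root;
        root->left = NULL;
    } else {
        n->right = root->right;
        n->left = root;
        root->right = NULL;
    }
    return n;
}
```

The self-adjusting property is what makes this attractive for page lookups: repeated access to nearby pindexes keeps them near the root, with no separate hash table to size or lock.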
|
105229 |
16-Oct-2002 |
phk |
Properly put macro args in ().
Spotted by: FlexeLint.
|
105126 |
14-Oct-2002 |
julian |
Remove old useless debugging code
|
104964 |
12-Oct-2002 |
jeff |
- Create a new scheduler api that is defined in sys/sched.h - Begin moving scheduler specific functionality into sched_4bsd.c - Replace direct manipulation of scheduler data with hooks provided by the new api. - Remove KSE specific state modifications and single runq assumptions from kern_switch.c
Reviewed by: -arch
|
104387 |
02-Oct-2002 |
jhb |
Rename the mutex thread and process states to use a more generic 'LOCK' name instead. (e.g., SLOCK instead of SMTX, TD_ON_LOCK() instead of TD_ON_MUTEX()) Eventually a turnstile abstraction will be added that will be shared with mutexes and other types of locks. SLOCK/TDI_LOCK will be used internally by the turnstile code and will not be specific to mutexes. Making the change now ensures that turnstiles can be dropped in at a later date without affecting the ABI of userland applications.
|
104354 |
02-Oct-2002 |
scottl |
Some kernel threads try to do significant work, and the default KSTACK_PAGES doesn't give them enough stack to do much before blowing away the pcb. This adds MI and MD code to allow the allocation of an alternate kstack whose size can be specified when calling kthread_create. Passing the value 0 prevents the alternate kstack from being created. Note that the ia64 MD code is missing for now, and PowerPC was only partially written due to the pmap.c being incomplete there. Though this patch does not modify anything to make use of the alternate kstack, acpi and usb are good candidates.
Reviewed by: jake, peter, jhb
|
104094 |
28-Sep-2002 |
phk |
Be consistent about "static" functions: if the function is marked static in its prototype, mark it static at the definition too.
Inspired by: FlexeLint warning #512
|
103925 |
25-Sep-2002 |
jeff |
- Get rid of the unused LK_NOOBJ.
|
103924 |
25-Sep-2002 |
jeff |
- Lock access to numoutput on the swap devices.
|
103923 |
25-Sep-2002 |
jeff |
- Add a ASSERT_VOP_LOCKED in vnode_pager_alloc. - Lock access to v_iflags.
|
103794 |
22-Sep-2002 |
mdodd |
Modify vm_map_clean() (and thus the msync(2) system call) to support invalidation of cached pages for objects of type OBJT_DEVICE.
Submitted by: Christian Zander <zander@minion.de> Approved by: alc
|
103777 |
22-Sep-2002 |
alc |
o Update some comments.
|
103767 |
21-Sep-2002 |
jake |
Use the fields in the sysentvec and in the vm map header in place of the constants VM_MIN_ADDRESS, VM_MAXUSER_ADDRESS, USRSTACK and PS_STRINGS. This is mainly so that they can be variable even for the native abi, based on different machine types. Get stack protections from the sysentvec too. This makes it trivial to map the stack non-executable for certain abis, on machines that support it.
|
103732 |
21-Sep-2002 |
alc |
Reduce namespace pollution.
Submitted by: bde
|
103623 |
19-Sep-2002 |
jeff |
- Use my freebsd email alias in the copyright. - Remove redundant instances of my email alias in the file summary.
|
103531 |
18-Sep-2002 |
jeff |
- Split UMA_ZFLAG_OFFPAGE into UMA_ZFLAG_OFFPAGE and UMA_ZFLAG_HASH. - Remove all instances of the mallochash. - Stash the slab pointer in the vm page's object pointer when allocating from the kmem_obj. - Use the overloaded object pointer to find slabs for malloced memory.
|
103314 |
14-Sep-2002 |
njl |
Remove all use of vnode->v_tag, replacing with appropriate substitutes. v_tag is now const char * and should only be used for debugging.
Additionally: 1. All users of VT_NTS now check vfsconf->vf_type VFCF_NETWORK 2. The user of VT_PROCFS now checks for the new flag VV_PROCDEP, which is propagated by pseudofs to all child vnodes if the fs sets PFS_PROCDEP.
Suggested by: phk Reviewed by: bde, rwatson (earlier version)
|
103216 |
11-Sep-2002 |
julian |
Completely redo thread states.
Reviewed by: davidxu@freebsd.org
|
103123 |
09-Sep-2002 |
tanimura |
- Do not swap out a process if it is in creation. The process may have no address space yet.
- Check whether a process is a system process prior to dereferencing its p_vmspace. Aio assumes that only the curthread switches the address space of a system process.
|
103002 |
06-Sep-2002 |
julian |
Use UMA as a complex object allocator. The process allocator now caches and hands out complete process structures *including substructures*.
i.e., it gets the process structure with the first thread (and soon KSE) already allocated and attached, all in one hit.
For the average non threaded program (non KSE that is) the allocated thread and its stack remain attached to the process, even when the process is unused and in the process cache. This saves having to allocate and attach it later, effectively bringing us (hopefully) close to the efficiency of pre-KSE systems where these were a single structure.
Reviewed by: davidxu@freebsd.org, peter@freebsd.org
|
102966 |
05-Sep-2002 |
bde |
Use `struct uma_zone *' instead of uma_zone_t, so that <sys/uma.h> isn't a prerequisite.
|
102950 |
05-Sep-2002 |
davidxu |
s/SGNL/SIG/ s/SNGL/SINGLE/ s/SNGLE/SINGLE/
Fix abbreviations for the P_STOPPED_* etc. flags; in the original code they were inconsistent and difficult to distinguish.
Approved by: julian (mentor)
|
102835 |
02-Sep-2002 |
alc |
o Synchronize updates to struct vm_page::cow with the page queues lock.
|
102738 |
31-Aug-2002 |
dillon |
Reduce the maximum KVA reserved for swap meta structures from 70 to 32 MB. Reduce the swap meta calculation by a factor of 2, it's still massive overkill.
X-MFC after: immediately
|
102600 |
30-Aug-2002 |
peter |
Change hw.physmem and hw.usermem to unsigned long like they used to be in the original hardwired sysctl implementation.
The buf size calculator still overflows an integer on machines with large KVA (eg: ia64) where the number of pages does not fit into an int. Use 'long' there.
Change Maxmem and physmem and related variables to 'long', mostly for completeness. Machines are not likely to overflow 'int' pages in the near term, but then again, 640K ought to be enough for anybody. This comes for free on 32 bit machines, so why not?
|
102399 |
25-Aug-2002 |
alc |
o Retire pmap_pageable(). It's an advisory routine that none of our platforms implements.
|
102382 |
25-Aug-2002 |
alc |
o Retire vm_page_zero_fill() and vm_page_zero_fill_area(). Ever since pmap_zero_page() and pmap_zero_page_area() were modified to accept a struct vm_page * instead of a physical address, vm_page_zero_fill() and vm_page_zero_fill_area() have served no purpose.
|
102372 |
24-Aug-2002 |
alc |
o Use vm_object_lock() in place of directly locking Giant.
Reviewed by: md5
|
102370 |
24-Aug-2002 |
alc |
o Use vm_object_lock() in place of Giant when manipulating a vm object in vm_map_insert().
|
102349 |
24-Aug-2002 |
alc |
o Resurrect vm_object_lock() and vm_object_unlock() from revision 1.19. (For now, they simply acquire and release Giant.)
|
102241 |
21-Aug-2002 |
archie |
Don't use "NULL" when "0" is really meant.
|
101657 |
11-Aug-2002 |
alc |
o Assert that the page queues lock is held in vm_page_activate().
|
101656 |
11-Aug-2002 |
alc |
o Lock page queue accesses by vm_page_activate().
|
101655 |
10-Aug-2002 |
alc |
o Lock page queue accesses by vm_page_activate().
|
101654 |
10-Aug-2002 |
alc |
o Move a call to vm_page_wakeup() inside the scope of the page queues lock.
|
101645 |
10-Aug-2002 |
alc |
o Remove the setting and clearing of the PG_MAPPED flag from the alpha and ia64 pmap. o Remove the PG_MAPPED flag's declaration.
|
101634 |
10-Aug-2002 |
alc |
o Remove the setting and clearing of the PG_MAPPED flag. (This flag is obsolete.)
|
101543 |
08-Aug-2002 |
alc |
o Use pmap_page_is_mapped() in vm_page_protect() rather than the PG_MAPPED flag. (This is the only place in the entire kernel where the PG_MAPPED flag is tested. It will be removed soon.)
|
101327 |
04-Aug-2002 |
alc |
o Acquire the page queues lock before checking the page's busy status in vm_page_grab(). Also, replace the nearby tsleep() with an msleep() on the page queues lock.
|
101308 |
04-Aug-2002 |
jeff |
- Replace v_flag with v_iflag and v_vflag - v_vflag is protected by the vnode lock and is used when synchronization with VOP calls is needed. - v_iflag is protected by interlock and is used for dealing with vnode management issues. These flags include X/O LOCK, FREE, DOOMED, etc. - All accesses to v_iflag and v_vflag have either been locked or marked with mp_fixme's. - Many ASSERT_VOP_LOCKED calls have been added where the locking was not clear. - Many functions in vfs_subr.c were restructured to provide for stronger locking.
Idea stolen from: BSD/OS
|
101304 |
04-Aug-2002 |
alc |
o Extend the scope of the page queues lock in contigmalloc1(). o Replace vm_page_sleep_busy() with vm_page_sleep_if_busy() in vm_contig_launder().
|
101250 |
03-Aug-2002 |
alc |
o Remove the setting of PG_MAPPED from vm_page_wire() and vm_page_alloc(VM_ALLOC_WIRED).
|
101236 |
02-Aug-2002 |
alc |
o Convert two instances of vm_page_sleep_busy() into vm_page_sleep_if_busy() with appropriate page queue locking.
|
101200 |
02-Aug-2002 |
alc |
o Lock page queue accesses in nwfs and smbfs. o Assert that the page queues lock is held in vm_page_deactivate().
|
101196 |
02-Aug-2002 |
alc |
o Lock page queue accesses by vm_page_deactivate().
|
101174 |
01-Aug-2002 |
alc |
o Acquire the page queues lock before calling vm_page_io_finish(). o Assert that the page queues lock is held in vm_page_io_finish().
|
101105 |
31-Jul-2002 |
alc |
o Setting PG_MAPPED and PG_WRITEABLE on pages that are mapped and unmapped by pmap_qenter() and pmap_qremove() is pointless. In fact, it probably leads to unnecessary pmap_page_protect() calls if one of these pages is paged out after unwiring.
Note: setting PG_MAPPED asserts that the page's pv list may be non-empty. Since checking the status of the page's pv list isn't any harder than checking this flag, the flag should probably be eliminated. Alternatively, PG_MAPPED could be set by pmap_enter() exclusively rather than various places throughout the kernel.
|
101019 |
31-Jul-2002 |
alc |
o Lock page accesses by vm_page_io_start() with the page queues lock. o Assert that the page queues lock is held in vm_page_io_start().
|
100915 |
30-Jul-2002 |
alc |
o In vm_object_madvise() and vm_object_page_remove() replace vm_page_sleep_busy() with vm_page_sleep_if_busy(). At the same time, increase the scope of the page queues lock. (This should significantly reduce the locking overhead in vm_object_page_remove().) o Apply some style fixes.
|
100913 |
30-Jul-2002 |
tanimura |
- Optimize wakeup() and its friends; if a thread woken up is being swapped in, we do not have to ask for the scheduler thread to do that.
- Assert that a process is not swapped out in runq functions and swapout().
- Introduce thread_safetoswapout() for readability.
- In swapout_procs(), perform a test that may block (check of a thread working on its vm map) first. This lets us call swapout() with the sched_lock held, providing better atomicity.
|
100889 |
29-Jul-2002 |
alc |
o Introduce vm_page_sleep_if_busy() as an eventual replacement for vm_page_sleep_busy(). vm_page_sleep_if_busy() uses the page queues lock.
|
100885 |
29-Jul-2002 |
julian |
Remove a XXXKSE comment. the code is no longer a problem..
|
100884 |
29-Jul-2002 |
julian |
Create a new thread state to describe threads that would be ready to run except for the fact that they are presently swapped out. Also add a process flag to indicate that the process has started the struggle to swap back in. This will be needed for the case where multiple threads start the swapin action, to stop a collision. Also add code to stop a process from being swapped out if one of the threads in this process is actually off running on another CPU.. that might hurt...
Submitted by: Seigo Tanimura <tanimura@r.dl.itc.u-tokyo.ac.jp>
|
100862 |
29-Jul-2002 |
alc |
o Pass VM_ALLOC_WIRED to vm_page_grab() rather than calling vm_page_wire() in pmap_new_thread(), pmap_pinit(), and vm_proc_new(). o Lock page queue accesses by vm_page_free() in pmap_object_init_pt().
|
100836 |
28-Jul-2002 |
alc |
o Modify vm_page_grab() to accept VM_ALLOC_WIRED.
|
100832 |
28-Jul-2002 |
alc |
o Lock page queue accesses by vm_page_free(). o Apply some style fixes.
|
100829 |
28-Jul-2002 |
alc |
o Lock page queue accesses by vm_page_free().
|
100797 |
28-Jul-2002 |
alc |
o Lock page queue accesses by vm_page_free(). o Increment cnt.v_dfree inside vm_pageout_page_free() rather than at each call.
|
100796 |
28-Jul-2002 |
alc |
o Lock page queue accesses by vm_page_free().
|
100779 |
27-Jul-2002 |
alc |
o Require that the page queues lock is held on entry to vm_pageout_clean() and vm_pageout_flush(). o Acquire the page queues lock before calling vm_pageout_clean() or vm_pageout_flush().
|
100742 |
27-Jul-2002 |
alc |
o Lock page queue accesses by vm_page_activate().
|
100740 |
27-Jul-2002 |
alc |
o Lock page queue accesses by vm_page_activate() and vm_page_deactivate() in vm_pageout_object_deactivate_pages(). o Apply some style fixes to vm_pageout_object_deactivate_pages().
|
100736 |
27-Jul-2002 |
alc |
o Lock page queue accesses by vm_page_activate() and vm_page_deactivate().
|
100686 |
25-Jul-2002 |
alc |
o Remove a vm_page_deactivate() that is immediately followed by a vm_page_rename() from vm_object_backing_scan(). vm_page_rename() also performs vm_page_deactivate() on pages in the cache queues, making the removed vm_page_deactivate() redundant.
|
100630 |
24-Jul-2002 |
alc |
o Merge vm_fault_wire() and vm_fault_user_wire() by adding a new parameter, user_wire.
|
100545 |
23-Jul-2002 |
alc |
o Lock page queue accesses by vm_page_dontneed(). o Assert that the page queue lock is held in vm_page_dontneed().
|
100542 |
23-Jul-2002 |
alc |
o Extend the scope of the page queues lock in vm_pageout_scan() to cover the traversal of the cache queue.
|
100512 |
22-Jul-2002 |
alfred |
Change struct vmspace->vm_shm from void * to struct shmmap_state *, this removes the need for casts in several cases.
|
100511 |
22-Jul-2002 |
alfred |
Remove caddr_t.
|
100456 |
21-Jul-2002 |
alc |
o Lock page queue accesses by vm_page_free() and vm_page_deactivate().
|
100452 |
21-Jul-2002 |
alc |
o Lock page queue accesses by vm_page_free().
|
100438 |
21-Jul-2002 |
tanimura |
Do not pass a thread with the state TDS_RUNQ to setrunqueue(), otherwise the assertion in setrunqueue() fails.
|
100415 |
20-Jul-2002 |
alc |
o Lock page queue accesses by vm_page_try_to_cache(). (The accesses in kern/vfs_bio.c are already locked.) o Assert that the page queues lock is held in vm_page_try_to_cache().
|
100414 |
20-Jul-2002 |
alc |
o Assert that the page queues lock is held in vm_page_try_to_free().
|
100413 |
20-Jul-2002 |
alc |
o Lock page queue accesses by vm_page_cache() in vm_fault() and vm_pageout_scan(). (The others are already locked.) o Assert that the page queues lock is held in vm_page_cache().
|
100407 |
20-Jul-2002 |
alc |
o Lock accesses to the active page queue in vm_pageout_scan() and vm_pageout_page_stats().
|
100397 |
20-Jul-2002 |
alc |
o Lock page queue accesses by vm_page_cache() in vm_contig_launder(). o Micro-optimize the control flow in vm_contig_launder().
|
100396 |
20-Jul-2002 |
alc |
o Remove dead and/or unused code.
|
100384 |
20-Jul-2002 |
peter |
Infrastructure tweaks to allow having both an Elf32 and an Elf64 executable handler in the kernel at the same time. Also, allow for the exec_new_vmspace() code to build a different sized vmspace depending on the executable environment. This is a big help for execing i386 binaries on ia64. The ELF exec code grows the ability to map partial pages when there is a page size difference, eg: emulating 4K pages on 8K or 16K hardware pages.
Flesh out the i386 emulation support for ia64. At this point, the only binary that I know of that fails is cvsup, because the cvsup runtime tries to execute code in pages not marked executable.
Obtained from: dfr (mostly, many tweaks from me).
|
100379 |
19-Jul-2002 |
peter |
Set P_NOLOAD on the pagezero kthread so that it doesn't artificially skew the loadav. This is not real load. If you have a nice process running in the background, pagezero may sit in the run queue for ages, add one to the loadav, and thereby affect other scheduling decisions.
|
100342 |
19-Jul-2002 |
alc |
o Duplicate an odd side-effect of vm_page_wire() in vm_page_allocate() when VM_ALLOC_WIRED is specified: set the PG_MAPPED bit in flags. o In both vm_page_wire() and vm_page_allocate() add a comment saying that setting PG_MAPPED does not belong there.
|
100331 |
18-Jul-2002 |
alc |
o Remove the acquisition and release of Giant from the idle priority thread that pre-zeroes free pages. o Remove GIANT_REQUIRED from some low-level page queue functions. (Instead assertions on the page queue lock are being added to the higher-level functions, like vm_page_wire(), etc.)
In collaboration with: peter
|
100326 |
18-Jul-2002 |
markm |
Void functions cannot return values.
|
100309 |
18-Jul-2002 |
peter |
(VM_MAX_KERNEL_ADDRESS - KERNBASE) / PAGE_SIZE may not fit in an integer. Use lmin(long, long), not min(u_int, u_int). This is a problem here on ia64, which has *way* more than 2^32 pages of KVA: 281474976710655 pages, to be precise.
|
100276 |
18-Jul-2002 |
alc |
o Introduce an argument, VM_ALLOC_WIRED, that requests vm_page_alloc() to return a wired page. o Use VM_ALLOC_WIRED within Alpha's pmap_growkernel(). Also, because Alpha's pmap_growkernel() calls vm_page_alloc() from within a critical section, specify VM_ALLOC_INTERRUPT instead of VM_ALLOC_SYSTEM. (Only VM_ALLOC_INTERRUPT is implemented entirely with a spin mutex.) o Assert that the page queues mutex is held in vm_page_wire() on Alpha, just like the other platforms.
|
100193 |
16-Jul-2002 |
alc |
o Use vm_pageq_remove_nowakeup() and vm_pageq_enqueue() in vm_page_zero_idle() instead of partially duplicated implementations. In particular, this change guarantees that the number of free pages in the free queue(s) matches the global free page count when Giant is released.
Submitted by: peter (via his p4 "pmap" branch)
|
100031 |
15-Jul-2002 |
alc |
o Create vm_contig_launder() to replace code that appears twice in contigmalloc1().
|
100005 |
14-Jul-2002 |
alc |
o Lock page queue accesses by vm_page_wire() that aren't within a critical section. o Assert that the page queues lock is held in vm_page_wire() unless on Alpha.
|
99985 |
14-Jul-2002 |
alc |
o Lock page queue accesses by vm_page_wire().
|
99934 |
13-Jul-2002 |
alc |
o Lock page queue accesses by vm_page_unmanage(). o Assert that the page queues lock is held in vm_page_unmanage().
|
99927 |
13-Jul-2002 |
alc |
o Complete the locking of page queue accesses by vm_page_unwire(). o Assert that the page queues lock is held in vm_page_unwire(). o Make vm_page_lock_queues() and vm_page_unlock_queues() visible to kernel loadable modules.
|
99920 |
13-Jul-2002 |
alc |
o Lock some page queue accesses, in particular, those by vm_page_unwire().
|
99893 |
12-Jul-2002 |
alc |
o Assert GIANT_REQUIRED on system maps in _vm_map_lock(), _vm_map_lock_read(), and _vm_map_trylock(). Submitted by: tegge o Remove GIANT_REQUIRED from kmem_alloc_wait() and kmem_free_wakeup(). (This clears the way for exec_map accesses to move outside of Giant. The exec_map is not a system map.) o Remove some premature MPSAFE comments.
Reviewed by: tegge
|
99890 |
12-Jul-2002 |
dillon |
Re-enable the idle page-zeroing code. Remove all IPIs from the idle page-zeroing code as well as from the general page-zeroing code and use a lazy tlb page invalidation scheme based on a callback made at the end of mi_switch.
A number of people came up with this idea at the same time so credit belongs to Peter, John, and Jake as well.
Two-way SMP buildworld -j 5 tests (second run, after stabilization):
    2282.76 real  2515.17 user  704.22 sys    before peter's IPI commit
    2266.69 real  2467.50 user  633.77 sys    after peter's commit
    2232.80 real  2468.99 user  615.89 sys    after this commit
Reviewed by: peter, jhb Approved by: peter
|
99851 |
12-Jul-2002 |
peter |
Avoid a vm_page_lookup() - that uses a spinlock protected hash. We can just use the object's memq for our nefarious purposes.
|
99850 |
12-Jul-2002 |
alc |
o Lock some (unfortunately, not yet all) accesses to the page queues.
|
99849 |
12-Jul-2002 |
alc |
o Lock accesses to the page queues.
|
99754 |
11-Jul-2002 |
alc |
o Add a "needs wakeup" flag to the vm_map for use by kmem_alloc_wait() and kmem_free_wakeup(). Previously, kmem_free_wakeup() always called wakeup(). In general, no one was sleeping. o Export vm_map_unlock_and_wait() and vm_map_wakeup() from vm_map.c for use in vm_kern.c.
|
99683 |
09-Jul-2002 |
alc |
o Lock accesses to the page queues in vm_object_terminate(). o Eliminate some unnecessary 64-bit arithmetic in vm_object_split().
|
99625 |
08-Jul-2002 |
peter |
vm_page_queue_free_mtx is a spin mutex, not a normal sleep mutex. I do not know why this didn't panic my box, but I have most certainly been using it:
    peter@overcee[3:14pm]~src/sys/i386/i386-110> sysctl -a | grep zero
    vm.stats.misc.zero_page_count: 2235
    vm.stats.misc.cnt_prezero: 638951
    vm.idlezero_enable: 1
    vm.idlezero_maxrun: 16
Submitted by: Tor.Egge@cvsup.no.freebsd.org Approved by: Tor's patches are never wrong. :-)
|
99624 |
08-Jul-2002 |
peter |
Turn the zeroidle process off for SMP systems; there is still a possible TLB problem when bouncing from one cpu to another (the original cpu will not have purged its TLB if it simply went idle).
Pointed out by: Tor.Egge@cvsup.no.freebsd.org Approved by: Tor is never wrong. :-)
|
99571 |
08-Jul-2002 |
peter |
Add a special page zero entry point intended to be called via the single threaded VM pagezero kthread outside of Giant. For some platforms, this is really easy since it can just use the direct mapped region. For others, IPI sending is involved or there are other issues, so grab Giant when needed.
We still have preemption issues to deal with, but Alan Cox has an interesting suggestion on how to minimize the problem on x86.
Use Luigi's hack for preserving the (lack of) priority.
Turn the idle zeroing back on since it can now actually do something useful outside of Giant in many cases.
|
99563 |
08-Jul-2002 |
peter |
Avoid vm_page_lookup() [grabs a spinlock] and just process the upage object memq instead.
Suggested by: alc
|
99559 |
07-Jul-2002 |
peter |
Collect all the (now equivalent) pmap_new_proc/pmap_dispose_proc/ pmap_swapin_proc/pmap_swapout_proc functions from the MD pmap code and use a single equivalent MI version. There are other cleanups needed still.
While here, use the UMA zone hooks to keep a cache of preinitialized proc structures handy, just like the thread system does. This eliminates one dependency on 'struct proc' being persistent even after being freed. There are some comments about things that can be factored out into ctor/dtor functions if it is worth it. For now they are mostly just doing statistics to get a feel of how it is working.
|
99545 |
07-Jul-2002 |
alc |
o Lock accesses to the free queue(s) in vm_page_zero_idle().
|
99514 |
07-Jul-2002 |
alc |
o Traverse the object's memq rather than repeatedly calling vm_page_lookup() in vm_object_split().
|
99509 |
06-Jul-2002 |
jeff |
- Hold a lock on the vnode acquired from the file table across the call to vm_mmap() as well as the GETATTR etc. - If the handle is a vnode in vm_mmap() assert that it is locked. - Wiggle Giant around a little to account for the extra vnode operation.
|
99476 |
05-Jul-2002 |
gallatin |
Remove bogus vm_page_wakeup() in vm_page_cowfault() that will cause panics in the zero-copy send path if a process attempts to write to a page which is still in flight.
reviewed by: ken
|
99472 |
05-Jul-2002 |
jeff |
Fix a lock order reversal in uma_zdestroy. The uma_mtx needs to be held across calls to zone_drain().
Noticed by: scottl
|
99427 |
05-Jul-2002 |
alc |
o Lock accesses to the free page queues in contigmalloc1().
|
99424 |
05-Jul-2002 |
jeff |
Remove unnecessary includes.
|
99416 |
04-Jul-2002 |
alc |
o Resurrect vm_page_lock_queues(), vm_page_unlock_queues(), and the free queue lock (revision 1.33 of vm/vm_page.c removed them). o Make the free queue lock a spin lock because it's sometimes acquired inside of a critical section.
|
99408 |
04-Jul-2002 |
julian |
A small cleanup.
|
99407 |
04-Jul-2002 |
julian |
Don't call the thread setup routines from here; they are already called when uma calls thread_init().
|
99374 |
03-Jul-2002 |
alc |
o Make the reservation of KVA space for kernel map entries a function of the KVA space's size in addition to the amount of physical memory and reduce it by a factor of two.
Under the old formula, our reservation amounted to one kernel map entry per virtual page in the KVA space on a 4GB i386.
|
99320 |
03-Jul-2002 |
jeff |
Actually use the fini callback.
Pointy hat to: me :-( Noticed By: Julian
|
99211 |
01-Jul-2002 |
robert |
- Use (OFF_TO_IDX(off) - pi) instead of (OFF_TO_IDX(off - IDX_TO_OFF(pi))). - Reformat a comment.
|
99196 |
01-Jul-2002 |
alc |
o Remove some long dead code: from revision 1.41 of vm/vm_pager.c 3+ years ago. o Remove some unused prototypes.
|
99093 |
29-Jun-2002 |
iedowse |
Change the type of `tscan' in vm_object_page_clean() to vm_pindex_t, as it stores an absolute page index that may not fit in a vm_offset_t.
|
99072 |
29-Jun-2002 |
julian |
Part 1 of KSE-III
The ability to schedule multiple threads per process (on one cpu) by making ALL system calls optionally asynchronous. To come: ia64 and power-pc patches, patches for gdb, test program (in tools)
Reviewed by: Almost everyone who counts (at various times, peter, jhb, matt, alfred, mini, bernd, and a cast of thousands)
NOTE: this is still Beta code, and contains lots of debugging stuff. Expect slight instability in signals.
|
98892 |
26-Jun-2002 |
iedowse |
Avoid using the 64-bit vm_pindex_t in a few places where 64-bit types are not required, as the overhead is unnecessary:
o In the i386 pmap_protect(), `sindex' and `eindex' represent page indices within the 32-bit virtual address space. o In swp_pager_meta_build() and swp_pager_meta_ctl(), use a temporary variable to store the low few bits of a vm_pindex_t that gets used as an array index. o vm_uiomove() uses `osize' and `idx' for page offsets within a map entry. o In vm_object_split(), `idx' is a page offset within a map entry.
|
98891 |
26-Jun-2002 |
iedowse |
Use an explicit cast to avoid relying on sign extension to do the right thing in code such as `vm_pindex_t x = ~SWAP_META_MASK'.
Reviewed by: dillon
|
98849 |
26-Jun-2002 |
ken |
At long last, commit the zero copy sockets code.
MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes.
ti.4: Update the ti(4) man page to include information on the TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options, and also include information about the new character device interface and the associated ioctls.
man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated links.
jumbo.9: New man page describing the jumbo buffer allocator interface and operation.
zero_copy.9: New man page describing the general characteristics of the zero copy send and receive code, and what an application author should do to take advantage of the zero copy functionality.
NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS, TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.
conf/files: Add uipc_jumbo.c and uipc_cow.c.
conf/options: Add the 5 options mentioned above.
kern_subr.c: Receive side zero copy implementation. This takes "disposable" pages attached to an mbuf, gives them to a user process, and then recycles the user's page. This is only active when ZERO_COPY_SOCKETS is turned on and the kern.ipc.zero_copy.receive sysctl variable is set to 1.
uipc_cow.c: Send side zero copy functions. Takes a page written by the user and maps it copy on write and assigns it kernel virtual address space. Removes copy on write mapping once the buffer has been freed by the network stack.
uipc_jumbo.c: Jumbo disposable page allocator code. This allocates (optionally) disposable pages for network drivers that want to give the user the option of doing zero copy receive.
uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are enabled if ZERO_COPY_SOCKETS is turned on.
Add zero copy send support to sosend() -- pages get mapped into the kernel instead of getting copied if they meet size and alignment restrictions.
uipc_syscalls.c:Un-staticize some of the sf* functions so that they can be used elsewhere. (uipc_cow.c)
if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid calling malloc() with M_WAITOK. Return an error if the M_NOWAIT malloc fails.
The ti(4) driver and the wi(4) driver, at least, call this with a mutex held. This causes witness warnings for 'ifconfig -a' with a wi(4) or ti(4) board in the system. (I've only verified for ti(4)).
ip_output.c: Fragment large datagrams so that each segment contains a multiple of PAGE_SIZE amount of data plus headers. This allows the receiver to potentially do page flipping on receives.
if_ti.c: Add zero copy receive support to the ti(4) driver. If TI_PRIVATE_JUMBOS is not defined, it now uses the jumbo(9) buffer allocator for jumbo receive buffers.
Add a new character device interface for the ti(4) driver for the new debugging interface. This allows (a patched version of) gdb to talk to the Tigon board and debug the firmware. There are also a few additional debugging ioctls available through this interface.
Add header splitting support to the ti(4) driver.
Tweak some of the default interrupt coalescing parameters to more useful defaults.
Add hooks for supporting transmit flow control, but leave it turned off with a comment describing why it is turned off.
if_tireg.h: Change the firmware rev to 12.4.11, since we're really at 12.4.11 plus fixes from 12.4.13.
Add defines needed for debugging.
Remove the ti_stats structure, it is now defined in sys/tiio.h.
ti_fw.h: 12.4.11 firmware.
ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13, and my header splitting patches. Revision 12.4.13 doesn't handle 10/100 negotiation properly. (This firmware is the same as what was in the tree previously, with the addition of header splitting support.)
sys/jumbo.h: Jumbo buffer allocator interface.
sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to indicate that the payload buffer can be thrown away / flipped to a userland process.
socketvar.h: Add prototype for socow_setup.
tiio.h: ioctl interface to the character portion of the ti(4) driver, plus associated structure/type definitions.
uio.h: Change prototype for uiomoveco() so that we'll know whether the source page is disposable.
ufs_readwrite.c:Update for new prototype of uiomoveco().
vm_fault.c: In vm_fault(), check to see whether we need to do a page based copy on write fault.
vm_object.c: Add a new function, vm_object_allocate_wait(). This does the same thing that vm_object_allocate() does, except that it gives the caller the opportunity to specify whether it should wait on the uma_zalloc() of the object structure.
This allows vm objects to be allocated while holding a mutex. (Without generating WITNESS warnings.)
vm_object_allocate() is implemented as a call to vm_object_allocate_wait() with the malloc flag set to M_WAITOK.
vm_object.h: Add prototype for vm_object_allocate_wait().
vm_page.c: Add page-based copy on write setup, clear and fault routines.
vm_page.h: Add page based COW function prototypes and variable in the vm_page structure.
Many thanks to Drew Gallatin, who wrote the zero copy send and receive code, and to all the other folks who have tested and reviewed this code over the years.
|
98848 |
26-Jun-2002 |
dillon |
Enforce RLIMIT_VMEM on growable mappings (aka the primary stack or any MAP_STACK mapping).
Suggested by: alc
|
98833 |
26-Jun-2002 |
dillon |
Part I of RLIMIT_VMEM implementation. Implement core functionality for a new resource limit that covers a process's entire VM space, including mmap()'d space.
(Part II will be additional code to check RLIMIT_VMEM during exec() but it needs more fleshing out).
PR: kern/18209 Submitted by: Andrey Alekseyev <uitm@zenon.net>, Dmitry Kim <jason@nichego.net> MFC after: 7 days
|
98824 |
25-Jun-2002 |
iedowse |
Complete the initial set of VM changes required to support full 64-bit file sizes. This step simply addresses the remaining overflows, and does not attempt to optimise performance. The details are:
o Use a 64-bit type for the vm_object `size' and the size argument to vm_object_allocate(). o Use the correct type for index variables in dev_pager_getpages(), vm_object_page_clean() and vm_object_page_remove(). o Avoid an overflow in the i386 pmap_object_init_pt().
|
98823 |
25-Jun-2002 |
jeff |
Turn VM_ALLOC_ZERO into a flag.
Submitted by: tegge Reviewed by: dillon
|
98822 |
25-Jun-2002 |
jeff |
Reduce the amount of code that runs with the zone lock held in slab_zalloc(). This allows us to run the zone initialization functions without any locks held.
|
98818 |
25-Jun-2002 |
alc |
o Eliminate vmspace::vm_minsaddr. It's initialized but never used. o Replace stale comments in vmspace by "const until freed" annotations on some fields.
|
98686 |
23-Jun-2002 |
alc |
o Remove GIANT_REQUIRED from kmem_alloc_pageable(), kmem_alloc_nofault(), and kmem_free(). (Annotate as MPSAFE.) o Remove incorrect casts from kmem_alloc_pageable() and kmem_alloc_nofault().
|
98656 |
23-Jun-2002 |
alc |
o Remove the unnecessary acquisition and release of Giant around fdrop() in mmap(2).
|
98632 |
22-Jun-2002 |
alc |
o Reduce the scope of Giant in vm_mmap() to just the code that manipulates a vnode. (Thus, MAP_ANON and MAP_STACK never acquire Giant.)
|
98630 |
22-Jun-2002 |
alc |
o Replace mtx_assert(&Giant, MA_OWNED) in dev_pager_alloc() with the acquisition and release of Giant. (Annotate as MPSAFE.) o Reorder the sanity checks in dev_pager_alloc() to reduce the time that Giant is held.
|
98624 |
22-Jun-2002 |
alc |
o In vm_map_insert(), replace GIANT_REQUIRED by the acquisition and release of Giant around the direct manipulation of the vm_object and the optional call to pmap_object_init_pt(). o In vm_map_findspace(), remove GIANT_REQUIRED. Instead, acquire and release Giant around the occasional call to pmap_growkernel(). o In vm_map_find(), remove GIANT_REQUIRED.
|
98607 |
22-Jun-2002 |
alc |
o Replace GIANT_REQUIRED in swap_pager_alloc() by the acquisition and release of Giant. (Annotate as MPSAFE.)
|
98605 |
22-Jun-2002 |
alc |
o Remove GIANT_REQUIRED from phys_pager_alloc(). If handle isn't NULL, acquire and release Giant. If handle is NULL, Giant isn't needed. o Annotate phys_pager_alloc() and phys_pager_dealloc() as MPSAFE.
|
98604 |
22-Jun-2002 |
alc |
o Replace GIANT_REQUIRED in vnode_pager_alloc() by the acquisition and release of Giant. (Annotate as MPSAFE.) o Also, in vnode_pager_alloc(), remove an unnecessary re-initialization of struct vm_object::flags and move a statement that is duplicated in both branches of an if-else.
|
98600 |
22-Jun-2002 |
alc |
o Remove GIANT_REQUIRED from vslock(). o Annotate kernacc(), useracc(), and vslock() as MPSAFE.
Motivated by: alfred
|
98541 |
21-Jun-2002 |
alc |
o Remove GIANT_REQUIRED from vm_map_stack().
|
98538 |
21-Jun-2002 |
alc |
o Remove GIANT_REQUIRED from vm_pager_allocate() and vm_pager_deallocate().
|
98498 |
20-Jun-2002 |
alc |
o Remove an incorrect cast from obreak(). This cast would, for example, break an sbrk(>=4GB) on 64-bit architectures even if the resource limit allowed it. o Correct an off-by-one error. o Correct a spelling error in a comment. o Reorder an && expression so that the commonly FALSE expression comes first.
Submitted by: bde (bullets 1 and 2)
|
98460 |
20-Jun-2002 |
alc |
o Acquire and release the vm_map lock instead of Giant in obreak(). Consequently, use vm_map_insert() and vm_map_delete(), which expect the vm_map to be locked, instead of vm_map_find() and vm_map_remove(), which do not.
|
98455 |
19-Jun-2002 |
jeff |
- Move the computation of pflags out of the page allocation loop in kmem_malloc() - zero fill pages if PG_ZERO bit is not set after allocation in kmem_malloc()
Suggested by: alc, jake
|
98451 |
19-Jun-2002 |
jeff |
- Remove bogus use of kmem_alloc that was inherited from the old zone allocator. - Properly set M_ZERO when talking to the back end page allocators for non malloc zones. This forces us to zero fill pages when they are first brought into a cache. - Properly handle M_ZERO in uma_zalloc_internal. This fixes a problem where per cpu buckets weren't always getting zeroed.
|
98450 |
19-Jun-2002 |
jeff |
Teach kmem_malloc about M_ZERO.
|
98414 |
19-Jun-2002 |
alc |
o Replace GIANT_REQUIRED in vm_object_coalesce() by the acquisition and release of Giant. o Reduce the scope of GIANT_REQUIRED in vm_map_insert().
These changes will enable us to remove the acquisition and release of Giant from obreak().
|
98397 |
18-Jun-2002 |
alc |
o Remove LK_CANRECURSE from the vm_map lock.
|
98362 |
17-Jun-2002 |
jeff |
Honor the BUCKETCACHE flag on free as well.
|
98361 |
17-Jun-2002 |
jeff |
- Introduce the new M_NOVM option which tells uma to only check the currently allocated slabs and bucket caches for free items. It will not go ask the vm for pages. This differs from M_NOWAIT in that it not only doesn't block, it doesn't even ask.
- Add a new zcreate option ZONE_VM, that sets the BUCKETCACHE zflag. This tells uma that it should only allocate buckets out of the bucket cache, and not from the VM. It does this by using the M_NOVM option to zalloc when getting a new bucket. This is so that the VM doesn't recursively enter itself while trying to allocate buckets for vm_map_entry zones. If there are already allocated buckets when we get here we'll still use them but otherwise we'll skip it.
- Use the ZONE_VM flag on vm map entries and pv entries on x86.
|
98343 |
17-Jun-2002 |
alc |
o Acquire and release Giant in vm_map_wakeup() to prevent a lost wakeup().
Reviewed by: tegge
|
98304 |
16-Jun-2002 |
alc |
o Remove GIANT_REQUIRED from vm_fault_user_wire(). o Move pmap_pageable() outside of Giant in vm_fault_unwire(). (pmap_pageable() is a no-op on all supported architectures.) o Remove the acquisition and release of Giant from mlock().
|
98263 |
15-Jun-2002 |
alc |
o Remove GIANT_REQUIRED from useracc() and vsunlock(). Neither vm_map_check_protection() nor vm_map_unwire() expect Giant to be held.
|
98240 |
15-Jun-2002 |
alc |
o Remove the acquisition and release of Giant from munlock().
Reviewed by: tegge
|
98226 |
14-Jun-2002 |
alc |
o Use vm_map_wire() and vm_map_unwire() in place of vm_map_pageable() and vm_map_user_pageable(). o Remove vm_map_pageable() and vm_map_user_pageable(). o Remove vm_map_clear_recursive() and vm_map_set_recursive(). (They were only used by vm_map_pageable() and vm_map_user_pageable().)
Reviewed by: tegge
|
98142 |
12-Jun-2002 |
alc |
o Acquire and release Giant in vm_map_unlock_and_wait().
Submitted by: tegge
|
98119 |
11-Jun-2002 |
alc |
o Properly handle a failure by vm_fault_wire() or vm_fault_user_wire() in vm_map_wire(). o Make two white-space changes in vm_map_wire().
Reviewed by: tegge
|
98109 |
11-Jun-2002 |
alc |
o Teach vm_map_delete() to respect the "in-transition" flag on a vm_map_entry by sleeping until the flag is cleared.
Submitted by: tegge
|
98083 |
10-Jun-2002 |
alc |
o In vm_map_entry_create(), call uma_zalloc() with M_NOWAIT on system maps. Submitted by: tegge o Eliminate the "!mapentzone" check from vm_map_entry_create() and vm_map_entry_dispose(). Reviewed by: tegge o Fix white-space usage in vm_map_entry_create().
|
98075 |
10-Jun-2002 |
iedowse |
Correct the logic for determining whether the per-CPU locks need to be destroyed. This fixes a problem where destroying a UMA zone would fail to destroy all zone mutexes.
Reviewed by: jeff
|
98071 |
09-Jun-2002 |
alc |
o Add vm_map_wire() for wiring contiguous regions of either kernel or user vm_maps. This implementation has two key benefits when compared to vm_map_{user_,}pageable(): (1) it avoids a race condition through the use of "in-transition" vm_map entries and (2) it eliminates lock recursion on the vm_map.
Note: there is still an error case that requires clean up.
Reviewed by: tegge
|
98052 |
08-Jun-2002 |
alc |
o Simplify vm_map_unwire() by merging the second and third passes over the caller-specified region.
|
98036 |
08-Jun-2002 |
alc |
o Remove an unnecessary call to vm_map_wakeup() from vm_map_unwire(). o Add a stub for vm_map_wire().
Note: the description of the previous commit had an error. The in- transition flag actually blocks the deallocation of a vm_map_entry by vm_map_delete() and vm_map_simplify_entry().
|
98022 |
07-Jun-2002 |
alc |
o Add vm_map_unwire() for unwiring contiguous regions of either kernel or user vm_maps. In accordance with the standards for munlock(2), and in contrast to vm_map_user_pageable(), this implementation does not allow holes in the specified region. This implementation uses the "in transition" flag described below. o Introduce a new flag, "in transition," to the vm_map_entry. Eventually, vm_map_delete() and vm_map_simplify_entry() will respect this flag by deallocating in-transition vm_map_entrys, allowing the vm_map lock to be safely released in vm_map_unwire() and (the forthcoming) vm_map_wire(). o Modify vm_map_simplify_entry() to respect the in-transition flag.
In collaboration with: tegge
|
97947 |
06-Jun-2002 |
alfred |
fix typo in _SYS_SYSPROTO_H_ case: s/mlockall_args/munlockall_args
Submitted by: Mark Santcroos <marks@ripe.net>
|
97787 |
03-Jun-2002 |
jeff |
Add a comment describing a resource leak that occurs during a failure case in obj_alloc.
|
97753 |
02-Jun-2002 |
alc |
o Migrate vm_map_split() from vm_map.c to vm_object.c, renaming it to vm_object_split(). Its interface should still be changed to resemble vm_object_shadow().
|
97747 |
02-Jun-2002 |
alc |
o Style fixes to vm_map_split(), including the elimination of one variable declaration that shadows another.
Note: This function should really be vm_object_split(), not vm_map_split().
Reviewed by: md5
|
97729 |
02-Jun-2002 |
alc |
o Condition vm_object_pmap_copy_1()'s compilation on the kernel option ENABLE_VFS_IOOPT. Unless this option is in effect, vm_object_pmap_copy_1() is not used.
|
97727 |
01-Jun-2002 |
alc |
o Remove GIANT_REQUIRED from vm_map_zfini(), vm_map_zinit(), vm_map_create(), and vm_map_submap(). o Make further use of a local variable in vm_map_entry_splay() that caches a reference to one of a vm_map_entry's children. (This reduces code size somewhat.) o Revert a part of revision 1.66, deinlining vmspace_pmap(). (This function is MPSAFE.)
|
97710 |
01-Jun-2002 |
alc |
o Revert a part of revision 1.66, contrary to what that commit message says, deinlining vm_map_entry_behavior() and vm_map_entry_set_behavior() actually increases the kernel's size. o Make vm_map_entry_set_behavior() static and add a comment describing its purpose. o Remove an unnecessary initialization statement from vm_map_entry_splay().
|
97654 |
31-May-2002 |
des |
Export nswapdev through sysctl(8).
Sponsored by: DARPA, NAI Labs
|
97648 |
31-May-2002 |
alc |
Further work on pushing Giant out of the vm_map layer and down into the vm_object layer: o Acquire and release Giant in vm_object_shadow() and vm_object_page_remove(). o Remove the GIANT_REQUIRED assertion preceding vm_map_delete()'s call to vm_object_page_remove(). o Remove the acquisition and release of Giant around vm_map_lookup()'s call to vm_object_shadow().
|
97556 |
30-May-2002 |
alfred |
Check for defined(__i386__) instead of just defined(i386) since the compiler will be updated to only define __i386__ for ANSI cleanliness.
|
97453 |
29-May-2002 |
peter |
The kernel printf does not have %i
|
97359 |
27-May-2002 |
alc |
o Remove unused #defines.
|
97294 |
26-May-2002 |
alc |
o Acquire and release Giant around pmap operations in vm_fault_unwire() and vm_map_delete(). Assert GIANT_REQUIRED in vm_map_delete() only if operating on the kernel_object or the kmem_object. o Remove GIANT_REQUIRED from vm_map_remove(). o Remove the acquisition and release of Giant from munmap().
|
97198 |
24-May-2002 |
alc |
o Replace the vm_map's hint by the root of a splay tree. By design, the last accessed datum is moved to the root of the splay tree. Therefore, on lookups in which the hint resulted in O(1) access, the splay tree still achieves O(1) access. In contrast, on lookups in which the hint failed miserably, the splay tree achieves amortized logarithmic complexity, resulting in dramatic improvements on vm_maps with a large number of entries. For example, the execution time for replaying an access log from www.cs.rice.edu against the thttpd web server was reduced by 23.5% due to the large number of files simultaneously mmap()ed by this server. (The machine in question has enough memory to cache most of this workload.)
Nothing comes for free: At present, I see a 0.2% slowdown on "buildworld" due to the overhead of maintaining the splay tree. I believe that some or all of this can be eliminated through optimizations to the code.
Developed in collaboration with: Juan E Navarro <jnavarro@cs.rice.edu> Reviewed by: jeff
|
97088 |
22-May-2002 |
alc |
o Make contigmalloc1() static.
|
97007 |
20-May-2002 |
jhb |
In uma_zalloc_arg(), if we are performing a M_WAITOK allocation, ensure that td_intr_nesting_level is 0 (like malloc() does). Since malloc() calls uma we can probably remove the check in malloc() for this now. Also, perform an extra witness check in that case to make sure we don't hold any locks when performing a M_WAITOK allocation.
|
96875 |
18-May-2002 |
alc |
o Eliminate the acquisition and release of Giant from minherit(2). (vm_map_inherit() no longer requires Giant to be held.)
|
96839 |
18-May-2002 |
alc |
o Remove GIANT_REQUIRED from vm_map_madvise(). Instead, acquire and release Giant around vm_map_madvise()'s call to pmap_object_init_pt(). o Replace GIANT_REQUIRED in vm_object_madvise() with the acquisition and release of Giant. o Remove the acquisition and release of Giant from madvise().
|
96832 |
18-May-2002 |
alc |
o Remove the acquisition and release of Giant from mprotect().
|
96755 |
16-May-2002 |
trhodes |
More s/file system/filesystem/g
|
96572 |
14-May-2002 |
phk |
Make daddr_t and u_daddr_t 64bits wide. Retire daddr64_t and use daddr_t instead.
Sponsored by: DARPA & NAI Labs.
|
96496 |
13-May-2002 |
jeff |
Don't call the uz free function while the zone lock is held. This can lead to lock order reversals. uma_reclaim now builds a list of freeable slabs and then unlocks the zones to do all of the frees.
|
96493 |
13-May-2002 |
jeff |
Remove the hash_free() lock order reversal. This could have happened for several reasons before. Fixing it involved restructuring the generic hash code to require calling code to handle locking, unlocking, and freeing hashes on error conditions.
|
96469 |
12-May-2002 |
alc |
o Remove GIANT_REQUIRED and an excessive number of blank lines from vm_map_inherit(). (minherit() need not acquire Giant anymore.)
|
96441 |
12-May-2002 |
alc |
o Acquire and release Giant in vm_object_reference() and vm_object_deallocate(), replacing the assertion GIANT_REQUIRED. o Remove GIANT_REQUIRED from vm_map_protect() and vm_map_simplify_entry(). o Acquire and release Giant around vm_map_protect()'s call to pmap_protect().
Altogether, these changes eliminate the need for mprotect() to acquire and release Giant.
|
96096 |
06-May-2002 |
alc |
o Header files shouldn't depend on options: Provide prototypes for uiomoveco(), uioread(), and vm_uiomove() regardless of whether ENABLE_VFS_IOOPT is defined or not.
Submitted by: bde
|
96095 |
06-May-2002 |
alc |
o Condition the compilation and use of vm_freeze_copyopts() on ENABLE_VFS_IOOPT.
|
96091 |
06-May-2002 |
alc |
o Some improvements to the page coloring of vm objects, particularly, for shadow objects.
Submitted by: bde
|
96087 |
06-May-2002 |
alc |
o Move vm_freeze_copyopts() from vm_map.{c,h} to vm_object.{c,h}. It's plainly an operation on a vm_object and belongs in the latter place.
|
96080 |
05-May-2002 |
alc |
o Condition the compilation of uiomoveco() and vm_uiomove() on ENABLE_VFS_IOOPT. o Add a comment to the effect that this code is experimental support for zero-copy I/O.
|
96073 |
05-May-2002 |
phk |
Expand the one-line function pbreassignbuf() in the only place it is or could be used.
|
96056 |
05-May-2002 |
alc |
o Remove GIANT_REQUIRED from vm_map_lookup() and vm_map_lookup_done(). o Acquire and release Giant around vm_map_lookup()'s call to vm_object_shadow().
|
96044 |
04-May-2002 |
jeff |
Use pages instead of uz_maxpages, which has not been initialized yet, when creating the vm_object. This was broken after the code was rearranged to grab giant itself.
Spotted by: alc
|
96042 |
04-May-2002 |
alc |
o Make _vm_object_allocate() and vm_object_allocate() callable without holding Giant. o Begin documenting the trivial cases of the locking protocol on vm_object.
|
96007 |
04-May-2002 |
alc |
o Remove GIANT_REQUIRED from vm_map_lookup_entry() and vm_map_check_protection(). o Call vm_map_check_protection() without Giant held in munmap().
|
95942 |
02-May-2002 |
alc |
o Change the implementation of vm_map locking to use exclusive locks exclusively. The interface still, however, distinguishes between a shared lock and an exclusive lock.
|
95931 |
02-May-2002 |
jeff |
Hide a pointer to the malloc_type bucket at the end of the freed memory. If this memory is modified after it has been freed we can now report its previous owner.
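A user-space sketch of the idea (the names dbg_malloc/dbg_free/dbg_owner are illustrative, not the kernel's): the freed block keeps a hidden trailer naming its last owner, so a later scribble can still be attributed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Illustrative sketch only: when a block is "freed", hide a pointer to
 * its owning type descriptor just past the user-visible size.  If the
 * block is written to after free, the trailer still names the previous
 * owner.  None of these names are the kernel's actual identifiers.
 */
struct malloc_type_sketch {
    const char *name;
};

/* Allocate with room for a hidden trailer pointer. */
void *dbg_malloc(size_t size)
{
    return malloc(size + sizeof(struct malloc_type_sketch *));
}

/* On free, record the owner in the hidden trailer. */
void dbg_free(void *mem, size_t size, struct malloc_type_sketch *type)
{
    memcpy((char *)mem + size, &type, sizeof(type));
}

/* Recover the previous owner of a freed block. */
struct malloc_type_sketch *dbg_owner(void *mem, size_t size)
{
    struct malloc_type_sketch *type;
    memcpy(&type, (char *)mem + size, sizeof(type));
    return type;
}
```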
|
95930 |
02-May-2002 |
jeff |
Move around the dbg code a bit so it's always under a lock. This stops a weird potential race if we were preempted right as we were doing the dbg checks.
|
95925 |
02-May-2002 |
arr |
- Changed the size element of uma_zctor_args to be size_t instead of int. - Changed uma_zcreate to accept the size argument as a size_t instead of int.
Approved by: jeff
|
95923 |
02-May-2002 |
jeff |
malloc/free(9) no longer require Giant. Use the malloc_mtx to protect the mallochash. Mallochash is going to go away as soon as I introduce the kfree/kmalloc api and partially overhaul the malloc wrapper. This can't happen until all users of the malloc api that expect memory to be aligned on the size of the allocation are fixed.
|
95901 |
02-May-2002 |
alc |
o Remove dead and lockmgr()-specific debugging code.
|
95899 |
02-May-2002 |
jeff |
Remove the temporary alignment check in free().
Implement the following checks on freed memory in the bucket path: - Slab membership - Alignment - Duplicate free
This previously was only done if we skipped the buckets. This code will slow down INVARIANTS a bit, but it is smp safe. The checks were moved out of the normal path and into hooks supplied in uma_dbg.
|
95823 |
30-Apr-2002 |
alc |
o Convert the vm_page buckets mutex to a spin lock. (This resolves an issue on the Alpha platform found by jeff@.) o Simplify vm_page_lookup().
Reviewed by: jhb
|
95771 |
30-Apr-2002 |
jeff |
Add a new UMA debugging facility. This will overwrite freed memory with 0xdeadc0de and then check for it just before memory is handed off as part of a new request. This will catch any post free/pre alloc modification of memory, as well as introduce errors for anything that tries to dereference it as a pointer.
This code takes the form of special init, fini, ctor and dtor routines that are specifically used by malloc. It is in a separate file because additional debugging aids will want to live here as well.
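A minimal user-space sketch of the trashing scheme, assuming illustrative names rather than the actual uma_dbg routines: freed items are filled with a poison pattern and re-checked just before reuse.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/*
 * Illustrative sketch (not the real uma_dbg code): freed items are
 * overwritten with a poison word and verified again just before the
 * memory is handed out, catching writes between free and realloc.
 */
#define UMA_POISON 0xdeadc0deU

/* The "fini"/"dtor" step: poison the item on free. */
void trash_fini(void *mem, size_t size)
{
    uint32_t *p = mem;
    for (size_t i = 0; i < size / sizeof(uint32_t); i++)
        p[i] = UMA_POISON;
}

/* The "init"/"ctor" step: 0 if the poison survived, -1 if modified. */
int trash_check(const void *mem, size_t size)
{
    const uint32_t *p = mem;
    for (size_t i = 0; i < size / sizeof(uint32_t); i++)
        if (p[i] != UMA_POISON)
            return -1;
    return 0;
}
```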
|
95766 |
30-Apr-2002 |
jeff |
Move the implementation of M_ZERO into UMA so that it can be passed to uma_zalloc and friends. Remove this functionality from the malloc wrapper.
Document this change in uma.h and adjust variable names in uma_core.
|
95764 |
30-Apr-2002 |
alc |
o Revert vm_fault1() to its original name vm_fault(), eliminating the wrapper that took its place for the purposes of acquiring and releasing Giant.
|
95758 |
29-Apr-2002 |
jeff |
Add a new zone flag UMA_ZONE_MTXCLASS. This puts the zone in its own mutex class. Currently this is only used for kmapentzone because kmapents are potentially allocated when freeing memory. This is not dangerous though because no other allocations will be done while holding the kmapentzone lock.
|
95710 |
29-Apr-2002 |
peter |
Tidy up some loose ends. i386/ia64/alpha - catch up to sparc64/ppc: - replace pmap_kernel() with refs to kernel_pmap - change kernel_pmap pointer to (&kernel_pmap_store) (this is a speedup since ld can set these at compile/link time) all platforms (as suggested by jake): - gc unused pmap_reference - gc unused pmap_destroy - gc unused struct pmap.pm_count (we never used pm_count - we track address space sharing at the vmspace)
|
95701 |
29-Apr-2002 |
alc |
Document three synchronization issues in vm_fault().
|
95686 |
28-Apr-2002 |
alc |
Pass the caller's file name and line number to the vm_map locking functions.
|
95610 |
28-Apr-2002 |
alc |
o Introduce and use vm_map_trylock() to replace several direct uses of lockmgr(). o Add missing synchronization to vmspace_swap_count(): Obtain a read lock on the vm_map before traversing it.
|
95598 |
28-Apr-2002 |
peter |
We do not necessarily need to map/unmap pages to zero parts of them. On systems where physical memory is also direct mapped (alpha, sparc, ia64 etc) this is slightly harmful.
|
95589 |
27-Apr-2002 |
alc |
o Begin documenting the (existing) locking protocol on the vm_map in the same style as sys/proc.h. o Undo the de-inlining of several trivial, MPSAFE methods on the vm_map. (Contrary to the commit message for vm_map.h revision 1.66 and vm_map.c revision 1.206, de-inlining these methods increased the kernel's size.)
|
95532 |
26-Apr-2002 |
alc |
o Control access to the vm_page_buckets with a mutex. o Fix some style(9) bugs.
|
95432 |
25-Apr-2002 |
arr |
- Fix a round down bogon in uma_zone_set_max().
Submitted by: jeff@
|
95112 |
20-Apr-2002 |
alc |
Reintroduce locking on accesses to vm_object_list.
|
95021 |
19-Apr-2002 |
alc |
o Move the acquisition of Giant from vm_fault() to the point after initialization in vm_fault1(). o Fix some style problems in vm_fault1().
|
94981 |
18-Apr-2002 |
alc |
Add a comment documenting a race condition in vm_fault(): Specifically, a modification is made to the vm_map while only a read lock is held.
|
94977 |
18-Apr-2002 |
alc |
o Call vm_map_growstack() from vm_fault() if vm_map_lookup() has failed due to conditions that suggest the possible need for stack growth. This has two beneficial effects: (1) we can now remove calls to vm_map_growstack() from the MD trap handlers and (2) simple page faults are faster because we no longer unnecessarily perform vm_map_growstack() on every page fault. o Remove vm_map_growstack() from the i386's trap_pfault(). o Remove the acquisition and release of Giant from i386's trap_pfault(). (vm_fault() still acquires it.)
|
94921 |
17-Apr-2002 |
peter |
Do not free the vmspace until p->p_vmspace is set to null. Otherwise statclock can access it in the tail end of statclock_process() at an unfortunate time. This bit me several times on an SMP alpha (UP2000) and the problem went away with this change. I'm not sure why it doesn't break x86 as well. Maybe it's because the clocks are much faster on alpha (HZ=1024 by default).
|
94912 |
17-Apr-2002 |
alc |
Remove an unused option, VM_FAULT_HOLD, to vm_fault().
|
94777 |
15-Apr-2002 |
peter |
Pass vm_page_t instead of physical addresses to pmap_zero_page[_area]() and pmap_copy_page(). This gets rid of a couple more physical addresses in upper layers, with the eventual aim of supporting PAE and dealing with the physical addressing mostly within pmap. (We will need either 64 bit physical addresses or page indexes, possibly both depending on the circumstances. Leaving this to pmap itself gives more flexibility.)
Reviewed by: jake Tested on: i386, ia64 and (I believe) sparc64. (my alpha was hosed)
|
94653 |
14-Apr-2002 |
jeff |
Fix a witness warning when expanding a hash table. We were allocating the new hash while holding the lock on a zone. Fix this by doing the allocation separately from the actual hash expansion.
The lock is dropped before the allocation and reacquired before the expansion. The expansion code checks to see if we lost the race and frees the new hash if we do. We really never will lose this race because the hash expansion is single threaded via the timeout mechanism.
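The drop-allocate-reacquire-recheck pattern can be sketched as follows (illustrative structure and names; the real zone lock is reduced to no-op macros so the sketch runs stand-alone):

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-ins for the real zone lock; no-ops so the sketch is runnable. */
#define ZONE_LOCK(z)   ((void)(z))
#define ZONE_UNLOCK(z) ((void)(z))

struct zone_sketch {
    void *hash;
    size_t hash_size;
};

/*
 * Sketch of the pattern (not the actual hash_expand code): allocate the
 * new table with the lock dropped, then reacquire and re-check before
 * installing, discarding our table if another thread won the race.
 * Returns 1 if our table was installed, 0 if we lost the race.
 */
int hash_expand_sketch(struct zone_sketch *z, size_t new_size)
{
    size_t old_size;
    void *new_hash;
    int installed = 0;

    ZONE_LOCK(z);
    old_size = z->hash_size;
    ZONE_UNLOCK(z);                    /* never allocate under the lock */

    new_hash = calloc(new_size, sizeof(void *));

    ZONE_LOCK(z);
    if (z->hash_size == old_size) {    /* still the table we observed? */
        free(z->hash);
        z->hash = new_hash;
        z->hash_size = new_size;
        installed = 1;
    }
    ZONE_UNLOCK(z);
    if (!installed)
        free(new_hash);                /* lost the race: free ours */
    return installed;
}
```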
|
94651 |
14-Apr-2002 |
jeff |
Protect the initial list traversal in sysctl_vm_zone() with the uma_mtx.
|
94631 |
14-Apr-2002 |
jeff |
Fix the calculation that determines uz_maxpages. It was off for large zones. Fortunately we have no large zones with maximums specified yet, so it wasn't breaking anything.
Implement blocking when a zone exceeds the maximum and M_WAITOK is specified. Previously this just failed like the old zone allocator did. The old zone allocator didn't support WAITOK/NOWAIT though so we should do what we advertise.
While I was in there I cleaned up some more zalloc logic to further simplify that code path and reduce redundant code. This was needed to make the blocking work properly anyway.
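The M_WAITOK-versus-M_NOWAIT behaviour at the limit can be sketched like this (hypothetical structure, flag values and return codes; the real code sleeps on the zone rather than returning a placeholder):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flags and zone; not the real UMA structures or values. */
#define M_NOWAIT 0x1
#define M_WAITOK 0x2

struct zone_limit_sketch {
    int pages;       /* pages currently allocated to the zone */
    int maxpages;    /* enforced limit (uz_maxpages in UMA) */
    int sleepers;    /* threads blocked waiting for a free item */
};

/*
 * Sketch of the advertised behaviour: at the limit, M_NOWAIT fails
 * immediately while M_WAITOK records a sleeper that would block until
 * an item is freed (the actual sleep/wakeup is elided here).
 */
int zone_alloc_page(struct zone_limit_sketch *z, int flags)
{
    if (z->pages >= z->maxpages) {
        if (flags & M_NOWAIT)
            return -1;               /* old behaviour: just fail */
        z->sleepers++;               /* real code: sleep on the zone */
        return -2;                   /* placeholder for "blocked" */
    }
    z->pages++;
    return 0;
}
```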
|
94329 |
10-Apr-2002 |
jeff |
Remember to unlock the zone if the fill count is too high.
Pointed out by: pete, jake, jhb
|
94240 |
08-Apr-2002 |
jeff |
Quiet witness warnings about acquiring several zone locks. In the case that this happens it is OK.
|
94165 |
08-Apr-2002 |
jeff |
Add a mechanism to disable buckets when the v_free_count drops below v_free_min. This should help performance in memory starved situations.
|
94163 |
08-Apr-2002 |
jeff |
Don't release the zone lock until after the dtor has been called. As far as I can tell this could not have caused any problems yet because UMA is still called with giant.
Pointy hat to: jeff Noticed by: jake
|
94161 |
08-Apr-2002 |
jeff |
Implement uma_zdestroy(). Its prototype changed slightly. I decided that I didn't like the wait argument and that if you were removing a zone it had better be empty.
Also, I broke out part of hash_expand and made a separate hash_free() for use in uma_zdestroy.
|
94159 |
08-Apr-2002 |
jeff |
Rework most of the bucket allocation and free code so that per cpu locks are never held across blocking operations. Also, fix two other lock order reversals that were exposed by jhb's witness change.
The free path previously had a bug that would cause it to skip the free bucket list in some cases and go straight to allocating a new bucket. This has been fixed as well.
These changes made the bucket handling code much cleaner and removed quite a few lock operations. This should be marginally faster now.
It is now possible to call malloc w/o Giant and avoid any witness warnings. This still isn't entirely safe though because malloc_type statistics are not protected by any lock.
|
94157 |
07-Apr-2002 |
jeff |
Spelling correction; s/seperate/separate/g
Submitted by: eric
|
94156 |
07-Apr-2002 |
jeff |
There should be no remaining references to these two files in the tree. If there are, it is an error. vm_zone has been superseded by uma.
|
94155 |
07-Apr-2002 |
jeff |
This fixes a bug where isitem never got set to 1 if a certain chain of events relating to extreme low memory situations occurred. This was only ever seen on the port build cluster, so many thanks to kris for helping me debug this.
Tested by: kris
|
93847 |
05-Apr-2002 |
alc |
o Eliminate the use of grow_stack() and useracc() from sendsig(), osendsig(), and osf1_sendsig(). o Eliminate the prototype for the MD grow_stack() now that it has been removed from all platforms.
|
93823 |
04-Apr-2002 |
dillon |
Embed a struct vmmeter in the per-cpu structure and add a macro, PCPU_LAZY_INC() which increments elements in it for cases where we can afford the occasional inaccuracy. Use of per-cpu stats counters avoids significant cache stalls in various critical paths that would otherwise severely limit our cpu scalability.
Adjust all sysctl's accessing cnt.* elements to now use a procedure which aggregates the requested field for all cpus and for the global vmmeter.
The global vmmeter is retained, since some stats counters, like v_free_min, cannot be made per-cpu. Also, this allows us to convert counters from the global vmmeter to the per-cpu vmmeter in a piecemeal fashion, so have at it!
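The aggregation the adjusted sysctls perform can be sketched as follows (a one-field vmmeter and a fixed cpu count, both illustrative):

```c
#include <assert.h>

#define MAXCPU_SKETCH 4   /* illustrative cpu count */

/* A cut-down vmmeter with one counter; the real struct has many. */
struct vmmeter_sketch {
    unsigned v_some_count;
};

struct vmmeter_sketch pcpu_cnt[MAXCPU_SKETCH];   /* per-cpu copies */
struct vmmeter_sketch global_cnt;                /* non-per-cpu residue */

/*
 * Sketch of the sysctl aggregation: a requested field is summed across
 * every cpu's vmmeter plus the retained global vmmeter before being
 * reported.  Field and function names here are illustrative.
 */
unsigned vcnt_sum(void)
{
    unsigned total = global_cnt.v_some_count;
    for (int i = 0; i < MAXCPU_SKETCH; i++)
        total += pcpu_cnt[i].v_some_count;
    return total;
}
```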
|
93818 |
04-Apr-2002 |
jhb |
Change callers of mtx_init() to pass in an appropriate lock type name. In most cases NULL is passed, but in some cases such as network driver locks (which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.
Tested on: i386, alpha, sparc64
|
93716 |
03-Apr-2002 |
jake |
Fix a long standing 32bit-ism. Don't assume that the size of a chunk of memory in phys_avail will fit in 'int', use vm_size_t. This fixes booting on sparc64 machines with more than 2 gigs of ram.
Thanks to Jan Chrillesen for providing me with access to a 4 gig machine.
|
93697 |
02-Apr-2002 |
alfred |
fix comment typo, s/neccisary/necessary/g
|
93593 |
01-Apr-2002 |
jhb |
Change the suser() API to take advantage of td_ucred as well as do a general cleanup of the API. The entire API now consists of two functions similar to the pre-KSE API. The suser() function takes a thread pointer as its only argument. The td_ucred member of this thread must be valid so the only valid thread pointers are curthread and a few kernel threads such as thread0. The suser_cred() function takes a pointer to a struct ucred as its first argument and an integer flag as its second argument. The flag is currently only used for the PRISON_ROOT flag.
Discussed on: smp@
|
93273 |
27-Mar-2002 |
jeff |
Add a new mtx_init option "MTX_DUPOK" which allows duplicate acquires of locks with this flag. Remove the dup_list and dup_ok code from subr_witness. Now we just check for the flag instead of doing string compares.
Also, switch the process lock, process group lock, and uma per cpu locks over to this interface. The original mechanism did not work well for uma because per cpu lock names are unique to each zone.
Approved by: jhb
|
93194 |
26-Mar-2002 |
alc |
Remove an unused prototype.
|
93089 |
24-Mar-2002 |
jeff |
Reset the cachefree statistics after draining the cache. This fixes a bug where a sysctl within 20 seconds of a cache_drain could yield negative "USED" counts.
Also, grab the uma_mtx while in the sysctl handler. This hadn't caused problems yet because Giant is held all the time.
Reported by: kkenn
|
92758 |
20-Mar-2002 |
jeff |
Add uma_zone_set_max() to add enforced limits to non vm obj backed zones.
|
92748 |
20-Mar-2002 |
jeff |
Remove references to vm_zone.h and switch over to the new uma API.
|
92727 |
19-Mar-2002 |
alfred |
Remove __P.
|
92692 |
19-Mar-2002 |
jeff |
Quiet a warning introduced by UMA. This only occurs on machines where vm_size_t != unsigned long.
Reviewed by: phk
|
92666 |
19-Mar-2002 |
peter |
Fix a gcc-3.1+ warning. warning: deprecated use of label at end of compound statement
ie: you cannot do this anymore: switch(foo) { ....
default: }
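For reference, a minimal example of the accepted form: the warning is avoided by ending the default case with a statement rather than leaving the label last in the block.

```c
#include <assert.h>

/*
 * gcc 3.1+ rejects a label (including "default:") as the last thing in
 * a compound statement, so the case must end with a statement; a break
 * (or an empty ";") suffices.
 */
int classify(int foo)
{
    int r = 0;
    switch (foo) {
    case 1:
        r = 1;
        break;
    default:
        break;      /* was: "default:" followed by "}" -> warning */
    }
    return r;
}
```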
|
92654 |
19-Mar-2002 |
jeff |
This is the first part of the new kernel memory allocator. This replaces malloc(9) and vm_zone with a slab like allocator.
Reviewed by: arch@
|
92588 |
18-Mar-2002 |
green |
Back out the modification of vm_map locks from lockmgr to sx locks. The best path forward now is likely to change the lockmgr locks to simple sleep mutexes, then see if any extra contention it generates is greater than the removed overhead of managing local locking state information, cost of extra calls into lockmgr, etc.
Additionally, making the vm_map lock a mutex and respecting it properly will put us much closer to not needing Giant magic in vm.
|
92511 |
17-Mar-2002 |
alc |
Remove vm_object_count: It's unused, incorrectly maintained and duplicates information maintained by the zone allocator.
|
92475 |
17-Mar-2002 |
alc |
Undo part of revision 1.57: Now that (o)sendsig() doesn't call useracc(), the motivation for saving and restoring the map->hint in useracc() is gone. (The same tests that motivated this change in revision 1.57 now show that there is no performance loss from removing it.) This was really a hack and some day we would have had to add new synchronization here on map->hint to maintain it.
|
92466 |
17-Mar-2002 |
alc |
Acquire a read lock on the map inside of vm_map_check_protection() rather than expecting the caller to do so. This (1) eliminates duplicated code in kernacc() and useracc() and (2) fixes missing synchronization in munmap().
|
92461 |
17-Mar-2002 |
jake |
Convert all pmap_kenter/pmap_kremove pairs in MI code to use pmap_qenter/ pmap_qremove. pmap_kenter is not safe to use in MI code because it is not guaranteed to flush the mapping from the tlb on all cpus. If the process in question is preempted and migrates cpus between the call to pmap_kenter and pmap_kremove, the original cpu will be left with stale mappings in its tlb. This is currently not a problem for i386 because we do not use PG_G on SMP, and thus all mappings are flushed from the tlb on context switches, not just user mappings. This is not the case on all architectures, and if PG_G is to be used with SMP on i386 it will be a problem. This was committed by peter earlier as part of his fine grained tlb shootdown work for i386, which was backed out for other reasons.
Reviewed by: peter
|
92363 |
15-Mar-2002 |
mckusick |
Introduce the new 64-bit size disk block, daddr64_t. Change the bio and buffer structures to have daddr64_t bio_pblkno, b_blkno, and b_lblkno fields which allows access to disks larger than a Terabyte in size. This change also requires that the VOP_BMAP vnode operation accept and return daddr64_t blocks. This delta should not affect system operation in any way. It merely sets up the necessary interfaces to allow the development of disk drivers that work with these larger disk block addresses. It also allows for the development of UFS2 which will use 64-bit block addresses.
|
92256 |
14-Mar-2002 |
green |
Document faultstate.lookup_still_valid more than none.
Requested by: alfred
|
92246 |
13-Mar-2002 |
green |
Rename SI_SUB_MUTEX to SI_SUB_MTX_POOL to make the name at all accurate. While doing this, move it earlier in the sysinit boot process so that the VM system can use it.
After that, the system is now able to use sx locks instead of lockmgr locks in the VM system. To accomplish this, some of the more questionable uses of the locks (such as testing whether they are owned or not, as well as allowing shared+exclusive recursion) are removed, and simpler logic throughout is used so locks should also be easier to understand.
This has been tested on my laptop for months, and has not shown any problems on SMP systems, either, so appears quite safe. One more user of lockmgr down, many more to go :)
|
92029 |
10-Mar-2002 |
eivind |
- Remove a number of extra newlines that do not belong here according to style(9) - Minor space adjustment in cases where we have "( ", " )", if(), return(), while(), for(), etc. - Add /* SYMBOL */ after a few #endifs.
Reviewed by: alc
|
91946 |
09-Mar-2002 |
tegge |
Revert change in revision 1.53 and add a small comment to protect the revived code.
vm pages newly allocated are marked busy (PG_BUSY), thus calling vm_page_delete before the pages have been freed or unbusied will cause a deadlock since vm_page_object_page_remove will wait for the busy flag to be cleared. This can be triggered by calling malloc with size > PAGE_SIZE and the M_NOWAIT flag on systems low on physical free memory.
A kernel module that reproduces the problem, written by Logan Gabriel <logan@mail.2cactus.com>, can be found in the freebsd-hackers mail archive (12 Apr 2001). The problem was recently noticed again by Archie Cobbs <archie@dellroad.org>.
Reviewed by: dillon
|
91777 |
07-Mar-2002 |
dillon |
Fix a bug in the vm_map_clean() procedure. msync()ing an area of memory that has just been mapped MAP_ANON|MAP_NOSYNC and has not yet been accessed will panic the machine.
MFC after: 1 day
|
91724 |
06-Mar-2002 |
dillon |
Add a sequential iteration optimization to vm_object_page_clean(). This moderately improves msync's and VM object flushing for objects containing randomly dirtied pages (fsync(), msync(), filesystem update daemon), and improves cpu use for small-ranged sequential msync()s in the face of very large mmap()ings from O(N) to O(1) as might be performed by a database.
A sysctl, vm.msync_flush_flag, has been added and defaults to 3 (the two committed optimizations are turned on by default). 0 will turn off both optimizations.
This code has already been tested under stable and is one in a series of memq / vp->v_dirtyblkhd / fsync optimizations to remove O(N^2) restart conditions that will be coming down the pipe.
MFC after: 3 days
|
91700 |
05-Mar-2002 |
eivind |
* Move bswlist declaration and initialization from kern/vfs_bio.c to vm/vm_pager.c, which is the only place it is used. * Make the QUEUE_* definitions and bufqueues local to vfs_bio.c. * constify buf_wmesg.
|
91641 |
04-Mar-2002 |
alc |
o Create vm_pageq_enqueue() to encapsulate code that is duplicated time and again in vm_page.c and vm_pageq.c. o Delete unused prototypes. (Mainly a result of the earlier renaming of various functions from vm_page_*() to vm_pageq_*().)
|
91605 |
03-Mar-2002 |
alc |
Call vm_pageq_remove_nowakeup() rather than duplicating it.
|
91569 |
02-Mar-2002 |
alc |
Remove some long dead code.
|
91420 |
27-Feb-2002 |
jhb |
Use thread0.td_ucred instead of proc0.p_ucred. This change is cosmetic and isn't strictly required. However, it lowers the number of false positives found when grep'ing the kernel sources for p_ucred to ensure proper locking.
|
91406 |
27-Feb-2002 |
jhb |
Simple p_ucred -> td_ucred changes to start using the per-thread ucred reference.
|
91403 |
27-Feb-2002 |
silby |
Fix a horribly suboptimal algorithm in the vm_daemon.
In order to determine what to page out, the vm_daemon checks reference bits on all pages belonging to all processes. Unfortunately, the algorithm used reacted badly with shared pages; each shared page would be checked once per process sharing it; this caused an O(N^2) growth of tlb invalidations. The algorithm has been changed so that each page will be checked only 16 times.
Prior to this change, a fork/sleepbomb of 1300 processes could cause the vm_daemon to take over 60 seconds to complete, effectively freezing the system for that time period. With this change in place, the vm_daemon completes in less than a second. Any system with hundreds of processes sharing pages should benefit from this change.
Note that the vm_daemon is only run when the system is under extreme memory pressure. It is likely that many people with loaded systems saw no symptoms of this problem until they reached the point where swapping began.
Special thanks go to dillon, peter, and Chuck Cranor, who helped me get up to speed with vm internals.
PR: 33542, 20393 Reviewed by: dillon MFC after: 1 week
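The per-page cap can be sketched as follows (the constant and field names are illustrative, not the actual kernel identifiers):

```c
#include <assert.h>

#define PAGE_CHECK_LIMIT 16   /* the cap described above; name is illustrative */

/* Per-page bookkeeping: how many times this scan has tested the page. */
struct page_sketch {
    int act_checks;
};

/*
 * Sketch of the fix: with N processes sharing a page, the old scan
 * tested the page's reference bits N times per pass (O(N^2) tlb
 * invalidations overall); the new one stops after a fixed limit.
 * Returns 1 if the page should still be examined for this process.
 */
int page_should_check(struct page_sketch *p)
{
    if (p->act_checks >= PAGE_CHECK_LIMIT)
        return 0;
    p->act_checks++;
    return 1;
}
```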
|
91367 |
27-Feb-2002 |
peter |
Back out all the pmap related stuff I've touched over the last few days. There is some unresolved badness that has been eluding me, particularly affecting uniprocessor kernels. Turning off PG_G helped (which is a bad sign) but didn't solve it entirely. Userland programs still crashed.
|
91344 |
27-Feb-2002 |
peter |
Jake further reduced IPI shootdowns on sparc64 in loops by using ranged shootdowns in a couple of key places. Do the same for i386. This also hides some physical addresses from higher levels and has it use the generic vm_page_t's instead. This will help for PAE down the road.
Obtained from: jake (MI code, suggestions for MD part)
|
91263 |
26-Feb-2002 |
peter |
Remove unused variable (td)
|
91063 |
22-Feb-2002 |
phk |
GC: BIO_ORDERED, various infrastructure dealing with BIO_ORDERED.
|
90944 |
19-Feb-2002 |
tegge |
Add a page queue, PQ_HOLD, that temporarily owns pages with nonzero hold count that would otherwise be on one of the free queues. This eliminates a panic when broken programs unmap memory that still has pending IO from raw devices.
Reviewed by: dillon, alc
|
90937 |
19-Feb-2002 |
silby |
Add one more comment to the OOM changes so that future readers of the code may better understand the code.
Suggested by: dillon MFC after: 1 week
|
90935 |
19-Feb-2002 |
silby |
Changes to make the OOM killer much more effective:
- Allow the OOM killer to target processes currently locked in memory. These very often are the ones doing the memory hogging. - Drop the wakeup priority of processes currently sleeping while waiting for their page fault to complete. In order for the OOM killer to work well, the killed process and other system processes waiting on memory must be allowed to wakeup first.
Reviewed by: dillon MFC after: 1 week
|
90702 |
15-Feb-2002 |
bde |
Garbage-collect options ACPI_NO_ENABLE_ON_BOOT, AML_DEBUG, BLEED, DEVICE_SYSCTLS, KEY, LOUTB, NFS_MUIDHASHSIZ, NFS_UIDHASHSIZ, PCI_QUIET and SIMPLELOCK_DEBUG.
|
90538 |
11-Feb-2002 |
julian |
In a threaded world, different priorities become properties of different entities. Make it so.
Reviewed by: jhb@freebsd.org (john baldwin)
|
90361 |
07-Feb-2002 |
julian |
Pre-KSE/M3 commit. This is a low-functionality change that changes the kernel to access the main thread of a process via the linked list of threads rather than assuming that it is embedded in the process. It IS still embedded there, but remove all the code that assumes that in preparation for the next commit which will actually move it out.
Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,
|
90263 |
05-Feb-2002 |
alfred |
Fix a race with free'ing vmspaces at process exit when vmspaces are shared.
Also introduce vm_endcopy instead of using pointer tricks when initializing new vmspaces.
The race occurred because of how the reference was utilized: test vmspace reference, possibly block, decrement reference
When sharing a vmspace between multiple processes it was possible for two processes exiting at the same time to test the reference count, possibly block and neither one free because they wouldn't see the other's update.
Submitted by: green
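The fix amounts to decrementing first and letting exactly the thread that drops the count to zero do the free; a C11-atomics sketch with illustrative names:

```c
#include <assert.h>
#include <stdatomic.h>

/* Cut-down vmspace with just a reference count; names are illustrative. */
struct vmspace_sketch {
    atomic_int vm_refcnt;
    int freed;
};

/*
 * Sketch of the fixed pattern: instead of "read the count, maybe block,
 * then decrement" (two exiting processes can both read >1 and neither
 * frees), decrement atomically and free only when the old value was 1.
 */
void vmspace_release(struct vmspace_sketch *vm)
{
    if (atomic_fetch_sub(&vm->vm_refcnt, 1) == 1)
        vm->freed = 1;       /* stand-in for the actual free */
}
```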
|
90033 |
31-Jan-2002 |
dillon |
GC P_BUFEXHAUST leftovers, we've had a new mechanism to avoid buffer cache lockups for over a year now.
MFC after: 0 days
|
89802 |
25-Jan-2002 |
dwmalone |
Remove a parameter name from a prototype.
|
89464 |
17-Jan-2002 |
bde |
Don't declare vm_swapout() in the NO_SWAPPING case when it is not defined.
Fixed some style bugs.
|
89319 |
14-Jan-2002 |
alfred |
Replace ffind_* with fget calls.
Make fget MPsafe.
Make fgetvp and fgetsock use the fget subsystem to reduce code bloat.
Push giant down in fpathconf().
|
89306 |
13-Jan-2002 |
alfred |
SMP Lock struct file, filedesc and the global file list.
Seigo Tanimura (tanimura) posted the initial delta.
I've polished it quite a bit reducing the need for locking and adapting it for KSE.
Locks:
1 mutex in each filedesc protects all the fields. protects "struct file" initialization, while a struct file is being changed from &badfileops -> &pipeops or something the filedesc should be locked.
1 mutex in each struct file protects the refcount fields. doesn't protect anything else. the flags used for garbage collection have been moved to f_gcflag which was the FILLER short, this doesn't need locking because the garbage collection is a single threaded container. could likely be made to use a pool mutex.
1 sx lock for the global filelist.
struct file * fhold(struct file *fp); /* increments reference count on a file */
struct file * fhold_locked(struct file *fp); /* like fhold but expects file to locked */
struct file * ffind_hold(struct thread *, int fd); /* finds the struct file in thread, adds one reference and returns it unlocked */
struct file * ffind_lock(struct thread *, int fd); /* ffind_hold, but returns file locked */
I still have to smp-safe the fget cruft, I'll get to that asap.
|
88900 |
05-Jan-2002 |
jhb |
Change the preemption code for software interrupt thread schedules and mutex releases to not require flags for the cases when preemption is not allowed:
The purpose of the MTX_NOSWITCH and SWI_NOSWITCH flags is to prevent switching to a higher priority thread on mutex release and swi schedule, respectively, when that switch is not safe. Now that the critical section API maintains a per-thread nesting count, the kernel can easily check whether or not it should switch without relying on flags from the programmer. This fixes a few bugs in that all current callers of swi_sched() used SWI_NOSWITCH, when in fact, only the ones called from fast interrupt handlers and the swi_sched of softclock needed this flag. Note that to ensure that swi_sched()'s in clock and fast interrupt handlers do not switch, these handlers have to be explicitly wrapped in critical_enter/exit pairs. Presently, just wrapping the handlers is sufficient, but in the future with the fully preemptive kernel, the interrupt must be EOI'd before critical_exit() is called. (critical_exit() can switch due to a deferred preemption in a fully preemptive kernel.)
I've tested the changes to the interrupt code on i386 and alpha. I have not tested ia64, but the interrupt code is almost identical to the alpha code, so I expect it will work fine. PowerPC and ARM do not yet have interrupt code in the tree so they shouldn't be broken. Sparc64 is broken, but that's been ok'd by jake and tmm who will be fixing the interrupt code for sparc64 shortly.
Reviewed by: peter Tested on: i386, alpha
|
88318 |
20-Dec-2001 |
dillon |
Fix a BUF_TIMELOCK race against BUF_LOCK and fix a deadlock in vget() against VM_WAIT in the pageout code. Both fixes involve adjusting the lockmgr's timeout capability so locks obtained with timeouts do not interfere with locks obtained without a timeout.
Hopefully MFC: before the 4.5 release
|
87834 |
14-Dec-2001 |
dillon |
This fixes a large number of bugs in our NFS client side code. A recent commit by Kirk also fixed a softupdates bug that could easily be triggered by server side NFS.
* An edge case with shared R+W mmap()'s and truncate whereby the system would inappropriately clear the dirty bits on still-dirty data. (applicable to all filesystems)
THIS FIX TEMPORARILY DISABLED PENDING FURTHER TESTING. see vm/vm_page.c line 1641
* The straddle case for VM pages and buffer cache buffers when truncating. (applicable to NFS client side)
* Possible SMP database corruption due to vm_pager_unmap_page() not clearing the TLB for the other cpu's. (applicable to NFS client side but could affect all filesystems). Note: not considered serious since the corruption occurs beyond the file EOF.
* When flushing a dirty buffer due to B_CACHE getting cleared, we were accidentally setting B_CACHE again (that is, bwrite() sets B_CACHE), when we really want it to stay clear after the write is complete. This resulted in a corrupt buffer. (applicable to all filesystems but probably only triggered by NFS)
* We have to call vtruncbuf() when ftruncate()ing to remove any buffer cache buffers. This is still tentative, I may be able to remove it due to the second bug fix. (applicable to NFS client side)
* vnode_pager_setsize() race against nfs_vinvalbuf()... we have to set n_size before calling nfs_vinvalbuf or the NFS code may recursively vnode_pager_setsize() to the original value before the truncate. This is what was causing the user mmap bus faults in the nfs tester program. (applicable to NFS client side)
* Fix to softupdates (see ufs/ffs/ffs_inode.c 1.73, commit made by Kirk).
Testing program written by: Avadis Tevanian, Jr. Testing program supplied by: jkh / Apple (see Dec2001 posting to freebsd-hackers with Subject 'NFS: How to make FreeBS fall on its face in one easy step') MFC after: 1 week
|
87157 |
01-Dec-2001 |
luigi |
vm/vm_kern.c: rate limit (to once per second) diagnostic printf when you run out of mbuf address space.
kern/subr_mbuf.c: print a warning message when mb_alloc fails, again rate-limited to at most once per second. This covers other cases of mbuf allocation failures. Probably it also overlaps the one handled in vm/vm_kern.c, so maybe the latter should go away.
This warning will let us gradually remove the printfs that are scattered across most network drivers to report mbuf allocation failures. Those are potentially dangerous, in that they are not rate-limited and can easily cause systems to panic.
Unless there is disagreement (which does not seem to be the case judging from the discussion on -net so far), and because this is sort of a safety bugfix, I plan to commit a similar change to STABLE during the weekend (it affects kern/uipc_mbuf.c there).
Discussed-with: jlemon, silby and -net
|
86475 |
17-Nov-2001 |
jlemon |
When laying out objects in a ZONE_INTERRUPT zone, allow them to cross a page boundary, since we've already allocated all our contiguous kva space up front. This eliminates some memory wastage, and allows us to actually reach the # of objects that were specified in the zinit() call.
Reviewed by: peter, dillon
|
86236 |
09-Nov-2001 |
dillon |
Fix deadlock introduced in 1.73 (Jan 1998). The paging-in-progress count on a vnode-backed object must be incremented *after* obtaining the vnode lock. If it is bumped before obtaining the vnode lock we can deadlock against vtruncbuf().
Submitted by: peter, ps MFC after: 3 days
|
86092 |
05-Nov-2001 |
dillon |
Adjust vnode_pager_input_smlfs() to not attempt to BMAP blocks beyond the file EOF. This works around a bug in the ISOFS (CDRom) BMAP code which returns bogus values for requests beyond the file EOF rather than returning an error, resulting in either corrupt data being mmap()'d beyond the file EOF or resulting in a seg-fault on the last page of a mmap()'d file (mmap()s of CDRom files).
Reported by: peter / Yahoo MFC after: 3 days
|
85762 |
31-Oct-2001 |
dillon |
Don't let pmap_object_init_pt() exhaust all available free pages (allocating pv entries w/ zalloci) when called in a loop due to an madvise(). It is possible to completely exhaust the free page list and cause a system panic when an expected allocation fails.
|
85541 |
26-Oct-2001 |
dillon |
Move recently added procedure which was incorrectly placed within an #ifdef DDB block.
|
85517 |
26-Oct-2001 |
dillon |
Implement kern.maxvnodes. Adjusting kern.maxvnodes now actually has a real effect.
Optimize vfs_msync(). Avoid having to continually drop and re-obtain mutexes when scanning the vnode list. Improves looping case by 500%.
Optimize ffs_sync(). Avoid having to continually drop and re-obtain mutexes when scanning the vnode list. This makes a couple of assumptions, which I believe are ok, in regards to vnode stability when the mount list mutex is held. Improves looping case by 500%.
(more optimization work is needed on top of these fixes)
MFC after: 1 week
|
85272 |
21-Oct-2001 |
dillon |
Syntax cleanup and documentation, no operational changes.
MFC after: 1 day
|
85227 |
20-Oct-2001 |
iedowse |
Move the code that computes the system load average from vm_meter.c to kern_synch.c in preparation for adding some jitter to the inter-sample time.
Note that the "vm.loadavg" sysctl still lives in vm_meter.c which isn't the right place, but it is appropriate for the current (bad) name of that sysctl.
Suggested by: jhb (some time ago) Reviewed by: bde
|
85070 |
17-Oct-2001 |
dillon |
contigmalloc1() could cause the vm_page_zero_count to become incorrect. Properly track the count.
Submitted by: mark tinguely <tinguely@web.cs.ndsu.nodak.edu>
|
85016 |
15-Oct-2001 |
tegge |
Don't use an uninitialized field reserved for callers in the bio structure passed to swap_pager_strategy(). Instead, use a field reserved for drivers and initialize it before usage.
Reviewed by: dillon
|
84933 |
14-Oct-2001 |
tegge |
Don't remove all mappings of a swapped out process if the vm map contained wired entries. vm_fault_unwire() depends on the mapping being intact.
Reviewed by: dillon
|
84932 |
14-Oct-2001 |
tegge |
Fix locking violations during page wiring:
- vm map entries are not valid after the map has been unlocked.
- An exclusive lock on the map is needed before calling vm_map_simplify_entry().
Fix cleanup after page wiring failure to unwire all pages that had been successfully wired before the failure was detected.
Reviewed by: dillon
|
84869 |
13-Oct-2001 |
dillon |
Make contigmalloc[1]() create the vm_map / underlying wired pages in the kernel map and object in a manner that contigfree() is actually able to free. Previously contigfree() freed up the KVA space but could not unwire & free the underlying VM pages due to mismatched pageability between the map entry and the VM pages.
Submitted by: Thomas Moestl <tmoestl@gmx.net> Testing by: mark tinguely <tinguely@web.cs.ndsu.nodak.edu> MFC after: 3 days
|
84854 |
12-Oct-2001 |
dillon |
Finally fix the VM bug where a file whose EOF occurs in the middle of a page would sometimes prevent a dirty page from being cleaned, even when synced, resulting in the dirty page being re-flushed to disk every 30-60 seconds or so, forever. The problem is that when the filesystem flushes a page to its backing file it typically does not clear dirty bits representing areas of the page that are beyond the file EOF. If the file is also mmap()'d and a fault is taken, vm_fault (properly, as it is required to) sets the vm_page_t->dirty bits to VM_PAGE_BITS_ALL. This combination could leave us with an uncleanable, unfreeable page.
The solution is to have the vnode_pager detect the edge case and manually clear the dirty bits representing areas beyond the file EOF. The filesystem does the rest and the page comes up clean after the write completes.
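The edge case reduces to masking off the dirty bits that lie past EOF. A minimal userland sketch, with page_bits() modeled on the kernel's vm_page_bits() helper; the function names here are illustrative, not the actual vnode_pager code:

```c
#include <stdint.h>

#define PAGE_SIZE 4096
#define DEV_BSIZE 512   /* one dirty bit per DEV_BSIZE chunk -> 8 bits/page */

/* Mask covering byte range [base, base+size) of a page, rounded out to
 * DEV_BSIZE chunks; modeled on the kernel's vm_page_bits(). */
static uint8_t
page_bits(int base, int size)
{
        int first = base / DEV_BSIZE;
        int last = (base + size + DEV_BSIZE - 1) / DEV_BSIZE;

        return (uint8_t)(((1u << last) - 1) & ~((1u << first) - 1));
}

/* Clear the dirty bits that lie entirely beyond the file EOF, so a page
 * straddling EOF can come up clean once its valid part has been written. */
static uint8_t
clip_dirty_past_eof(uint8_t dirty, uint64_t eof, uint64_t page_off)
{
        if (eof >= page_off + PAGE_SIZE)
                return dirty;           /* page wholly below EOF: untouched */
        if (eof <= page_off)
                return 0;               /* page wholly beyond EOF */
        return dirty & page_bits(0, (int)(eof - page_off));
}
```

With this clipping in the pager, the dirty bits set by vm_fault for the area past EOF no longer survive the writeback.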
MFC after: 3 days
|
84827 |
11-Oct-2001 |
jhb |
Change the kernel's ucred API as follows: - crhold() returns a reference to the ucred whose refcount it bumps. - crcopy() now simply copies the credentials from one credential to another and has no return value. - a new crshared() primitive is added which returns true if a ucred's refcount is > 1 and false (0) otherwise.
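A minimal userland model of the revised API; the ucred fields shown are representative stand-ins, not the real struct:

```c
/* Minimal model of the reworked ucred API. */
struct ucred {
        int cr_ref;             /* reference count */
        int cr_uid;             /* stand-in for the real credential data */
};

/* crhold() bumps the refcount and now returns the held credential. */
static struct ucred *
crhold(struct ucred *cr)
{
        cr->cr_ref++;
        return (cr);
}

/* crshared() is true iff the credential has more than one reference. */
static int
crshared(struct ucred *cr)
{
        return (cr->cr_ref > 1);
}

/* crcopy() now simply copies credential data; no return value. */
static void
crcopy(struct ucred *dst, const struct ucred *src)
{
        int ref = dst->cr_ref;  /* the destination keeps its own refcount */

        *dst = *src;
        dst->cr_ref = ref;
}
```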
|
84812 |
11-Oct-2001 |
jhb |
Add missing includes of sys/ktr.h.
|
84783 |
10-Oct-2001 |
ps |
Make MAXTSIZ, DFLDSIZ, MAXDSIZ, DFLSSIZ, MAXSSIZ, SGROWSIZ loader tunable.
Reviewed by: peter MFC after: 2 weeks
|
84488 |
04-Oct-2001 |
iedowse |
Remove the SSLEEP case from the load average computation. This has been a no-op for as long as our CVS history goes back. Processes in state SSLEEP could only be counted if p_slptime == 0, but immediately before loadav() is called, schedcpu() has just incremented p_slptime on all SSLEEP processes.
|
83986 |
26-Sep-2001 |
rwatson |
o Modify access control checks in mmap() to use securelevel_gt() instead of direct variable access.
Obtained from: TrustedBSD Project
|
83366 |
12-Sep-2001 |
julian |
KSE Milestone 2 Note ALL MODULES MUST BE RECOMPILED Make the kernel aware that there are smaller units of scheduling than the process (but only allow one thread per process at this time). This is functionally equivalent to the previous -current except that there is a thread associated with each process.
Sorry john! (your next MFC will be a doosie!)
Reviewed by: peter@freebsd.org, dillon@freebsd.org
X-MFC after: ha ha ha ha
|
83276 |
10-Sep-2001 |
peter |
Rip some well duplicated code out of cpu_wait() and cpu_exit() and move it to the MI area. KSE touched cpu_wait() which had the same change replicated five ways for each platform. Now it can just do it once. The only MD parts seemed to be dealing with fpu state cleanup and things like vm86 cleanup on x86. The rest was identical.
XXX: ia64 and powerpc did not have cpu_throw(), so I've put a functional stub in place.
Reviewed by: jake, tmm, dillon
|
82756 |
01-Sep-2001 |
jhb |
Process priority is locked by the sched_lock, not the proc lock.
|
82699 |
31-Aug-2001 |
dillon |
make swapon() MPSAFE (will adjust syscalls.master later)
|
82697 |
31-Aug-2001 |
dillon |
mark obreak() and ovadvise() as being MPSAFE
|
82612 |
31-Aug-2001 |
dillon |
Cleanup
|
82314 |
25-Aug-2001 |
peter |
Implement idle zeroing of pages. I've been tinkering with this on and off since John Dyson left his work-in-progress.
It is off by default for now. sysctl vm.zeroidle_enable=1 to turn it on.
There are some hacks here to deal with the present lack of preemption - we yield after doing a small number of pages since we won't preempt otherwise.
This is basically Matt's algorithm [with hysteresis] with an idle process to call it in a similar way it used to be called from the idle loop.
I cleaned up the includes a fair bit here too.
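The yield-every-few-pages hack can be sketched in userland C; ZIDLE_BATCH and the helper names below are made up for illustration, not taken from the committed code:

```c
#include <string.h>

#define PAGE_SIZE   4096
#define ZIDLE_BATCH 8           /* illustrative: yield after this many pages */

static int nyields;             /* counts the explicit yields */

static void
zero_yield(void)
{
        nyields++;              /* stand-in for actually yielding the CPU */
}

/* Zero 'npages' pages, yielding every ZIDLE_BATCH pages since, without
 * preemption, the idle-priority zeroing thread would otherwise hog the CPU. */
static void
zero_idle_pages(unsigned char pages[][PAGE_SIZE], int npages)
{
        int done = 0;

        while (done < npages) {
                memset(pages[done], 0, PAGE_SIZE);
                if (++done % ZIDLE_BATCH == 0)
                        zero_yield();
        }
}
```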
|
82290 |
24-Aug-2001 |
dillon |
Remove support for the badly broken MAP_INHERIT (from -current only).
|
82127 |
22-Aug-2001 |
dillon |
Move most of the kernel submap initialization code, including the timeout callwheel and buffer cache, out of the platform-specific areas and into the machine-independent area. i386 and alpha adjusted here. Other CPUs can be fixed piecemeal.
Reviewed by: freebsd-smp, jake
|
82126 |
22-Aug-2001 |
dillon |
KASSERT if vm_page_t->wire_count overflows.
|
81933 |
20-Aug-2001 |
dillon |
Limit the amount of KVM reserved for the buffer cache and for swap-meta information. The default limits only affect machines with > 1GB of RAM and can be overridden with two new kernel conf variables VM_SWZONE_SIZE_MAX and VM_BCACHE_SIZE_MAX, or with loader variables kern.maxswzone and kern.maxbcache. This has the effect of leaving more KVM available for sizing NMBCLUSTERS and 'maxusers' and should avoid trip-ups where a sysadmin adds memory to a machine and then sees the kernel panic on boot due to running out of KVM.
Also change the default swap-meta auto-sizing calculation to allocate half of what it was previously allocating. The prior defaults were way too high. Note that we cannot afford to run out of swap-meta structures so we still stay somewhat conservative here.
|
81399 |
10-Aug-2001 |
jhb |
- Remove asleep(), await(), and M_ASLEEP. - Callers of asleep() and await() have been converted to calling tsleep(). The only caller outside of M_ASLEEP was the ata driver, which called both asleep() and await() with spl-raised, so there was no need for the asleep() and await() pair. M_ASLEEP was unused.
Reviewed by: jasone, peter
|
81397 |
10-Aug-2001 |
jhb |
- Remove asleep(), await(), and M_ASLEEP. - Callers of asleep() and await() have been converted to calling tsleep(). The only caller outside of M_ASLEEP was the ata driver, which called both asleep() and await() with spl-raised, so there was no need for the asleep() and await() pair. M_ASLEEP was unused.
Reviewed by: jasone, peter
|
81148 |
05-Aug-2001 |
tmm |
Add a missing semicolon to unbreak the kernel build with INVARIANTS (which was unfortunately turned off in the configuration I used for the last test build).
Spotted by: jake Pointy hat to: tmm
|
81140 |
04-Aug-2001 |
jhb |
Whitespace fixes.
|
81136 |
04-Aug-2001 |
tmm |
Add a zdestroy() function to the zone allocator. This is needed for the unload case of modules that use their own zones. It has been tested with the nfs module.
|
81029 |
02-Aug-2001 |
alfred |
Fixups for the initial allocation by dillon: 1) allocate fewer buckets 2) when failing to allocate swap zone, keep reducing the zone by a third rather than a half in order to reduce the chance of allocating way too little.
I also moved around some code for readability.
Suggested by: dillon Reviewed by: dillon
|
80705 |
31-Jul-2001 |
jake |
Oops. Last commit to vm_object.c should have got these files too.
Remove the use of atomic ops to manipulate vm_object and vm_page flags. Giant is required here, so they are superfluous.
Discussed with: dillon
|
80704 |
31-Jul-2001 |
jake |
Remove the use of atomic ops to manipulate vm_object and vm_page flags. Giant is required here, so they are superfluous.
Discussed with: dillon
|
80517 |
28-Jul-2001 |
iedowse |
Permit direct swapping to NFS regular files using swapon(2). We already allow this for NFS swap configured via BOOTP, so it is known to work fine.
For many diskless configurations it is more flexible to have the client set up swapping itself; it can recreate a sparse swap file to save on server space for example, and it works with a non-NFS root filesystem such as an in-kernel filesystem image.
|
80204 |
23-Jul-2001 |
assar |
make vm_page_select_cache static
Requested by: bde
|
80089 |
21-Jul-2001 |
assar |
(vm_page_select_cache): add prototype
|
79744 |
15-Jul-2001 |
benno |
The i386-specific includes in this file were "fixed" by bracketing them with #ifndef __alpha__. Fix this for the rest of the world by turning it into #ifdef __i386__.
Reviewed by: obrien
|
79443 |
09-Jul-2001 |
des |
Fix missing newline and terminator at the end of the vm.zone sysctl.
|
79273 |
05-Jul-2001 |
mjacob |
Apply field bandages to the includes so compiles happen on alpha.
|
79265 |
05-Jul-2001 |
dillon |
Move vm_page_zero_idle() from machine-dependent sections to a machine-independent source file, vm/vm_zeroidle.c. It was exactly the same for all platforms and updating them all was getting annoying.
|
79263 |
04-Jul-2001 |
dillon |
Reorg vm_page.c into vm_page.c, vm_pageq.c, and vm_contig.c (for contigmalloc). Also removed some spl's and added some VM mutexes, but they are not actually used yet, so this commit does not really make any operational changes to the system.
vm_page.c relates to vm_page_t manipulation, including high level deactivation, activation, etc... vm_pageq.c relates to finding free pages and acquiring exclusive access to a page queue (exclusivity part not yet implemented). And the world still builds... :-)
|
79248 |
04-Jul-2001 |
dillon |
Change inlines back into mainline code in preparation for mutexing. Also, most of these inlines had been bloated in -current far beyond their original intent. Normalize prototypes and function declarations to be ANSI only (half already were). And do some general cleanup.
(kernel size also reduced by 50-100K, but that isn't the prime intent)
|
79242 |
04-Jul-2001 |
dillon |
whitespace / register cleanup
|
79224 |
04-Jul-2001 |
dillon |
With Alfred's permission, remove vm_mtx in favor of a fine-grained approach (this commit is just the first stage). Also add various GIANT_ macros to formalize the removal of Giant, making it easy to test in a more piecemeal fashion. These macros will allow us to test fine-grained locks to a degree before removing Giant, and also after, and to remove Giant in a piecemeal fashion via sysctl's on those subsystems which the authors believe can operate without Giant.
|
79127 |
03-Jul-2001 |
jhb |
Fix a XXX comment by moving the initialization of the number of pbuf's for the vnode pager to a new vnode pager init method instead of making it a hack in getpages().
|
78622 |
22-Jun-2001 |
jhb |
- Protect all accesses to nsw_[rw]count{,_{,a}sync} with the pbuf mutex. - Don't drop the vm mutex while grabbing the pbuf mutex to manipulate said variables.
|
78592 |
22-Jun-2001 |
bmilekic |
Introduce numerous SMP friendly changes to the mbuf allocator. Namely, introduce a modified allocation mechanism for mbufs and mbuf clusters; one which can scale under SMP and which offers the possibility of resource reclamation to be implemented in the future. Notable advantages:
o Reduce contention for SMP by offering per-CPU pools and locks. o Better use of data cache due to per-CPU pools. o Much less code cache pollution due to excessively large allocation macros. o Framework for `grouping' objects from same page together so as to be able to possibly free wired-down pages back to the system if they are no longer needed by the network stacks.
Additional things changed with this addition:
- Moved some mbuf specific declarations and initializations from sys/conf/param.c into mbuf-specific code where they belong. - m_getclr() has been renamed to m_get_clrd() because the old name is really confusing. m_getclr() HAS been preserved though and is defined to the new name. No tree sweep has been done "to change the interface," as the old name will continue to be supported and is not deprecated. The change was merely done because m_getclr() sounds too much like "m_get a cluster." - TEMPORARILY disabled mbtypes statistics displaying in netstat(1) and systat(1) (see TODO below). - Fixed systat(1) to display number of "free mbufs" based on new per-CPU stat structures. - Fixed netstat(1) to display new per-CPU stats based on sysctl-exported per-CPU stat structures. All info is fetched via sysctl.
TODO (in order of priority):
- Re-enable mbtypes statistics in both netstat(1) and systat(1) after introducing an SMP friendly way to collect the mbtypes stats under the already introduced per-CPU locks (i.e. hopefully don't use atomic() - it seems too costly for a mere stat update, especially when other locks are already present). - Optionally have systat(1) display not only "total free mbufs" but also "total free mbufs per CPU pool." - Fix minor length-fetching issues in netstat(1) related to recently re-enabled option to read mbuf stats from a core file. - Move reference counters at least for mbuf clusters into an unused portion of the cluster itself, to save space and avoid the need to allocate a counter. - Look into introducing resource freeing possibly from a kproc.
Reviewed by (in parts): jlemon, jake, silby, terry Tested by: jlemon (Intel & Alpha), mjacob (Intel & Alpha) Preliminary performance measurements: jlemon (and me, obviously) URL: http://people.freebsd.org/~bmilekic/mb_alloc/
|
78521 |
20-Jun-2001 |
jhb |
Don't lock around swap_pager_swap_init() that is only called once during the pagedaemon's startup code since it calls malloc which results in lock order reversals.
|
78481 |
20-Jun-2001 |
jhb |
Put the scheduler, vmdaemon, and pagedaemon kthreads back under Giant for now. The proc locking isn't actually safe yet and won't be until the proc locking is finished.
|
78099 |
11-Jun-2001 |
dillon |
Cleanup the tabbing
|
77948 |
09-Jun-2001 |
dillon |
Two fixes to the out-of-swap process termination code. First, start killing processes a little earlier to avoid a deadlock. Second, when calculating the 'largest process' do not just count RSS. Instead count the RSS + SWAP used by the process. Without this the code tended to kill small inconsequential processes like, oh, sshd, rather than one of the many 'eatmem 200MB' I run on a whim :-). This fix has been extensively tested on -stable and somewhat tested on -current and will be MFCd in a few days.
Shamed into fixing this by: ps
|
77604 |
01-Jun-2001 |
tmm |
Change the way information about swap devices is exported to be more canonical: define a versioned struct xswdev, and add a sysctl node handler that allows the user to get this structure for a certain device index by specifying this index as last element of the MIB. This new node handler, vm.swap_info, replaces the old vm.nswapdev and vm.swapdevX.* (where X was the index) sysctls.
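A sketch of the idea in plain C, with the struct abridged and the handler modeled as a simple indexed lookup; the real vm.swap_info handler also deals with MIB parsing and SYSCTL_OUT, and the exact field list here is illustrative:

```c
#include <errno.h>

/* Versioned export structure, after the spirit of struct xswdev. */
#define XSWDEV_VERSION  1

struct xswdev {
        unsigned xsw_version;   /* set to XSWDEV_VERSION by the kernel */
        int      xsw_nblks;     /* size of the device in blocks */
        int      xsw_used;      /* blocks currently in use */
};

static struct xswdev swdevs[] = {       /* fake device table for illustration */
        { XSWDEV_VERSION, 1024, 100 },
        { XSWDEV_VERSION, 2048, 0 },
};
static const int nswdev = 2;

/* Model of the handler: the device index arrives as the last element of
 * the MIB, and an out-of-range index is reported as an error. */
static int
swap_info(int index, struct xswdev *out)
{
        if (index < 0 || index >= nswdev)
                return (ENOENT);
        *out = swdevs[index];
        return (0);
}
```

Because the struct carries a version field, userland tools can detect layout changes instead of breaking silently, which is what made the per-device vm.swapdevX.* sysctls dispensable.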
|
77582 |
01-Jun-2001 |
tmm |
Clean up the code exporting interrupt statistics via sysctl a bit: - move the sysctl code to kern_intr.c - do not use INTRCNT_COUNT, but rather eintrcnt - intrcnt to determine the length of the intrcnt array - move the declarations of intrnames, eintrnames, intrcnt and eintrcnt from machine-dependent include files to sys/interrupt.h - remove the hw.nintr sysctl, it is not needed. - fix various style bugs
Requested by: bde Reviewed by: bde (some time ago)
|
77398 |
29-May-2001 |
jhb |
Don't hold the VM lock across VOP's and other things that can sleep.
|
77139 |
24-May-2001 |
jhb |
Stick VM syscalls back under Giant if the BLEED option is not defined.
|
77115 |
24-May-2001 |
dillon |
This patch implements O_DIRECT about 80% of the way. It takes a patchset Tor created a while ago, removes the raw I/O piece (that has cache coherency problems), and adds a buffer cache / VM freeing piece.
Essentially this patch causes O_DIRECT I/O to not be left in the cache, but does not prevent it from going through the cache, hence the 80%. For the last 20% we need a method by which the I/O can be issued directly to buffer supplied by the user process and bypass the buffer cache entirely, but still maintain cache coherency.
I also have the code working under -stable but the changes made to sys/file.h may not be MFCable, so an MFC is not on the table yet.
Submitted by: tegge, dillon
|
77094 |
23-May-2001 |
jhb |
- Assert Giant is held in the vnode pager methods. - Lock the VM while walking down a vm_object's backing_object list in vnode_pager_lock().
|
77093 |
23-May-2001 |
jhb |
- Add in several asserts of vm_mtx. - Assert Giant in vm_pageout_scan() for the vnode hacking that it does. - Don't hold vm_mtx around vget() or vput(). - Lock Giant when calling vm_pageout_scan() from the pagedaemon. Also, lock curproc while setting the P_BUFEXHAUST flag. - For now we still hold Giant for all of the vm_daemon. When process limits are locked we will only need Giant for swapout_procs().
|
77091 |
23-May-2001 |
jhb |
- Assert that the vm lock is held for all of _vm_object_allocate(). - Restore the previous order of setting up a new vm_object. The previous order had a small bug where we zero'd out the flags after we set the OBJ_ONEMAPPING flag. - Add several asserts of vm_mtx. - Assert Giant is held rather than locking and unlocking it in a few places. - Add in some #ifdef objlocks code to lock individual vm objects when vm objects each have their own lock someday. - Don't bother acquiring the allproc lock for a ddb command. If DDB blocked on the lock, that would be worse than having an inconsistent allproc list.
|
77090 |
23-May-2001 |
jhb |
- Add lots of vm_mtx assertions. - Add a few KTR tracepoints to track the addition and removal of vm_map_entry's and the creation and freeing of vmspace's. - Adjust a few portions of code so that we update the process' vmspace pointer to its new vmspace before freeing the old vmspace.
|
77089 |
23-May-2001 |
jhb |
- Lock the VM around the pmap_swapin_proc() call in faultin(). - Don't lock Giant in the scheduler() function except for when calling faultin(). - In swapout_procs(), lock the VM before the process to avoid a lock order violation. - In swapout_procs(), release the allproc lock before calling swapout(). We restart the process scan after swapping out a process. - In swapout_procs(), un #if 0 the code to bump the vmspace reference count and lock the process' vm structures. This bug was introduced by me and could result in the vmspace being free'd out from under a running process. - Fix an old bug where the vmspace reference count was not free'd if we failed the swap_idle_threshold2 test.
|
77088 |
23-May-2001 |
jhb |
- Fix the sw_alloc_interlock to actually lock itself when the lock is acquired. - Assert Giant is held in the strategy, getpages, and putpages methods and the getchainbuf, flushchainbuf, and waitchainbuf functions. - Always call flushchainbuf() w/o the VM lock.
|
77087 |
23-May-2001 |
jhb |
Assert Giant is held for the device pager alloc and getpages methods since we call the mmap method of the cdevsw of the device we are mmap'ing.
|
77083 |
23-May-2001 |
jhb |
- Obtain Giant in mmap() syscall while messing with file descriptors and vnodes. - Fix an old bug that would leak a reference to a fd if the vnode being mmap'd wasn't of type VREG or VCHR. - Lock Giant in vm_mmap() around calls into the VM that can call into pager routines that need Giant or into other VM routines that need Giant. - Replace code that used a goto to jump around the else branch of a test to use an else branch instead.
|
77080 |
23-May-2001 |
jhb |
Acquire Giant around vm_map_remove() inside of the obreak() syscall for vm_object_terminate().
|
77077 |
23-May-2001 |
jhb |
Take a more conservative approach and still lock Giant around VM faults for now.
|
77062 |
23-May-2001 |
jhb |
Set the phys_pager_alloc_lock to 1 when it is acquired so that it is actually locked.
|
77036 |
23-May-2001 |
alfred |
Acquire Giant when playing with the buffer cache and doing I/O. Use msleep against the vm mutex while waiting for a page I/O to complete.
|
77010 |
22-May-2001 |
alfred |
Acquire the vm mutex in swp_pager_async_iodone. Don't call swp_pager_async_iodone with the mutex held.
|
76981 |
22-May-2001 |
jhb |
Remove duplicate include and sort includes.
|
76978 |
22-May-2001 |
jhb |
Sort includes.
|
76974 |
22-May-2001 |
jhb |
Unlock the VM lock at the end of munlock() instead of locking it again.
|
76973 |
22-May-2001 |
jhb |
Sort includes from previous commit.
|
76949 |
22-May-2001 |
jhb |
Sort includes.
|
76827 |
19-May-2001 |
alfred |
Introduce a global lock for the vm subsystem (vm_mtx).
vm_mtx does not recurse and is required for most low level vm operations.
faults can not be taken without holding Giant.
Memory subsystems can now call the base page allocators safely.
Almost all atomic ops were removed as they are covered under the vm mutex.
Alpha and ia64 now need to catch up to i386's trap handlers.
FFS and NFS have been tested, other filesystems will need minor changes (grabbing the vm lock when twiddling page properties).
Reviewed (partially) by: jake, jhb
|
76778 |
18-May-2001 |
jhb |
- Use a timeout for the tsleep in scheduler() instead of having vmmeter() wakeup proc0 by hand to enforce the timeout. - When swapping out a process, keep the process locked via the proc lock from the first checks up until we clear PS_INMEM and set PS_SWAPPING in swapout(). The swapout() function now must be called with the proc lock held and releases it before returning. - Comment out the code to attempt to lock a process' VM structures before swapping out. It is broken in that it releases the lock after obtaining it. If it does grab the lock, it needs to hand it off to swapout() instead of releasing it. This can be revisited when the VM is locked as this is a valid test to perform. It also causes a lock order reversal for the time being, which is the immediate cause for temporarily disabling it.
|
76773 |
17-May-2001 |
jhb |
During the code to pick a process to kill when memory is exhausted, keep the process in question locked as soon as we find it and determine it to be eligible until we actually kill it. To avoid deadlock, we don't block on the process lock but skip any process that is already locked during our search.
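The skip-if-locked scan can be modeled in userland with pthread mutexes standing in for proc locks; pick_victim() and its fields are illustrative, not the kernel code:

```c
#include <stddef.h>
#include <pthread.h>

struct proc {
        pthread_mutex_t p_mtx;
        int p_size;             /* stand-in for the "largest process" metric */
};

/*
 * Scan for the biggest eligible process, taking each candidate's lock
 * with trylock so a process already locked elsewhere is skipped rather
 * than deadlocked on.  The winner is returned still locked, so it stays
 * eligible until it is actually killed.
 */
static struct proc *
pick_victim(struct proc procs[], int n)
{
        struct proc *big = NULL;
        int i;

        for (i = 0; i < n; i++) {
                if (pthread_mutex_trylock(&procs[i].p_mtx) != 0)
                        continue;       /* already locked: skip it */
                if (big == NULL || procs[i].p_size > big->p_size) {
                        if (big != NULL)
                                pthread_mutex_unlock(&big->p_mtx);
                        big = &procs[i];        /* new leader stays locked */
                } else
                        pthread_mutex_unlock(&procs[i].p_mtx);
        }
        return (big);
}
```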
|
76641 |
15-May-2001 |
jhb |
- Use PROC_LOCK_ASSERT instead of a direct mtx_assert. - Don't hold Giant in the swapper daemon while we walk the list of processes looking for a process to swap back in. - Don't bother grabbing the sched_lock while checking a process' sleep time in swapout_procs() to ensure that a process has been idle for at least swap_idle_threshold2 before swapping it out. If we lose the race we just let a process stay in memory until the next call of swapout_procs(). - Remove some unneeded spl's, sched_lock does all the locking needed in this case.
|
76322 |
06-May-2001 |
phk |
biofinish(struct bio *, struct devstat *, int error) is actually more general than bioerror().
Most of this patch is generated by scripts.
|
76244 |
03-May-2001 |
markm |
Putting sys/lockmgr.h in here allows us to depollute userland includes a bit. OK'ed by: bde
|
76166 |
01-May-2001 |
markm |
Undo part of the tangle of having sys/lock.h and sys/mutex.h included in other "system" header files.
Also help the deprecation of lockmgr.h by making it a sub-include of sys/lock.h and removing sys/lockmgr.h from kernel .c files.
Sort sys/*.h includes where possible in affected files.
OK'ed by: bde (with reservations)
|
76117 |
29-Apr-2001 |
grog |
Revert consequences of changes to mount.h, part 2.
Requested by: bde
|
76084 |
27-Apr-2001 |
alfred |
Address a number of problems with sysctl_vm_zone().
The zone allocator's locks should be leaf locks, meaning that they should never be held when entering into another subsystem. However, the sysctl grabs the zone global mutex and individual zone mutexes, and while holding them it calls SYSCTL_OUT, which recurses into the VM subsystem in order to wire user memory to do a safe copy. This can block and cause lock order reversals.
To fix this: lock the zone global mutex; get a count of the number of zones; unlock the global; allocate temporary storage; format and SYSCTL_OUT the banner; relock the global; traverse the list, making sure we haven't looped more than the initial count taken, to avoid overflowing the allocated buffer; lock each node; read values and format them into the buffer; unlock the individual node; unlock the global; format and SYSCTL_OUT the rest of the data; free storage; return.
Other problems included not checking for errors when doing sysctl out of the column header. Fixed.
Inconsistent termination of the copied string. Fixed.
Objected to by: des (for not using sbuf)
Since the output is not variable length and I'm actually over-allocating significantly, and I'd like to get this fixed now, I'll work on the sbuf conversion at a later date. I would not object to someone else taking it upon themselves to convert it to sbuf. I hold no MAINTAINER rights to this code (for now).
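The count-then-bounded-traverse pattern described above can be sketched in userland C with stub locks; snapshot_zones() is illustrative, not the actual sysctl_vm_zone():

```c
#include <stdlib.h>

/* Toy zone list; the lock calls are stand-ins for the real mutexes. */
struct zone {
        int z_used;
        struct zone *z_next;
};

static int zone_global_locked;
static void zone_global_lock(void)   { zone_global_locked = 1; }
static void zone_global_unlock(void) { zone_global_locked = 0; }

/*
 * The reworked shape: take a count under the global lock, drop the lock
 * to allocate (which may sleep), then re-take it and never write more
 * entries than the initial count, even if the list grew in between.
 */
static int
snapshot_zones(struct zone *head, int **out)
{
        struct zone *z;
        int count = 0, i = 0;
        int *buf;

        zone_global_lock();
        for (z = head; z != NULL; z = z->z_next)
                count++;
        zone_global_unlock();

        buf = malloc(sizeof(*buf) * (count ? count : 1));
        if (buf == NULL)
                return (-1);

        zone_global_lock();
        for (z = head; z != NULL && i < count; z = z->z_next)
                buf[i++] = z->z_used;   /* bounded: cannot overflow buf */
        zone_global_unlock();

        *out = buf;
        return (i);
}
```

The key point is that no lock is held across the (potentially sleeping) allocation and no SYSCTL_OUT-style copy happens with a zone mutex held.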
|
75858 |
23-Apr-2001 |
grog |
Correct #includes to work with fixed sys/mount.h.
|
75692 |
19-Apr-2001 |
alfred |
vnode_pager_freepage() is really vm_page_free() in disguise, nuke vnode_pager_freepage() and replace all calls to it with vm_page_free()
|
75675 |
18-Apr-2001 |
alfred |
Protect pager object creation with sx locks.
Protect pager object list manipulation with a mutex.
It doesn't look possible to combine them under a single sx lock because creation may block and we can't have the object list manipulation block on anything other than a mutex because of interrupt requests.
|
75644 |
18-Apr-2001 |
alfred |
Fix the botched rev 1.59 where I made it such that without INVARIANTS the map is never locked.
Submitted by: tegge
|
75580 |
17-Apr-2001 |
phk |
This patch removes the VOP_BWRITE() vector.
VOP_BWRITE() was a hack which made it possible for NFS client side to use struct buf with non-bio backing.
This patch takes a more general approach and adds a bp->b_op vector where more methods can be added.
The success of this patch depends on bp->b_op being initialized all relevant places for some value of "relevant" which is not easy to determine. For now the buffers have grown a b_magic element which will make such issues a tiny bit easier to debug.
|
75523 |
15-Apr-2001 |
alfred |
use TAILQ_FOREACH, fix a comment's location
|
75477 |
13-Apr-2001 |
alfred |
if/panic -> KASSERT
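The conversion pattern, sketched with userland stand-ins; note the real KASSERT takes parenthesized printf-style arguments and compiles away unless the kernel is built with INVARIANTS:

```c
#include <stdio.h>
#include <stdlib.h>

/* Userland stand-in for the kernel's panic(). */
static void
panic(const char *msg)
{
        fprintf(stderr, "panic: %s\n", msg);
        abort();
}

/* Simplified KASSERT: the kernel version takes (exp, (fmt, ...)). */
#define KASSERT(exp, msg) do {          \
        if (!(exp))                     \
                panic(msg);             \
} while (0)

/* Before: "if (wire_count < 0) panic(...);"  After: one self-describing
 * assertion that documents the invariant where it is checked. */
static int
wire_count_check(int wire_count)
{
        KASSERT(wire_count >= 0, "negative wire_count");
        return (wire_count);
}
```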
|
75474 |
13-Apr-2001 |
alfred |
protect pbufs and associated counts with a mutex
|
75473 |
13-Apr-2001 |
alfred |
use %p for pointer printf, include sys/systm.h for printf proto
|
75462 |
13-Apr-2001 |
alfred |
Use a macro wrapper over printf along with KASSERT to reduce the amount of code here.
|
75452 |
12-Apr-2001 |
alfred |
Remove truncated part from comment.
|
74927 |
28-Mar-2001 |
jhb |
Convert the allproc and proctree locks from lockmgr locks to sx locks.
|
74914 |
28-Mar-2001 |
jhb |
Catch up to header include changes: - <sys/mutex.h> now requires <sys/systm.h> - <sys/mutex.h> and <sys/sx.h> now require <sys/lock.h>
|
74670 |
23-Mar-2001 |
tmm |
Export intrnames and intrcnt as sysctls (hw.nintr, hw.intrnames and hw.intrcnt).
Approved by: rwatson
|
74237 |
14-Mar-2001 |
dillon |
Fix a lock reversal problem in the VM subsystem related to threaded programs. There is a case during a fork() which can cause a deadlock.
From Tor - The workaround that consists of setting a flag in the vm map that indicates that a fork is in progress and using that mark in the page fault handling to force a revalidation failure. That change will only affect (pessimize) page fault handling during fork for threaded (linuxthreads style) applications and applications using aio_*().
Submitted by: tegge
|
74235 |
14-Mar-2001 |
dillon |
Temporarily remove the vm_map_simplify() call from vm_map_insert(). The call is correct, but it interferes with the massive hack called vm_map_growstack(). The call will be restored after our stack handling code is fixed.
Reported by: tegge
|
74042 |
09-Mar-2001 |
iedowse |
When creating a shadow vm_object in vmspace_fork(), only one reference count was transferred to the new object, but both the new and the old map entries had pointers to the new object. Correct this by transferring the second reference.
This fixes a panic that can occur when mmap(2) is used with the MAP_INHERIT flag.
PR: i386/25603 Reviewed by: dillon, alc
|
73936 |
07-Mar-2001 |
jhb |
Unrevert the pmap_map() changes. They weren't broken on x86.
Sense beaten into me by: peter
|
73903 |
07-Mar-2001 |
jhb |
Back out the pmap_map() change for now, it isn't completely stable on the i386.
|
73862 |
06-Mar-2001 |
jhb |
- Rework pmap_map() to take advantage of direct-mapped segments on supported architectures such as the alpha. This allows us to save on kernel virtual address space, TLB entries, and (on the ia64) VHPT entries. pmap_map() now modifies the passed in virtual address on architectures that do not support direct-mapped segments to point to the next available virtual address. It also returns the actual address that the request was mapped to. - On the IA64 don't use a special zone of PV entries needed for early calls to pmap_kenter() during pmap_init(). This gets us in trouble because we end up trying to use the zone allocator before it is initialized. Instead, with the pmap_map() change, the number of needed PV entries is small enough that we can get by with a static pool that is used until pmap_init() is complete.
Submitted by: dfr Debugging help: peter Tested by: me
|
73534 |
04-Mar-2001 |
alfred |
Simplify vm_object_deallocate(), by decrementing the refcount first. This allows some of the conditionals to be combined.
|
73282 |
01-Mar-2001 |
gallatin |
Allocate vm_page_array and vm_page_buckets from the end of the biggest chunk of memory, rather than from the start.
This fixes problems allocating bouncebuffers on alphas where there is only 1 chunk of memory (unlike PCs where there is generally at least one small chunk and a large chunk). Having 1 chunk had been fatal, because these structures take over 13MB on a machine with 1GB of ram. This doesn't leave much room for other structures and bounce buffers if they're at the front.
Reviewed by: dfr, anderson@cs.duke.edu, silence on -arch Tested by: Yoriaki FUJIMORI <fujimori@grafin.fujimori.cache.waseda.ac.jp>
|
73212 |
28-Feb-2001 |
dillon |
If we intend to make the page writable without requiring another fault, make sure that PG_NOSYNC is properly set. Previously we only set it for a write-fault, but this can occur on a read-fault too. (will be MFCd prior to 4.3 freeze)
|
72949 |
23-Feb-2001 |
rwatson |
Introduce per-swap area accounting in the VM system, and export this information via the vm.nswapdev sysctl (number of swap areas) and vm.swapdevX nodes (where X is the device), which contain the MIBs dev, blocks, used, and flags. These changes are required to allow top and other userland swap-monitoring utilities to run without setgid kmem.
Submitted by: Thomas Moestl <tmoestl@gmx.net> Reviewed by: freebsd-audit
|
72888 |
22-Feb-2001 |
des |
Fix formatting bugs introduced in sysctl_vm_zone() by the previous commit. Also, if SYSCTL_OUT() returns a non-zero value, stop at once.
|
72376 |
12-Feb-2001 |
jake |
Implement a unified run queue and adjust priority levels accordingly.
- All processes go into the same array of queues, with different scheduling classes using different portions of the array. This allows user processes to have their priorities propagated up into interrupt thread range if need be. - I chose 64 run queues as an arbitrary number that is greater than 32. We used to have 4 separate arrays of 32 queues each, so this may not be optimal. The new run queue code was written with this in mind; changing the number of run queues only requires changing constants in runq.h and adjusting the priority levels. - The new run queue code takes the run queue as a parameter. This is intended to be used to create per-cpu run queues. Implement wrappers for compatibility with the old interface which pass in the global run queue structure. - Group the priority level, user priority, native priority (before propagation) and the scheduling class into a struct priority. - Change any hard coded priority levels that I found to use symbolic constants (TTIPRI and TTOPRI). - Remove the curpriority global variable and use that of curproc. This was used to detect when a process' priority had lowered and it should yield. We now effectively yield on every interrupt. - Activate propagate_priority(). It should now have the desired effect without needing to also propagate the scheduling class. - Temporarily comment out the call to vm_page_zero_idle() in the idle loop. It interfered with propagate_priority() because the idle process needed to do a non-blocking acquire of Giant and then other processes would try to propagate their priority onto it. The idle process should not do anything except idle. vm_page_zero_idle() will return in the form of an idle priority kernel thread which is woken up at appropriate times by the vm system. - Update struct kinfo_proc to the new priority interface. Deliberately change its size by adjusting the spare fields.
It remained the same size, but the layout has changed, so userland processes that use it would parse the data incorrectly. The size constraint should really be changed to an arbitrary version number. Also add a debug.sizeof sysctl node for struct kinfo_proc.
|
72200 |
09-Feb-2001 |
bmilekic |
Change and clean the mutex lock interface.
mtx_enter(lock, type) becomes:
mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)
Similarly, for releasing a lock, we now have:
mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN. We change the caller interface for the two different types of locks because the semantics are entirely different for each case, and this makes it explicitly clear and, at the same time, it rids us of the extra `type' argument.
The enter->lock and exit->unlock change has been made with the idea that we're "locking data" and not "entering locked code" in mind.
Further, remove all additional "flags" previously passed to the lock acquire/release routines with the exception of two:
MTX_QUIET and MTX_NOSWITCH
The functionality of these flags is preserved and they can be passed to the lock/unlock routines by calling the corresponding wrappers:
mtx_{lock, unlock}_flags(lock, flag(s)) and mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN locks, respectively.
Re-inline some lock acq/rel code; in the sleep lock case, we only inline the _obtain_lock()s in order to ensure that the inlined code fits into a cache line. In the spin lock case, we inline recursion and actually only perform a function call if we need to spin. This change has been made with the idea that we generally tend to avoid spin locks and that also the spin locks that we do have and are heavily used (i.e. sched_lock) do recurse, and therefore in an effort to reduce function call overhead for some architectures (such as alpha), we inline recursion for this case.
Create a new malloc type for the witness code and retire from using the M_DEV type. The new type is called M_WITNESS and is only declared if WITNESS is enabled.
Begin cleaning up some machdep/mutex.h code - specifically updated the "optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently need those.
Finally, caught up to the interface changes in all sys code.
Contributors: jake, jhb, jasone (in no particular order)
|
71999 |
04-Feb-2001 |
phk |
Mechanical change to use <sys/queue.h> macro API instead of fondling implementation details.
Created with: sed(1) Reviewed by: md5(1)
|
71983 |
04-Feb-2001 |
dillon |
This commit represents work mainly submitted by Tor and slightly modified by myself. It solves a serious vm_map corruption problem that can occur with the buffer cache when block sizes > 64K are used. This code has been heavily tested in -stable but only tested somewhat on -current. An MFC will occur in a few days. My additions include the vm_map_simplify_entry() and minor buffer cache boundary case fix.
Make the buffer cache use a system map for buffer cache KVM rather than a normal map.
Ensure that VM objects are not allocated for system maps. There were cases where a buffer map could wind up with a backing VM object -- normally harmless, but this could also result in the buffer cache blocking in places where it assumes no blocking will occur, possibly resulting in corrupted maps.
Fix a minor boundary case when the buffer cache size limit is reached that could result in non-optimal code.
Add vm_map_simplify_entry() calls to prevent 'creeping proliferation' of vm_map_entry's in the buffer cache's vm_map. Previously only a simple linear optimization was made. (The buffer vm_map typically has only a handful of vm_map_entry's. This stabilizes it at that level permanently).
PR: 20609 Submitted by: (Tor Egge) tegge
|
71610 |
25-Jan-2001 |
jhb |
- Doh, lock faultin() with proc lock in scheduler(). - Lock p_swtime with sched_lock in scheduler() as well.
|
71576 |
24-Jan-2001 |
jasone |
Convert all simplelocks to mutexes and remove the simplelock implementations.
|
71574 |
24-Jan-2001 |
jhb |
Argh, I didn't get this test right when I converted it. Break this up into two separate if's instead of nested if's. Also, reorder things slightly to avoid unnecessary mutex operations.
|
71572 |
24-Jan-2001 |
jhb |
- Catch up to proc flag changes. - Minimal proc locking. - Use queue macros.
|
71571 |
24-Jan-2001 |
jhb |
Add mtx_assert()'s to verify that kmem_alloc() and kmem_free() are called with Giant held.
|
71570 |
24-Jan-2001 |
jhb |
- Catch up to proc flag changes. - Proc locking in a few places. - faultin() now must be called with the proc lock held. - Split up swappable() into a couple of tests so that it can be locked in swapout_procs(). - Use queue macros.
|
71569 |
24-Jan-2001 |
jhb |
- Catch up to proc flag changes.
|
71512 |
24-Jan-2001 |
jhb |
Add missing include.
|
71429 |
23-Jan-2001 |
ume |
Add mibs to hold the number of forks since boot. New mibs are:
vm.stats.vm.v_forks
vm.stats.vm.v_vforks
vm.stats.vm.v_rforks
vm.stats.vm.v_kthreads
vm.stats.vm.v_forkpages
vm.stats.vm.v_vforkpages
vm.stats.vm.v_rforkpages
vm.stats.vm.v_kthreadpages
Submitted by: Paul Herman <pherman@frenchfries.net> Reviewed by: alfred
|
71408 |
23-Jan-2001 |
jake |
Sigh. atomic_add_int takes a pointer, not an integer.
Pointy-hat-to: des
|
71406 |
23-Jan-2001 |
des |
Use atomic operations to update the stat counters.
|
71362 |
22-Jan-2001 |
des |
Call vm_zone_init() at the appropriate time.
Reviewed by: jasone, jhb
|
71361 |
22-Jan-2001 |
des |
Give this code a major facelift:
- replace the simplelock in struct vm_zone with a mutex.
- use a proper SLIST rather than a hand-rolled job for the zone list.
- add a subsystem lock that protects the zone list and the statistics counters.
- merge _zalloc() into zalloc() and _zfree() into zfree(), and move them below _zget() so there's no need for a prototype.
- add two initialization functions: one which initializes the subsystem mutex and the zone list, and one that currently doesn't do anything.
- zap zerror(); use KASSERTs instead.
- dike out half of sysctl_vm_zone(), which was mostly trying to do manually what the snprintf() call could do better.
Reviewed by: jhb, jasone
|
71350 |
21-Jan-2001 |
des |
First step towards an MP-safe zone allocator: - have zalloc() and zfree() always lock the vm_zone. - remove zalloci() and zfreei(), which are now redundant.
Reviewed by: bmilekic, jasone
|
70480 |
29-Dec-2000 |
alfred |
fix comment which was outdated 3 years ago; remove useless assignment; purge entire file of 'register' keyword
|
70478 |
29-Dec-2000 |
alfred |
clean up kmem_suballoc(): remove useless assignment, remove 'register' variables
|
70390 |
27-Dec-2000 |
assar |
Make zalloc and zfree non-inline functions. This avoids having to have the code calling these be compiled with the same setting for INVARIANTS and SMP.
Reviewed by: dillon
|
70374 |
26-Dec-2000 |
dillon |
This implements a better launder limiting solution. There was a solution in 4.2-REL which I ripped out in -stable and -current when implementing the low-memory handling solution. However, maxlaunder turns out to be the saving grace in certain very heavily loaded systems (e.g. newsreader box). The new algorithm limits the number of pages laundered in the first pageout daemon pass. If that is not sufficient, successive passes will be run without any limit.
Write I/O is now pipelined using two sysctls, vfs.lorunningspace and vfs.hirunningspace. This prevents excessive buffered writes in the disk queues which cause long (multi-second) delays for reads. It leads to more stable (less jerky) and generally faster I/O streaming to disk by allowing required read ops (e.g. for indirect blocks and such) to occur without interrupting the write stream, among other things.
NOTE: eventually, filesystem write I/O pipelining needs to be done on a per-device basis. At the moment it is globalized.
|
70160 |
18-Dec-2000 |
phk |
Fix floppy drives on machines with lots of RAM.
The fix works by reversing the ordering of free memory so that the chances of contig_malloc() succeeding increase.
PR: 23291 Submitted by: Andrew Atrens <atrens@nortel.ca>
|
69972 |
13-Dec-2000 |
tanimura |
- If swap metadata does not fit into the KVM, reduce the number of struct swblock entries by dividing the number of the entries by 2 until the swap metadata fits.
- Reject swapon(2) upon failure of swap_zone allocation.
This is just a temporary fix. Better solutions include: (suggested by: dillon)
o reserving swap in SWAP_META_PAGES chunks, and
o swapping the swblock structures themselves.
Reviewed by: alfred, dillon
|
69947 |
13-Dec-2000 |
jake |
- Change the allproc_lock to use a macro, ALLPROC_LOCK(how), instead of explicit calls to lockmgr. Also provides macros for the flags passed to specify shared, exclusive or release which map to the lockmgr flags. This is so that the use of lockmgr can be easily replaced with optimized reader-writer locks. - Add some locking that I missed the first time.
|
69847 |
11-Dec-2000 |
dillon |
Be less conservative with a recently added KASSERT. Certain edge cases with file fragments and read-write mmap's can lead to a situation where a VM page has odd dirty bits, e.g. 0xFC - due to being dirtied by an mmap and only the fragment (representing a non-page-aligned end of file) synced via a filesystem buffer. A correct solution that guarantees consistent m->dirty for the file EOF case is being worked on. In the mean time we can't be so conservative in the KASSERT.
|
69781 |
08-Dec-2000 |
dwmalone |
Convert more malloc+bzero to malloc+M_ZERO.
Submitted by: josh@zipperup.org Submitted by: Robert Drehmel <robd@gmx.net>
|
69687 |
06-Dec-2000 |
alfred |
Really fix phys_pager:
Backout the previous delta (rev 1.4), it didn't make any difference.
If the requested handle is NULL then don't add it to the list of objects to be found by handle.
The problem is that when asking for a NULL handle you are implying you want a new object. Because objects with NULL handles were being added to the list, any further requests for phys backed objects with NULL handles would return a reference to the initial NULL handle object after finding it on the list.
Basically one couldn't have more than one phys backed object without a handle in the entire system without this fix. If you did more than one shared memory allocation using the phys pager it would give you your initial allocation again.
|
69641 |
05-Dec-2000 |
alfred |
Adjust the allocation size to properly deal with non-PAGE_SIZE allocations; specifically, allocations smaller than PAGE_SIZE, which the code previously did not handle properly.
|
69517 |
02-Dec-2000 |
bde |
Backed out previous commit. Don't depend on namespace pollution in <sys/buf.h>.
|
69516 |
02-Dec-2000 |
jhb |
Protect p_stat with sched_lock.
|
69509 |
02-Dec-2000 |
jhb |
Protect p_stat with sched_lock.
|
69399 |
30-Nov-2000 |
alfred |
remove unneeded sys/ucred.h includes
|
69022 |
22-Nov-2000 |
jake |
Protect the following with a lockmgr lock:
allproc
zombproc
pidhashtbl
proc.p_list
proc.p_hash
nextpid
Reviewed by: jhb Obtained from: BSD/OS and netbsd
|
68921 |
20-Nov-2000 |
rwatson |
o Export dmmax ("Maximum size of a swap block") using SYSCTL_INT. This removes a reason that systat requires setgid kmem. More to come.
|
68885 |
18-Nov-2000 |
dillon |
Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory situations prior to now.
The new code is based on the concept that I/O must be able to function in a low memory situation. All major modules related to I/O (except networking) have been adjusted to allow allocation out of the system reserve memory pool. These modules now detect a low memory situation but rather than block, they continue to operate and then return resources to the memory pool instead of caching them or leaving them wired.
Code has been added to stall in a low-memory situation prior to a vnode being locked.
Thus situations where a process blocks in a low-memory condition while holding a locked vnode have been reduced to near nothing. Not only will I/O continue to operate, but many prior deadlock conditions simply no longer exist.
Implement a number of VFS/BIO fixes
(found by Ian): in biodone(), bogus-page replacement code, the loop was not properly incrementing loop variables prior to a continue statement. We do not believe this code can be hit anyway but we aren't taking any chances. We'll turn the whole section into a panic (as it already is in brelse()) after the release is rolled.
In biodone(), the foff calculation was incorrectly clamped to the iosize, causing the wrong foff to be calculated for pages in the case of an I/O error or biodone() called without initiating I/O. The problem always caused a panic before. Now it doesn't. The problem is mainly an issue with NFS.
Fixed casts for ~PAGE_MASK. This code worked properly before only because the calculations use signed arithmetic. Better to properly extend PAGE_MASK first before inverting it for the 64 bit masking op.
In brelse(), the bogus_page fixup code was improperly throwing away the original contents of 'm' when it did the j-loop to fix the bogus pages. The result was that it would potentially invalidate parts of the *WRONG* page(!), leading to corruption.
There may still be cases where a background bitmap write is being duplicated, causing potential corruption. We have identified a potentially serious bug related to this but the fix is still TBD. So instead this patch contains a KASSERT to detect the problem and panic the machine rather than continue to corrupt the filesystem. The problem does not occur very often; it is very hard to reproduce, and it may or may not be the cause of the corruption people have reported.
Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>) Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
|
68884 |
18-Nov-2000 |
dillon |
Add the splvm()'s suggested in PR 20609 to protect vm_pager_page_unswapped(). The remainder of the PR is still open.
PR: kern/20609 (partial fix)
|
68883 |
18-Nov-2000 |
dillon |
This patchset fixes a large number of file descriptor race conditions. Pre-rfork code assumed inherent locking of a process's file descriptor array. However, with the advent of rfork() the file descriptor table could be shared between processes. This patch closes over a dozen serious race conditions related to one thread manipulating the table (e.g. closing or dup()ing a descriptor) while another is blocked in an open(), close(), fcntl(), read(), write(), etc...
PR: kern/11629 Discussed with: Alexander Viro <viro@math.psu.edu>
|
68261 |
02-Nov-2000 |
tegge |
Clear the MAP_ENTRY_USER_WIRED flag from cloned vm_map entries. PR: 2840
|
67885 |
29-Oct-2000 |
phk |
Weaken a bogus dependency on <sys/proc.h> in <sys/buf.h> by #ifdef'ing the offending inline function (BUF_KERNPROC) on it being #included already.
I'm not sure BUF_KERNPROC() is even the right thing to do or in the right place or implemented the right way (inline vs normal function).
Remove consequently unneeded #includes of <sys/proc.h>
|
67536 |
25-Oct-2000 |
jhb |
- Catch a machine/mutex.h -> sys/mutex.h I somehow missed. - Close a small race condition. The sched_lock mutex protects p->p_stat as well as the run queues. Another CPU could change p_stat of the process while we are waiting for the lock, and we would end up scheduling a process that isn't runnable.
|
67247 |
17-Oct-2000 |
ps |
Implement write combining for crashdumps. This is useful when write caching is disabled on both SCSI and IDE disks where large memory dumps could take up to an hour to complete.
Taking an i386 scsi based system with 512MB of ram and timing (in seconds) how long it took to complete a dump, the following results were obtained:
Before:                 After:
WCE    TIME             WCE    TIME
------------------      ------------------
1      141.820972       1      15.600111
0      797.265072       0      65.480465
Obtained from: Yahoo! Reviewed by: peter
|
67082 |
13-Oct-2000 |
dillon |
The swap bitmap allocator was not calculating the bitmap size properly in the face of non-stripe-aligned swap areas. The bug could cause a panic during boot.
Refuse to configure a swap area that is too large (67 GB or so)
Properly document the power-of-2 requirement for SWB_NPAGES.
The patch is slightly different from the one Tor enclosed in the PR, but accomplishes the same thing.
PR: kern/20273 Submitted by: Tor.Egge@fast.no
|
67046 |
12-Oct-2000 |
jasone |
For lockmgr mutex protection, use an array of mutexes that are allocated and initialized during boot. This avoids bloating sizeof(struct lock). As a side effect, it is no longer necessary to enforce the assumption that lockinit()/lockdestroy() calls are paired, so the LK_VALID flag has been removed.
Idea taken from: BSD/OS.
|
66748 |
06-Oct-2000 |
dwmalone |
If a process is over its resource limit for datasize, still allow it to lower its memory usage. This was mentioned on the mailing lists ages ago, and I've lost the name of the person who brought it up.
Reviewed by: alc
|
66615 |
04-Oct-2000 |
jasone |
Convert lockmgr locks from using simple locks to using mutexes.
Add lockdestroy() and appropriate invocations, which corresponds to lockinit() and must be called to clean up after a lockmgr lock is no longer needed.
|
65904 |
15-Sep-2000 |
jhb |
- Add a new process flag P_NOLOAD that marks a process that should be ignored during load average calculations. - Set this flag for the idle processes and the softinterrupt process.
|
65770 |
12-Sep-2000 |
bp |
Add three new VOPs: VOP_CREATEVOBJECT, VOP_DESTROYVOBJECT and VOP_GETVOBJECT. They will be used by nullfs and other stacked filesystems to support full cache coherency.
Reviewed in general by: mckusick, dillon
|
65557 |
07-Sep-2000 |
jasone |
Major update to the way synchronization is done in the kernel. Highlights include:
* Mutual exclusion is used instead of spl*(). See mutex(9). (Note: The alpha port is still in transition and currently uses both.)
* Per-CPU idle processes.
* Interrupts are run in their own separate kernel threads and can be preempted (i386 only).
Partially contributed by: BSDi (BSD/OS) Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh
|
65103 |
26-Aug-2000 |
obrien |
Make the arguments match the functionality of the functions.
|
63973 |
28-Jul-2000 |
peter |
Minor cleanups: - remove unused variables (fix warnings) - use a more consistent ansi style rather than a mixture - remove dead #if 0 code and declarations
|
63897 |
26-Jul-2000 |
mckusick |
Clean up the snapshot code so that it no longer depends on the use of the SF_IMMUTABLE flag to prevent writing. Instead put in explicit checking for the SF_SNAPSHOT flag in the appropriate places. With this change, it is now possible to rename and link to snapshot files. It is also possible to set or clear any of the owner, group, or other read bits on the file, though none of the write or execute bits can be set. There is also an explicit test to prevent the setting or clearing of the SF_SNAPSHOT flag via chflags() or fchflags(). Note also that the modify time cannot be changed as it needs to accurately reflect the time that the snapshot was taken.
Submitted by: Robert Watson <rwatson@FreeBSD.org>
|
62976 |
11-Jul-2000 |
mckusick |
Add snapshots to the fast filesystem. Most of the changes support the gating of system calls that cause modifications to the underlying filesystem. The gating can be enabled by any filesystem that needs to consistently suspend operations by adding the vop_stdgetwritemount to their set of vnops. Once gating is enabled, the function vfs_write_suspend stops all new write operations to a filesystem, allows any filesystem modifying system calls already in progress to complete, then sync's the filesystem to disk and returns. The function vfs_write_resume allows the suspended write operations to begin again. Gating is not added by default for all filesystems as for SMP systems it adds two extra locks to such critical kernel paths as the write system call. Thus, gating should only be added as needed.
Details on the use and current status of snapshots in FFS can be found in /sys/ufs/ffs/README.snapshot so for brevity and timeliness it is not included here. Unless and until you create a snapshot file, these changes should have no effect on your system (famous last words).
|
62941 |
11-Jul-2000 |
alfred |
#elsif -> #elif
Noticed by: green
|
62622 |
05-Jul-2000 |
jhb |
Support for unsigned integer and long sysctl variables. Update the SYSCTL_LONG macro to be consistent with other integer sysctl variables and require an initial value instead of assuming 0. Update several sysctl variables to use the unsigned types.
PR: 15251 Submitted by: Kelly Yancey <kbyanc@posi.net>
|
62573 |
04-Jul-2000 |
phk |
Previous commit changing SYSCTL_HANDLER_ARGS violated KNF.
Pointed out by: bde
|
62568 |
04-Jul-2000 |
jhb |
Replace the PQ_*CACHE options with a single PQ_CACHESIZE option that you set equal to the number of kilobytes in your cache. The old options are still supported for backwards compatibility.
Submitted by: Kelly Yancey <kbyanc@posi.net>
|
62552 |
04-Jul-2000 |
mckusick |
Simplify and rationalise the management of the vnode free list (preparing the code to add snapshots).
|
62454 |
03-Jul-2000 |
phk |
Style police catches up with rev 1.26 of src/sys/sys/sysctl.h:
Sanitize SYSCTL_HANDLER_ARGS so that simplistic tools can grok our sources:
-sysctl_vm_zone SYSCTL_HANDLER_ARGS +sysctl_vm_zone (SYSCTL_HANDLER_ARGS)
|
62067 |
25-Jun-2000 |
markm |
Nifty idea from Jeroen van Gelderen; don't call a routine to check if we are using the /dev/zero device, just check a flag (supplied by /dev/zero). Reviewed by: dfr
|
61272 |
05-Jun-2000 |
hsu |
Add missing increment of allocation counter.
|
61081 |
29-May-2000 |
dillon |
This is a cleanup patch to Peter's new OBJT_PHYS VM object type and sysv shared memory support for it. It implements a new PG_UNMANAGED flag that has slightly different characteristics from PG_FICTITIOUS.
A new sysctl, kern.ipc.shm_use_phys has been added to enable the use of physically-backed sysv shared memory rather than swap-backed. Physically backed shm segments are not tracked with PV entries, allowing programs which use a large shm segment as a rendezvous point to operate without eating an insane amount of KVM in the PV entry management. Read: Oracle.
Peter's OBJT_PHYS object will also allow us to eventually implement page-table sharing and/or 4MB physical page support for such segments. We're half way there.
|
61074 |
29-May-2000 |
dfr |
Brucify the pmap_enter_temporary() changes.
|
61058 |
29-May-2000 |
dillon |
Fix bug in vm_pageout_page_stats() that always resulted in a full scan of the active queue. This fix is not expected to have any noticeable impact on performance.
Noticed by: Rik van Riel <riel@conectiva.com.br>
|
61036 |
28-May-2000 |
dfr |
Add a new pmap entry point, pmap_enter_temporary() to be used during dumps to create temporary page mappings. This replaces the use of CADDR1 which is fairly x86 specific.
Reviewed by: dillon
|
60938 |
26-May-2000 |
jake |
Back out the previous change to the queue(3) interface. It was not discussed and should probably not happen.
Requested by: msmith and others
|
60833 |
23-May-2000 |
jake |
Change the way that the queue(3) structures are declared; don't assume that the type argument to *_HEAD and *_ENTRY is a struct.
Suggested by: phk Reviewed by: phk Approved by: mdodd
|
60757 |
21-May-2000 |
peter |
Checkpoint of a new physical memory backed object type, that does not have pv_entries. This is intended for very special circumstances, eg: a certain database that has a 1GB shm segment mapped into 300 processes. That would consume 2GB of kvm just to hold the pv_entries alone. This would not be used on systems unless the physical ram was available, as it's not pageable.
This is a work-in-progress, but is a useful and functional checkpoint. Matt has got some more fixes for it that will be committed soon.
Reviewed by: dillon
|
60755 |
21-May-2000 |
peter |
Implement an optimization of the VM<->pmap API. Pass vm_page_t's directly to various pmap_*() functions instead of looking up the physical address and passing that. In many cases, the first thing the pmap code was doing was going to a lot of trouble to get back the original vm_page_t, or its shadow pv_table entry.
Inspired by: John Dyson's 1998 patches.
Also: Eliminate pv_table as a separate thing and build it into a machine dependent part of vm_page_t. This eliminates having a separate set of structures that shadow each other in a 1:1 fashion that we often went to a lot of trouble to translate from one to the other. (see above) This happens to save 4 bytes of physical memory for each page in the system. (8 bytes on the Alpha).
Eliminate the use of the phys_avail[] array to determine if a page is managed (ie: it has pv_entries etc). Store this information in a flag. Things like device_pager set it because they create vm_page_t's on the fly that do not have pv_entries. This makes it easier to "unmanage" a page of physical memory (this will be taken advantage of in subsequent commits).
Add a function to add a new page to the freelist. This could be used for reclaiming the previously wasted pages left over from preloaded loader(8) files.
Reviewed by: dillon
|
60557 |
14-May-2000 |
dillon |
Fixed bug in madvise() / MADV_WILLNEED. When the request is offset from the base of the first map_entry the call to pmap_object_init_pt() uses the wrong start VA. MFC to follow.
PR: i386/18095
|
60041 |
05-May-2000 |
phk |
Separate the struct bio related stuff out of <sys/buf.h> into <sys/bio.h>.
<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall not be made a nested include according to bde's teachings on the subject of nested includes.
Diskdrivers and similar stuff below specfs::strategy() should no longer need to include <sys/buf.h> unless they need caching of data.
Still a few bogus uses of struct buf to track down.
Repocopy by: peter
|
59915 |
03-May-2000 |
phk |
Convert the vm_pager_strategy() interface to take a struct bio instead of a struct buf. Don't try to examine B_ASYNC, it is a layering violation to do so. The only current user of this interface is vn(4) which, since it emulates a disk interface, operates on struct bio already.
|
59866 |
01-May-2000 |
phk |
Move and staticize the bufchain functions so they become local to the only piece of code using them. This will ease a rewrite of them.
|
59794 |
30-Apr-2000 |
phk |
Remove unneeded #include <vm/vm_zone.h>
Generated by: src/tools/tools/kerninclude
|
59496 |
22-Apr-2000 |
wollman |
Implement POSIX.1b shared memory objects. In this implementation, shared memory objects are regular files; the shm_open(3) routine uses fcntl(2) to set a flag on the descriptor which tells mmap(2) to automatically apply MAP_NOSYNC.
Not objected to by: bde, dillon, dufault, jasone
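The shm_open(3) interface described above is the standard POSIX one; a minimal userland sketch (Python's multiprocessing.shared_memory wraps shm_open/shm_unlink, with an auto-generated name where C code would pass one explicitly):

```python
from multiprocessing import shared_memory

# Create a POSIX shared memory object (wraps shm_open(3) under the hood).
shm = shared_memory.SharedMemory(create=True, size=4096)
try:
    shm.buf[:5] = b"hello"      # stores go straight to the shared pages
    data = bytes(shm.buf[:5])   # another process attaching by shm.name sees this
finally:
    shm.close()
    shm.unlink()                # shm_unlink(): remove the name
```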
|
59395 |
19-Apr-2000 |
alc |
vm_object_shadow: Remove an incorrect assertion. In obscure circumstances vm_object_shadow can be called on an object with ref_count > 1 and OBJ_ONEMAPPING set. This isn't really a problem for vm_object_shadow.
|
59368 |
18-Apr-2000 |
phk |
Remove unneeded <sys/buf.h> includes.
Due to some interesting cpp tricks in lockmgr, the LINT kernel shrinks by 924 bytes.
|
59249 |
15-Apr-2000 |
phk |
Complete the bio/buf divorce for all code below devfs::strategy
Exceptions: Vinum untouched. This means that it cannot be compiled. Greg Lehey is on the case.
CCD not converted yet, casts to struct buf (still safe)
atapi-cd casts to struct buf to examine B_PHYS
|
59017 |
04-Apr-2000 |
msmith |
Fix _zget() so that it checks the return from kmem_alloc(), to avoid attempting to bzero NULL when the kernel map fills up. _zget() will now return NULL as it seems it was originally intended to do.
|
58934 |
02-Apr-2000 |
phk |
Move B_ERROR flag to b_ioflags and call it BIO_ERROR.
(Much of this done by script)
Move B_ORDERED flag to b_ioflags and call it BIO_ORDERED.
Move b_pblkno and b_iodone_chain to struct bio while we transition, they will be obsoleted once bio structs chain/stack.
Add bio_queue field for struct bio aware disksort.
Address a lot of stylistic issues brought up by bde.
|
58708 |
27-Mar-2000 |
dillon |
Add necessary spl protection for swapper. The problem was located by Alfred while testing his SPLASSERT stuff. This is not a complete fix, more protections are probably needed.
|
58705 |
27-Mar-2000 |
charnier |
Revert spelling mistake I made in the previous commit. Requested by: Alan and Bruce
|
58634 |
26-Mar-2000 |
charnier |
Spelling
|
58462 |
22-Mar-2000 |
phk |
Fix one place which knew that B_WRITE was zero.
Fix a stylistic mistake of mine while here.
Found by: Stephen Hocking <shocking@prth.pgs.com>
|
58349 |
20-Mar-2000 |
phk |
Rename the existing BUF_STRATEGY() to DEV_STRATEGY()
substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo)
substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo)
This patch is machine generated except for the ccd.c and buf.h parts.
|
58345 |
20-Mar-2000 |
phk |
Remove B_READ, B_WRITE and B_FREEBUF and replace them with a new field in struct buf: b_iocmd. The b_iocmd is enforced to have exactly one bit set.
B_WRITE was bogusly defined as zero giving rise to obvious coding mistakes.
Also eliminate the redundant struct buf flag B_CALL, it can just as efficiently be done by comparing b_iodone to NULL.
Should you get a panic or drop into the debugger, complaining about "b_iocmd", don't continue. It is likely to write on your disk where it should have been reading.
This change is a step in the direction towards a stackable BIO capability.
A lot of this patch were machine generated (Thanks to style(9) compliance!)
Vinum users: Greg has not had time to test this yet, be careful.
|
58132 |
16-Mar-2000 |
phk |
Eliminate the undocumented, experimental, non-delivering and highly dangerous MAX_PERF option.
|
57975 |
13-Mar-2000 |
phk |
Remove unused 3rd argument from vsunlock() which abused B_WRITE.
|
57550 |
28-Feb-2000 |
ps |
Add MAP_NOCORE to mmap(2), and MADV_NOCORE and MADV_CORE to madvise(2). This feature allows you to specify if mmap'd data is included in an application's corefile.
Change the type of eflags in struct vm_map_entry from u_char to vm_eflags_t (an unsigned int).
Reviewed by: dillon,jdp,alfred Approved by: jkh
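From userland this knob is just an madvise(2) call; a hedged sketch (MADV_NOCORE is FreeBSD-specific, so the constant is probed defensively rather than assumed):

```python
import mmap

m = mmap.mmap(-1, mmap.PAGESIZE)              # anonymous mapping
advice = getattr(mmap, "MADV_NOCORE", None)   # exposed only where the OS has it
if advice is not None:
    m.madvise(advice)                         # exclude these pages from corefiles
supported = advice is not None
m.close()
```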
|
57263 |
16-Feb-2000 |
dillon |
Fix null-pointer dereference crash when the system is intentionally run out of KVM through a mmap()/fork() bomb that allocates hundreds of thousands of vm_map_entry structures.
Add panic to make null-pointer dereference crash a little more verbose.
Add a new sysctl, vm.max_proc_mmap, which specifies the maximum number of mmap()'d spaces (discrete vm_map_entry's in the process). The value defaults to around 9000 for a 128MB machine. The test is scaled for the number of processes sharing a vmspace (aka linux threads). Setting the value to 0 disables the feature.
PR: kern/16573 Approved by: jkh
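The "mmap()'d spaces" being counted are discrete mappings, one per non-coalescable mmap() call; a sketch (the count 64 is arbitrary, purely illustrative of how an mmap bomb multiplies map entries):

```python
import mmap

# Each separate anonymous mmap() here produces a discrete mapping in the
# process address space (on FreeBSD, a vm_map_entry); vm.max_proc_mmap
# caps roughly this count to stop a mmap()/fork() bomb exhausting KVM.
maps = [mmap.mmap(-1, mmap.PAGESIZE) for _ in range(64)]
count = len(maps)
for m in maps:
    m.close()
```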
|
56599 |
25-Jan-2000 |
dillon |
The swapdev_vp changes made to rip out the swap specfs interaction also broke diskless swapping. Moving the swapdev_vp initialization to more commonly run code solves the problem.
PR: kern/16165 Additional testing by: David Gilbert <dgilbert@velocet.ca>
|
56378 |
21-Jan-2000 |
dillon |
Fix a deadlock between msync(..., MS_INVALIDATE) and vm_fault. The invalidation code cannot wait for paging to complete while holding a vnode lock, so we don't wait. Instead we simply allow the lower-level code to block on any busy pages it encounters. I think Yahoo may be the only entity in the entire world that actually uses this msync feature :-).
Bug reported by: Paul Saab <paul@mu.org>
|
55756 |
10-Jan-2000 |
phk |
Give vn_isdisk() a second argument where it can return a suitable errno.
Suggested by: bde
|
55351 |
03-Jan-2000 |
guido |
Use MAP_NOSYNC for vnodes without any links in their filesystem.
This is necessary for vmware: it does not use an anonymous mmap for the memory of the virtual system. Instead it creates a temp file and unlinks it. For a 50 MB file, this results in a lot of syncing every 30 seconds.
Reviewed by: Matthew Dillon <dillon@backplane.com>
|
55206 |
29-Dec-1999 |
peter |
Change #ifdef KERNEL to #ifdef _KERNEL in the public headers. "KERNEL" is an application space macro and the applications are supposed to be free to use it as they please (but cannot). This is consistent with the other BSD's who made this change quite some time ago. More commits to come.
|
55175 |
28-Dec-1999 |
peter |
Fix the swap backed vn case - this was broken by my rev 1.128 to swap_pager.c and related commits.
Essentially swap_pager.c is backed out to before the changes, but swapdev_vp is converted into a real vnode with just VOP_STRATEGY(). It no longer abuses specfs vnops and no longer needs a dev_t and /dev/drum (or /dev/swapdev) for the intermediate layer.
This essentially restores the vnode interface as the interface to the bottom of the swap pager, and vm_swap.c provides a clean vnode interface.
This will need to be revisited when we swap to files (vnodes) - which is the other reason for keeping the vnode interface between the swap pager and the swap devices.
OK'ed by: dillon
|
54655 |
15-Dec-1999 |
eivind |
Introduce NDFREE (and remove VOP_ABORTOP)
|
54467 |
12-Dec-1999 |
dillon |
Add MAP_NOSYNC feature to mmap(), and MADV_NOSYNC and MADV_AUTOSYNC to madvise().
This feature prevents the update daemon from gratuitously flushing dirty pages associated with a mapped file-backed region of memory. The system pager will still page the memory as necessary and the VM system will still be fully coherent with the filesystem. Modifications made by other means to the same area of memory, for example by write(), are unaffected. The feature works on a page-granularity basis.
MAP_NOSYNC allows one to use mmap() to share memory between processes without incurring any significant filesystem overhead, putting it in the same performance category as SysV shared memory and anonymous memory.
Reviewed by: julian, alc, dg
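MAP_NOSYNC itself is FreeBSD-only, but the behavior it modifies can be sketched portably: dirty a page of a shared file mapping, then write it back explicitly. The flush() below stands in for the msync() that still works under MAP_NOSYNC; the temp file is a stand-in for any mapped file-backed region:

```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"\0" * mmap.PAGESIZE)
m = mmap.mmap(fd, mmap.PAGESIZE, flags=mmap.MAP_SHARED)
m[:4] = b"data"   # dirty a page; under MAP_NOSYNC the update daemon
                  # would skip it, but the mapping stays coherent
m.flush()         # explicit msync()-style write-back still works
m.close()
with open(path, "rb") as f:
    first = f.read(4)
os.close(fd)
os.unlink(path)
```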
|
54444 |
11-Dec-1999 |
eivind |
Lock reporting and assertion changes. * lockstatus() and VOP_ISLOCKED() gets a new process argument and a new return value: LK_EXCLOTHER, when the lock is held exclusively by another process. * The ASSERT_VOP_(UN)LOCKED family is extended to use what this gives them * Extend the vnode_if.src format to allow more exact specification than locked/unlocked.
This commit should not do any semantic changes unless you are using DEBUG_VFS_LOCKS.
Discussed with: grog, mch, peter, phk Reviewed by: peter
|
54188 |
06-Dec-1999 |
luoqi |
User ldt sharing.
|
53899 |
29-Nov-1999 |
phk |
Report swapdevices as cdevs rather than bdevs.
Remove unused dev2budev() function.
|
53701 |
25-Nov-1999 |
alc |
Remove nonsensical vm_map_{clear,set}_recursive() calls from vm_map_pageable(). At the point they are called, vm_map_pageable() holds a read (or shared) lock on the map. The purpose of vm_map_{clear,set}_recursive() is to disable/enable repeated write (or exclusive) lock requests by the same process.
|
53627 |
23-Nov-1999 |
alc |
Correct the following error: vm_map_pageable() on a COW'ed (post-fork) vm_map always failed because vm_map_lookup() looked at "vm_map_entry->wired_count" instead of "(vm_map_entry->eflags & MAP_ENTRY_USER_WIRED)". The effect was that many page wiring operations by sysctl were (silently) failing.
|
53594 |
22-Nov-1999 |
phk |
Isolate the swapdev_vp "not quite" vnode in the only source file which needs it now that /dev/drum is gone.
Reviewed by: eivind, peter
|
53338 |
18-Nov-1999 |
peter |
Remove the non-functional "swap device" userland front-end to the multiplexed underlying swap devices (/dev/drum). The only thing it did was to allow root to open /dev/drum, but not do anything with it. Various utilities used to grovel around in here, but Matt has written a much nicer (and clean) front-end to this for libkvm, and nothing uses the old system any more.
The VM system was calling VOP_STRATEGY() on the vp of the first underlying swap device (not the /dev/drum one, the first real device), and using the VOP system to indirectly (and only) call swstrategy() to choose an underlying device and enqueue it on that device. I have changed it to avoid diverting through the VOP system and to call the only possible target directly, saving a little bit of time and some complexity.
In all, nothing much changes, except some scaffolding to support the roundabout way of calling swstrategy() is gone.
Matt gave me the ok to do this some time ago, and I apologize for taking so long to get around to it.
|
53074 |
10-Nov-1999 |
alc |
Two changes: (1) Use vm_page_unqueue_nowakeup in vm_page_alloc instead of duplicating the code. (2) If a wired page is passed to vm_page_free_toq, panic instead of printing a friendly warning. (If we don't panic here, we'll just panic later in vm_page_unwire obscuring the problem.)
|
52974 |
08-Nov-1999 |
alc |
Remove unused declarations.
|
52973 |
07-Nov-1999 |
alc |
Remove unused #include's.
Submitted by: phk
|
52960 |
07-Nov-1999 |
alc |
The functions declared by this header file no longer exist.
Submitted by: phk (in part)
|
52649 |
30-Oct-1999 |
alc |
Reverse the sense of the test in the KASSERT's from the last commit.
|
52647 |
30-Oct-1999 |
alc |
The core of this patch is to vm/vm_page.h. The effects are two-fold: (1) to eliminate an extra (useless) level of indirection in half of the page queue accesses and (2) to use a single name for each queue throughout, instead of, e.g., "vm_page_queue_active" in some places and "vm_page_queues[PQ_ACTIVE]" in others.
Reviewed by: dillon
|
52644 |
30-Oct-1999 |
phk |
Change useracc() and kernacc() to use VM_PROT_{READ|WRITE|EXECUTE} for the "rw" argument, rather than hijacking B_{READ|WRITE}.
Fix two bugs (physio & cam) resulting by the confusion caused by this.
Submitted by: Tor.Egge@fast.no Reviewed by: alc, ken (partly)
|
52635 |
29-Oct-1999 |
phk |
useracc() the prequel:
Merge the contents (less some trivial, bordering-on-the-silly comments) of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts the #defines for the vm_inherit_t and vm_prot_t types next to their typedefs.
This paves the road for the commit to follow shortly: change useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE} as argument.
|
52617 |
29-Oct-1999 |
alc |
Remove the last vestiges of "vm_map_t phys_map". It's been unused since i386/i386/machdep.c rev 1.45 (or 1994 :-) ).
|
52568 |
27-Oct-1999 |
alc |
Shrink "struct vm_object" by not spending a full 32 bits on "objtype_t".
|
52035 |
08-Oct-1999 |
phk |
Fix a panic(8) implementation: 'hexdump -C < /dev/drum' caused a panic. Fix by simply refusing to do I/O from userland. I'm not sure we even need /dev/drum anymore; it seems to have been broken this way for a long time.
|
51930 |
04-Oct-1999 |
phk |
Introduce swopen to prevent blockdevice opens and insist on minor==0.
|
51928 |
04-Oct-1999 |
phk |
Give the swap device a D_DISK flag against my better judgement.
TODO: add an open routine which fails for bdev opens.
|
51810 |
30-Sep-1999 |
dt |
Plug an accounting leak: count pages in ZONE_INTERRUPT zones as wired.
|
51658 |
25-Sep-1999 |
phk |
Remove five now unused fields from struct cdevsw. They should never have been there in the first place. A GENERIC kernel shrinks almost 1k.
Add a slightly different safetybelt under nostop for tty drivers.
Add some missing FreeBSD tags
|
51493 |
21-Sep-1999 |
dillon |
cleanup madvise code, add a few more sanity checks.
Reviewed by: Alan Cox <alc@cs.rice.edu>, dg@root.com
|
51488 |
21-Sep-1999 |
dillon |
Final commit to remove vnode->v_lastr. vm_fault now handles read clustering issues (replacing code that used to be in ufs/ufs/ufs_readwrite.c). vm_fault also now uses the new VM page counter inlines.
This completes the changeover from vnode->v_lastr to vm_entry_t->v_lastr for VM, and fp->f_nextread and fp->f_seqcount (which have been in the tree for a while). Determination of the I/O strategy (sequential, random, and so forth) is now handled on a descriptor-by-descriptor basis for base I/O calls, and on a memory-region-by-memory-region and process-by-process basis for VM faults.
Reviewed by: David Greenman <dg@root.com>, Alan Cox <alc@cs.rice.edu>
|
51474 |
20-Sep-1999 |
dillon |
Fix bug in pipe code relating to writes of mmap'd but illegal address spaces which cross a segment boundary in the page table. pmap_kextract() is not designed for access to the user space portion of the page table and cannot handle the null-page-directory-entry case.
The fix is to have vm_fault_quick() return a success or failure which is then used to avoid calling pmap_kextract().
|
51343 |
17-Sep-1999 |
dillon |
Remove inappropriate VOP_FSYNC from vm_object_page_clean(). The fsync syncs the entire underlying file rather then just the requested range, resulting in huge inefficiencies when the VM system is articulated in a certain way. The VOP_FSYNC was also found to massively reduce NFS performance in certain cases.
Change MADV_DONTNEED and MADV_FREE to call vm_page_dontneed() instead of vm_page_deactivate(). Using vm_page_deactivate() causes all inactive and cache pages to be recycled before the dontneed/free page is recycled, effectively flushing our entire VM inactive & cache queues continuously even if only a few pages are being actively MADV free'd and reused (such as occurs with a sequential scan of a memory-mapped file).
Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>
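The userland side of this is just madvise(2); a sketch issuing MADV_DONTNEED on an anonymous region (the constant is widely available, though the exact reclaim semantics differ between FreeBSD's dontneed and other systems'):

```python
import mmap

m = mmap.mmap(-1, 4 * mmap.PAGESIZE)
m[:3] = b"abc"                   # touch a page so there is something resident
m.madvise(mmap.MADV_DONTNEED)    # hint: these pages won't be needed soon, so
                                 # the VM may reclaim them cheaply
size = len(m)                    # the mapping itself remains valid
m.close()
```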
|
51342 |
17-Sep-1999 |
dillon |
Add 'lastr' field to vm_map_entry in preparation for its removal from the vnode. (The changeover is undergoing final testing and will be committed soon).
Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>
|
51340 |
17-Sep-1999 |
dillon |
The vnode pager (used when you do file-backed mmaps) must use the underlying physical sector size when aligning I/O transfer sizes. It cannot assume 512 bytes.
We assume the underlying sector size is a power of 2. If it isn't, mmap() will break badly anyway (in the same way mmap broke with NFS when NFS tried to cache piecemeal write ranges in buffers, before we enforced read-buffer-before-write-piecemeal for NFS).
Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>
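The power-of-two assumption is what makes sector alignment pure mask arithmetic; a sketch (the helper names are mine, not the pager's):

```python
def align_down(x, sector):
    # The pager assumes the sector size is a power of two, so masking works.
    assert sector & (sector - 1) == 0
    return x & ~(sector - 1)

def align_up(x, sector):
    assert sector & (sector - 1) == 0
    return (x + sector - 1) & ~(sector - 1)
```

For example, a transfer covering bytes 1300..1304 on 512-byte sectors must be widened to the enclosing sector boundaries, 1024 and 1536.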
|
51339 |
17-Sep-1999 |
dillon |
Fix a number of spl bugs related to reserving and freeing swap space. Swap space can be freed from an interrupt and so swap reservation and freeing must occur at splvm.
Add swap_pager_reserve() code to support a new swap pre-reservation capability for the VN device.
Generally cleanup the swap code by simplifying the swp_pager_meta_build() static function and consolidating the SWAPBLK_NONE test from a bit test to an absolute compare. The bit test was left over from a rejected swap allocation scheme that was not ultimately committed. A few other minor cleanups were also made.
Reorganize the swap strategy code, again for VN support, to not reallocate swap when writing as this messes up pre-reservation and can fragment I/O unnecessarily as VN-based disk is messed around with.
Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>
|
51338 |
17-Sep-1999 |
dillon |
Add required BUF_KERNPROC to flushchainbuf() to disassociate the current process from the exclusive lock prior to initiating I/O.
This fixes a panic related to swap-backed VN disks
Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>
|
51337 |
17-Sep-1999 |
dillon |
Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>
Replace various VM related page count calculations strewn over the VM code with inlines to aid in readability and to reduce fragility in the code where modules depend on the same test being performed to properly sleep and wakeup.
Split out a portion of the page deactivation code into an inline in vm_page.c to support vm_page_dontneed().
add vm_page_dontneed(), which handles the madvise MADV_DONTNEED feature in a related commit coming up for vm_map.c/vm_object.c. This code prevents degenerate cases where an essentially active page may be rotated through a subset of the paging lists, resulting in premature disposal.
|
50477 |
28-Aug-1999 |
peter |
$Id$ -> $FreeBSD$
|
50405 |
26-Aug-1999 |
phk |
Simplify the handling of VCHR and VBLK vnodes using the new dev_t:
Make the alias list a SLIST.
Drop the "fast recycling" optimization of vnodes (including the returning of a preexisting but stale vnode from checkalias). It doesn't buy us anything now that we don't hardlimit vnodes anymore.
Rename checkalias2() and checkalias() to addalias() and addaliasu() - which takes dev_t and udev_t arg respectively.
Make the revoke syscalls use vcount() instead of VALIASED.
Remove VALIASED flag, we don't need it now and it is faster to traverse the much shorter lists than to maintain the flag.
vfs_mountedon() can check the dev_t directly, all the vnodes point to the same one.
Print the devicename in specfs/vprint().
Remove a couple of stale LFS vnode flags.
Remove unimplemented/unused LK_DRAINED;
|
50301 |
24-Aug-1999 |
green |
When the SYSINIT() was removed, it was replaced with a make_dev on-demand creation of /dev/drum via calling swapon. However, the make_dev has a bogus (insofar as it hasn't been added yet) cdevsw, so later we end up crashing with a null pointer dereference on the swap vp's specinfo. The specinfo points to a dev_t with a major of 254 (uninitialized), and we get a crash on its d_strategy being called.
The simple solution to this is to call cdevsw_add before the make_dev is ever used. This fixes the panic which occurred upon swapping.
|
50269 |
23-Aug-1999 |
bde |
Use devtoname to print dev_t's instead of casting them to u_long for misprinting with %lx.
Cast pointers to intptr_t instead of casting them to long. Cosmetic.
|
50254 |
23-Aug-1999 |
phk |
Convert DEVFS hooks in (most) drivers to make_dev().
Diskslice/label code not yet handled.
Vinum, i4b, alpha, pc98 not dealt with (left to respective Maintainers)
Add the correct hook for devfs to kern_conf.c
The net result of this exercise is that a lot less files depends on DEVFS, and devtoname() gets more sensible output in many cases.
A few drivers had minor additional cleanups performed relating to cdevsw registration.
A few drivers don't register a cdevsw{} anymore, but only use make_dev().
|
50248 |
23-Aug-1999 |
alc |
Correct the inconsistent formatting in struct vm_map.
Addendum to rev 1.47: submitted by dillon.
|
50247 |
23-Aug-1999 |
alc |
struct vm_map: The lock structure cannot be the first element of the vm_map because this can result in livelock between two or more system processes trying to kmem_alloc_wait.
|
50136 |
22-Aug-1999 |
alc |
Remove two unused variable declarations.
|
50075 |
20-Aug-1999 |
alc |
vm_page_alloc and contigmalloc1: Verify that free pages are not dirty.
Submitted by: dillon
|
50034 |
19-Aug-1999 |
peter |
Update for run queue code.
|
49998 |
18-Aug-1999 |
mjacob |
Fix breakage - an extra brace got inserted where DIAGNOSTIC was defined but MAP_LOCK_DIAGNOSTIC wasn't.
|
49991 |
17-Aug-1999 |
green |
Unbreak the nfs KLD_MODULE. It needs a bit more of vm_page.h than was exported (notably vm_page_undirty()). Also, let vm_page_dirty() work in a KLD.
|
49979 |
17-Aug-1999 |
alc |
vm_page_free_toq: Update the comment to reflect the demise of PQ_ZERO and remove a (now) useless test.
|
49949 |
17-Aug-1999 |
alc |
Correct an accidental omission of one "vm_page_undirty" replacement from the previous commit.
|
49948 |
17-Aug-1999 |
alc |
vm_page_free_toq: Clear the dirty bit mask (vm_page_undirty) before adding the page to the free page queue.
Submitted by: dillon
|
49945 |
17-Aug-1999 |
alc |
Add the (inline) function vm_page_undirty for clearing the dirty bitmask of a vm_page.
Use it.
Submitted by: dillon
|
49937 |
17-Aug-1999 |
alc |
vm_pageout_clean: Remove dead code.
Submitted by: dillon
|
49900 |
16-Aug-1999 |
alc |
vm_map_lock*: Remove semicolons or add "do { } while (0)" as necessary to enable the use of these macros in arbitrary statements. (There are no functional changes.)
Submitted by: dillon
|
49858 |
15-Aug-1999 |
alc |
Remove the declarations for "vm_map_t io_map". It's been unused since i386/i386/machdep rev 1.310, i.e., the demise of BOUNCE_BUFFERS.
|
49852 |
15-Aug-1999 |
alc |
Remove the declarations for "vm_map_t u_map". It's been unused since i386/i386/pmap rev 1.190. (The alpha never used it.)
|
49819 |
15-Aug-1999 |
alc |
contigmalloc1 (currently) depends on PQ_FREE and PQ_CACHE not being 0 to tell a valid "struct vm_page" from an invalid one in the vm_page_array. This isn't a very robust method.
|
49813 |
15-Aug-1999 |
mjacob |
Add back in old definitions if we're compiling for alpha.
|
49720 |
14-Aug-1999 |
alc |
Don't create a "struct vpgqueues" for PQ_NONE.
|
49697 |
13-Aug-1999 |
alc |
vm_map_madvise: A complete rewrite by dillon and myself to separate the implementation of behaviors that affect the vm_map_entry from those that affect the vm_object.
A result of this change is that madvise(..., MADV_FREE); is much cheaper.
|
49679 |
13-Aug-1999 |
phk |
The bdevsw() and cdevsw() are now identical, so kill the former.
|
49666 |
12-Aug-1999 |
alc |
Make the default page coloring parameters match a (non-Xeon) Pentium II/III.
This setting is also acceptable for Celerons and Pentium Pros with less than 1MB L2 caches.
Note: PQ_L2_SIZE is a misnomer. The correct number of colors is a function of the cache's degree of associativity as well as its size.
Submitted by: bde and alc
|
49655 |
12-Aug-1999 |
alc |
vm_object_madvise: Update the comments to match the implementation.
Submitted by: dillon
|
49654 |
12-Aug-1999 |
alc |
vm_object_madvise: Support MADV_DONTNEED and MADV_WILLNEED on object types besides OBJT_DEFAULT and OBJT_SWAP.
Submitted by: dillon
|
49618 |
11-Aug-1999 |
alc |
contigmalloc1: If a page is found in the wrong queue, panic instead of silently ignoring the problem.
|
49615 |
10-Aug-1999 |
peter |
Add a contigfree() as a corollary to contigmalloc() as it's not clear which free routine to use and people are tempted to use free() (which doesn't work).
|
49592 |
10-Aug-1999 |
alc |
vm_map_madvise: Now that behaviors are stored in the vm_map_entry rather than the vm_object, it's no longer necessary to instantiate a vm_object just to hold the behavior.
Reviewed by: dillon
|
49558 |
09-Aug-1999 |
phk |
Merge the cons.c and cons.h to the best of my ability. alpha may or may not compile, I can't test it.
|
49535 |
08-Aug-1999 |
phk |
Decommision miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>, a few lines into <sys/vnode.h>.
Add a few fields to struct specinfo, paving the way for the fun part.
|
49338 |
01-Aug-1999 |
alc |
Move the memory access behavior information provided by madvise from the vm_object to the vm_map.
Submitted by: dillon
|
49326 |
31-Jul-1999 |
alc |
Change the type of vpgqueues::lcnt from "int *" to "int". The indirection served no purpose.
|
49305 |
31-Jul-1999 |
alc |
vm_page_queue_init: Remove the initialization of PQ_NONE's cnt and lcnt. They aren't used.
vm_page_insert: Remove an unnecessary dereference.
vm_page_wire: Remove the one and only (and thus pointless) reference to PQ_NONE's lcnt.
|
48974 |
22-Jul-1999 |
alc |
Reduce the number of "magic constants" used for page coloring by one: PQ_PRIME2 and PQ_PRIME3 are used to accomplish the same thing at different places in the kernel. Drop PQ_PRIME3.
|
48963 |
21-Jul-1999 |
alc |
Fix the following problem:
When creating new processes (or performing exec), the new page directory is initialized too early. The kernel might grow before p_vmspace is initialized for the new process. Since pmap_growkernel doesn't yet know about the new page directory, it isn't updated, and subsequent use causes a failure.
The fix is (1) to clear p_vmspace early, to stop pmap_growkernel from stomping on memory, and (2) to defer part of the initialization of new page directories until p_vmspace is initialized.
PR: kern/12378 Submitted by: tegge Reviewed by: dfr
|
48948 |
20-Jul-1999 |
green |
Make a dev2budev() function, and use it. This refixes pstat (working, broken, working, broken, working) and savecore (working, working, broken, working, working).
Sorta Reviewed by: phk
|
48922 |
20-Jul-1999 |
alc |
Convert a "page not busy" warning to an assertion.
Submitted by: dillon@backplane.com
|
48866 |
17-Jul-1999 |
phk |
Add a field to struct swdevt to avoid a bogus udev2dev() call.
|
48859 |
17-Jul-1999 |
phk |
I have not one single time remembered the name of this function correctly so obviously I gave it the wrong name. s/umakedev/makeudev/g
|
48833 |
16-Jul-1999 |
alc |
Remove vm_object::last_read. It is used by the old swap pager, but not by the new one, i.e., vm/swap_pager.c rev 1.108.
Reviewed by: dillon@backplane.com
|
48757 |
11-Jul-1999 |
alc |
Cleanup OBJ_ONEMAPPING management.
vm_map.c: Don't set OBJ_ONEMAPPING on arbitrary vm objects. Only default and swap type vm objects should have it set. vm_object_deallocate already handles these cases.
vm_object.c: If OBJ_ONEMAPPING isn't already clear in vm_object_shadow, we are in trouble. Instead of clearing it, make it an assertion that it is already clear.
|
48738 |
10-Jul-1999 |
alc |
Change the data type used to represent page color in the vm_object to be the same as that used in the vm_page. (This change also shrinks the vm_object.)
|
48736 |
10-Jul-1999 |
alc |
Remove unused function prototypes.
|
48658 |
07-Jul-1999 |
ache |
Add an unused argument to udev2dev() to make the kernel compile.
|
48652 |
07-Jul-1999 |
msmith |
Reinstate the previous fix for the broken export of a dev_t in sw_dev, convert back to a dev_t when the value is actually used.
|
48651 |
07-Jul-1999 |
green |
Back out previous commit. It was wrong, and caused panics.
|
48647 |
06-Jul-1999 |
msmith |
swdevt should contain a udev_t, not a dev_t. This resulted in bogus swap device name reporting.
Submitted by: Bill Swingle <unfurl@freebsd.org>
|
48590 |
05-Jul-1999 |
mckay |
Reformat previous fix to remove an uglier than average goto.
Looked OK to: dg
|
48544 |
04-Jul-1999 |
mckusick |
The buffer queue mechanism has been reformulated. Instead of having QUEUE_AGE, QUEUE_LRU, and QUEUE_EMPTY we now have QUEUE_CLEAN, QUEUE_DIRTY, QUEUE_EMPTY, and QUEUE_EMPTYKVA. With this patch clean and dirty buffers have been separated. Empty buffers with KVM assignments have been separated from truly empty buffers. getnewbuf() has been rewritten and now operates in a 100% optimal fashion. That is, it is able to find precisely the right kind of buffer it needs to allocate a new buffer, defragment KVM, or to free-up an existing buffer when the buffer cache is full (which is a steady-state situation for the buffer cache).
Buffer flushing has been reorganized. Previously buffers were flushed in the context of whatever process hit the conditions forcing buffer flushing to occur. This resulted in processes blocking on conditions unrelated to what they were doing. This also resulted in inappropriate VFS stacking chains due to multiple processes getting stuck trying to flush dirty buffers or due to a single process getting into a situation where it might attempt to flush buffers recursively - a situation that was only partially fixed in prior commits. We have added a new daemon called the buf_daemon which is responsible for flushing dirty buffers when the number of dirty buffers exceeds the vfs.hidirtybuffers limit. This daemon attempts to dynamically adjust the rate at which dirty buffers are flushed such that getnewbuf() calls (almost) never block.
The number of nbufs and amount of buffer space is now scaled past the 8MB limit that was previously imposed for systems with over 64MB of memory, and the vfs.{lo,hi}dirtybuffers limits have been relaxed somewhat. The number of physical buffers has been increased with the intention that we will manage physical I/O differently in the future.
reassignbuf previously attempted to keep the dirtyblkhd list sorted which could result in non-deterministic operation under certain conditions, such as when a large number of dirty buffers are being managed. This algorithm has been changed. reassignbuf now keeps buffers locally sorted if it can do so cheaply, and otherwise gives up and adds buffers to the head of the dirtyblkhd list. The new algorithm is deterministic but not perfect. The new algorithm greatly reduces problems that previously occurred when write_behind was turned off in the system.
The P_FLSINPROG proc->p_flag bit has been replaced by the more descriptive P_BUFEXHAUST bit. This bit allows processes working with filesystem buffers to use available emergency reserves. Normal processes do not set this bit and are not allowed to dig into emergency reserves. The purpose of this bit is to avoid low-memory deadlocks.
A small race condition was fixed in getpbuf() in vm/vm_pager.c.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com> Reviewed by: Kirk McKusick <mckusick@mckusick.com>
|
48409 |
01-Jul-1999 |
peter |
Fix some int/long printf problems for the Alpha
|
48391 |
01-Jul-1999 |
peter |
Slight reorganization of kernel thread/process creation. Instead of using SYSINIT_KT() etc (which is a static, compile-time procedure), use a NetBSD-style kthread_create() interface. kproc_start is still available as a SYSINIT() hook. This allowed simplification of chunks of the sysinit code in the process. This kthread_create() is our old kproc_start internals, with the SYSINIT_KT fork hooks grafted in and tweaked to work the same as the NetBSD one.
One thing I'd like to do shortly is get rid of nfsiod as a user initiated process. It makes sense for the nfs client code to create them on the fly as needed up to a user settable limit. This means that nfsiod doesn't need to be in /sbin and is always "available". This is a fair bit easier to do outside of the SYSINIT_KT() framework.
|
48289 |
27-Jun-1999 |
peter |
Kirk missed a required BUF_KERNPROC(). Even though this is a non-async transfer, the b_iodone hook causes biodone() to release it from interrupt context.
|
48274 |
27-Jun-1999 |
peter |
Minor tweaks to make sure (new) prerequisites for <sys/buf.h> (mostly splbio()/splx()) are #included in time.
|
48252 |
26-Jun-1999 |
peter |
There isn't much point waking up a daemon that hasn't existed since softupdates came in. Try calling speedup_syncer() instead..
|
48225 |
26-Jun-1999 |
mckusick |
Convert buffer locking from using the B_BUSY and B_WANTED flags to using lockmgr locks. This commit should be functionally equivalent to the old semantics. That is, all buffer locking is done with LK_EXCLUSIVE requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will be done in future commits.
|
48099 |
22-Jun-1999 |
alc |
Remove (1) "extern" declarations for variables that were previously made "static" and (2) initialized but unused variables.
|
48059 |
20-Jun-1999 |
alc |
Remove vm_object::cache_count and vm_object::wired_count. They are not used. (Nor is there any planned use by John who introduced them.)
Reviewed by: "John S. Dyson" <toor@dyson.iquest.net>
|
48045 |
20-Jun-1999 |
alc |
Set cnt.v_page_size to PAGE_SIZE rather than DEFAULT_PAGE_SIZE so that "vmstat -s" reports the correct value on the Alpha.
Submitted by: Hidetoshi Shimokawa <simokawa@sat.t.u-tokyo.ac.jp>
|
48022 |
19-Jun-1999 |
alc |
Remove some unused function and variable declarations.
|
47986 |
17-Jun-1999 |
alc |
vm_map_growstack uses vmspace::vm_ssize as though it contained the stack size in bytes when in fact it is the stack size in pages.
|
47968 |
17-Jun-1999 |
alc |
vm_map_insert sometimes extends an existing vm_map entry, rather than creating a new entry. vm_map_stack and vm_map_growstack can panic when a new entry isn't created. Fixed vm_map_stack and vm_map_growstack.
Also, when extending the stack, always set the protection to VM_PROT_ALL.
|
47966 |
17-Jun-1999 |
alc |
Move vm_map_stack and vm_map_growstack after the definition of the vm_map_clip_end macro. (The next commit will modify vm_map_stack and vm_map_growstack to use vm_map_clip_end.)
|
47965 |
17-Jun-1999 |
alc |
Remove some unused declarations and duplicate initialization.
|
47888 |
12-Jun-1999 |
alc |
vm_map_protect: The wrong vm_map_entry is used to determine if writes must not be allowed due to COW.
|
47841 |
08-Jun-1999 |
dt |
Add a function kmem_alloc_nofault() - same as kmem_alloc_pageable(), but create a nofault entry. It will be used to allocate kmem for upages.
(I am not too happy with all this, but it's better than nothing).
|
47765 |
05-Jun-1999 |
alc |
vm_mmap: Ensure that device mappings get MAP_PREFAULT(_PARTIAL) set, so that 4M page mappings are used when possible.
Reviewed by: Luoqi Chen <luoqi@watermarkgroup.com>
|
47673 |
01-Jun-1999 |
phk |
Shorten a detour around dev_t to get a udev_t created.
|
47640 |
31-May-1999 |
phk |
Simplify cdevsw registration.
The cdevsw_add() function now finds the major number(s) in the struct cdevsw passed to it. cdevsw_add_generic() is no longer needed, cdevsw_add() does the same thing.
cdevsw_add() will print a message if the d_maj field looks bogus.
Remove nblkdev and nchrdev variables. Most places they were used bogusly. Instead check a dev_t for validity by seeing if devsw() or bdevsw() returns NULL.
Move bdevsw() and devsw() functions to kern/kern_conf.c
Bump __FreeBSD_version to 400006
This commit removes: 72 bogus makedev() calls 26 bogus SYSINIT functions
if_xe.c bogusly accessed cdevsw[], author/maintainer please fix.
I4b and vinum not changed. Patches emailed to authors. LINT probably broken until they catch up.
|
47625 |
30-May-1999 |
phk |
This commit should be an extensive NO-OP:
Reformat and initialize correctly all "struct cdevsw".
Initialize the d_maj and d_bmaj fields.
The d_reset field was not removed, although it is never used.
I used a program to do most of this, so all the files now use the same consistent format. Please keep it that way.
Vinum and i4b not modified, patches emailed to respective authors.
|
47607 |
30-May-1999 |
alc |
Addendum to 1.155. Verify the existence of the object before checking its reference count.
|
47568 |
28-May-1999 |
alc |
Avoid the creation of unnecessary shadow objects.
|
47290 |
18-May-1999 |
alc |
vm_map_insert: General cleanup. Eliminate coalescing checks that are duplicated by vm_object_coalesce.
|
47258 |
17-May-1999 |
alc |
Add the options MAP_PREFAULT and MAP_PREFAULT_PARTIAL to vm_map_find/insert, eliminating the need for the pmap_object_init_pt calls in imgact_* and mmap.
Reviewed by: David Greenman <dg@root.com>
|
47243 |
16-May-1999 |
alc |
Remove prototypes for functions that don't exist anymore (vm_map.h).
Remove a useless argument from vm_map_madvise's interface (vm_map.c, vm_map.h, and vm_mmap.c).
Remove a redundant test in vm_uiomove (vm_map.c).
Make two changes to vm_object_coalesce:
1. Determine whether the new range of pages actually overlaps the existing object's range of pages before calling vm_object_page_remove. (Prior to this change almost 90% of the calls to vm_object_page_remove were to remove pages that were beyond the end of the object.)
2. Free any swap space allocated to removed pages.
|
47239 |
15-May-1999 |
dt |
Fix confusion of size of transfer with size of the pager.
PR: 11658 Broken in: 1.89 (1998/03/07)
|
47207 |
14-May-1999 |
alc |
Simplify vm_map_find/insert's interface: remove the MAP_COPY_NEEDED option.
It never makes sense to specify MAP_COPY_NEEDED without also specifying MAP_COPY_ON_WRITE, and vice versa. Thus, MAP_COPY_ON_WRITE suffices.
Reviewed by: David Greenman <dg@root.com>
|
47111 |
13-May-1999 |
bde |
Casting handles from void * to uintptr_t on the way to dev_t became especially bogus when dev_t became a pointer.
|
47094 |
13-May-1999 |
luoqi |
Device pager's handle is dev_t not udev_t.
|
47064 |
12-May-1999 |
phk |
Fix a udev_t/dev_t mismatch which prevented paging from working.
|
47028 |
11-May-1999 |
phk |
Divorce "dev_t" from the "major|minor" bitmap, which is now called udev_t in the kernel but still called dev_t in userland.
Provide functions to manipulate both types: major() umajor() minor() uminor() makedev() umakedev() dev2udev() udev2dev()
For now they're functions, they will become in-line functions after one of the next two steps in this process.
Return major/minor/makedev to macro-hood for userland.
Register a name in cdevsw[] for the "filedescriptor" driver.
In the kernel the udev_t appears in places where we have the major/minor number combination, (ie: a potential device: we may not have the driver nor the device), like in inodes, vattr, cdevsw registration and so on, whereas the dev_t appears where we carry around a reference to an actual device.
In the future the cdevsw and the aliased-from vnode will be hung directly from the dev_t, along with up to two softc pointers for the device driver and a few housekeeping bits. This will essentially replace the current "alias" check code (same buck, bigger bang).
A little stunt has been provided to try to catch places where the wrong type is being used (dev_t vs udev_t), if you see something not working, #undef DEVT_FASCIST in kern/kern_conf.c and see if it makes a difference. If it does, please try to track it down (many hands make light work) or at least try to reproduce it as simply as possible, and describe how to do that.
Without DEVT_FASCIST I believe this patch is a no-op.
Stylistic/posixoid comments about the userland view of the <sys/*.h> files welcome now, from userland they now contain the end result.
Next planned step: make all dev_t's refer to the same devsw[] which means convert BLK's to CHR's at the perimeter of the vnodes and other places where they enter the game (bootdev, mknod, sysctl).
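A minimal user-space sketch of the new accessor trio introduced above. The 8-bit-major/8-bit-minor packing here is purely illustrative; the kernel's actual udev_t layout differs:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative udev_t: a plain major|minor bitmap, assuming an 8-bit
 * major in the high byte and an 8-bit minor in the low byte. */
typedef uint32_t udev_t;

static udev_t umakedev(int major, int minor) {
    return (udev_t)(((major & 0xff) << 8) | (minor & 0xff));
}

static int umajor(udev_t dev) { return (dev >> 8) & 0xff; }
static int uminor(udev_t dev) { return dev & 0xff; }
```

The point of the split is that values shaped like this can live in inodes and vattr even when no driver or device exists, while dev_t stays a live reference.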
|
46816 |
09-May-1999 |
phk |
No point in swapdev being a static global when used only locally.
|
46676 |
08-May-1999 |
phk |
I got tired of seeing all the cdevsw[major(foo)] all over the place.
Made a new (inline) function devsw(dev_t dev) and substituted it.
Changed the BDEV variant to this format as well: bdevsw(dev_t dev)
DEVFS will eventually benefit from this change too.
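The devsw() idea can be sketched in miniature; the table size, struct fields, and registration helper below are illustrative stand-ins, not the kernel's. Note how it also gives the NULL-for-invalid check mentioned in r47640:

```c
#include <assert.h>
#include <stddef.h>

#define NUMCDEVSW 256   /* illustrative table size */

struct cdevsw { int d_maj; };

static struct cdevsw *cdevsw_tab[NUMCDEVSW];

/* One inline replaces every open-coded cdevsw[major(foo)] lookup and
 * doubles as a validity check: NULL means no such driver. */
static struct cdevsw *devsw(int maj) {
    if (maj < 0 || maj >= NUMCDEVSW)
        return NULL;
    return cdevsw_tab[maj];   /* NULL if no driver registered */
}

/* Hypothetical registration of one driver, for the demo. */
static struct cdevsw foo_cdevsw = { 12 };
static void cdevsw_demo_register(void) { cdevsw_tab[12] = &foo_cdevsw; }
```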
|
46635 |
07-May-1999 |
phk |
Continue where Julian left off in July 1998:
Virtualize bdevsw[] from cdevsw. bdevsw() is now an (inline) function.
Join CDEV_MODULE and BDEV_MODULE to DEV_MODULE (please pay attention to the order of the cmaj/bmaj arguments!)
Join CDEV_DRIVER_MODULE and BDEV_DRIVER_MODULE to DEV_DRIVER_MODULE (ditto!)
(Next step will be to convert all bdev dev_t's to cdev dev_t's before they get to do any damage^H^H^H^H^H^Hwork in the kernel.)
|
46625 |
07-May-1999 |
phk |
Introduce two functions: physread() and physwrite() and use these directly in *devsw[] rather than the 46 local copies of the same functions.
(grog will do the same for vinum when he has time)
|
46592 |
06-May-1999 |
peter |
Add brackets to silence egcs and help clarity.
|
46580 |
06-May-1999 |
phk |
remove b_proc from struct buf, it's (now) unused.
Reviewed by: dillon, bde
|
46538 |
06-May-1999 |
luoqi |
Don't ignore mmap() address hint below the text section.
|
46381 |
03-May-1999 |
billf |
Add sysctl descriptions to many SYSCTL_XXXs
PR: kern/11197 Submitted by: Adrian Chadd <adrian@FreeBSD.org> Reviewed by: billf(spelling/style/minor nits) Looked at by: bde(style)
|
46349 |
02-May-1999 |
alc |
The VFS/BIO subsystem contained a number of hacks in order to optimize piecemeal, middle-of-file writes for NFS. These hacks have caused no end of trouble, especially when combined with mmap(). I've removed them. Instead, NFS will issue a read-before-write to fully instantiate the struct buf containing the write. NFS does, however, optimize piecemeal appends to files. For most common file operations, you will not notice the difference. The sole remaining fragment in the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache coherency issues with read-merge-write style operations. NFS also optimizes the write-covers-entire-buffer case by avoiding the read-before-write. There is quite a bit of room for further optimization in these areas.
The VM system marks pages fully-valid (AKA vm_page_t->valid = VM_PAGE_BITS_ALL) in several places, most notably in vm_fault. This is not correct operation. The vm_pager_get_pages() code is now responsible for marking VM pages all-valid. A number of VM helper routines have been added to aid in zeroing-out the invalid portions of a VM page prior to the page being marked all-valid. This operation is necessary to properly support mmap(). The zeroing occurs most often when dealing with file-EOF situations. Several bugs have been fixed in the NFS subsystem, including bits handling file and directory EOF situations and buf->b_flags consistency issues relating to clearing B_ERROR & B_INVAL, and handling B_DONE.
getblk() and allocbuf() have been rewritten. B_CACHE operation is now formally defined in comments and more straightforward in implementation. B_CACHE for VMIO buffers is based on the validity of the backing store. B_CACHE for non-VMIO buffers is based simply on whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear, and vice versa). biodone() is now responsible for setting B_CACHE when a successful read completes. B_CACHE is also set when a bdwrite() is initiated and when a bwrite() is initiated. VFS VOP_BWRITE routines (there are only two - nfs_bwrite() and bwrite()) are now expected to set B_CACHE. This means that bowrite() and bawrite() also set B_CACHE indirectly.
There are a number of places in the code which were previously using buf->b_bufsize (which is DEV_BSIZE aligned) when they should have been using buf->b_bcount. These have been fixed. getblk() now clears B_DONE on return because the rest of the system is so bad about dealing with B_DONE.
Major fixes to NFS/TCP have been made. A server-side bug could cause requests to be lost by the server due to nfs_realign() overwriting other rpc's in the same TCP mbuf chain. The server's kernel must be recompiled to get the benefit of the fixes.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
|
46153 |
28-Apr-1999 |
dt |
s/static foo_devsw_installed = 0;/static int foo_devsw_installed;/. (Edited automatically)
|
46112 |
27-Apr-1999 |
phk |
Suser() simplification:
1: s/suser/suser_xxx/
2: Add new function: suser(struct proc *), prototyped in <sys/proc.h>.
3: s/suser_xxx(\([a-zA-Z0-9_]*\)->p_ucred, \&\1->p_acflag)/suser(\1)/
The remaining suser_xxx() calls will be scrutinized and dealt with later.
There may be some unneeded #include <sys/cred.h>, but they are left as an exercise for Bruce.
More changes to the suser() API will come along with the "jail" code.
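The API change can be sketched as follows; the struct layouts and the ASU flag below are simplified stand-ins for the real kernel types:

```c
#include <assert.h>

/* Simplified stand-ins for the kernel's credential and proc structures. */
struct ucred { int cr_uid; };
struct proc  { struct ucred p_ucred; int p_acflag; };

#define ASU 1  /* "used super-user powers" accounting flag */

/* Old-style interface: caller digs out the cred and acflag. */
static int suser_xxx(struct ucred *cred, int *acflag) {
    if (cred->cr_uid != 0)
        return 1;              /* EPERM-style failure */
    if (acflag)
        *acflag |= ASU;
    return 0;
}

/* New-style suser(struct proc *), as in steps 2 and 3 of the commit:
 * the wrapper finds the cred/acflag itself. */
static int suser(struct proc *p) {
    return suser_xxx(&p->p_ucred, &p->p_acflag);
}

static struct proc proc_root = { {0},    0 };
static struct proc proc_user = { {1000}, 0 };
```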
|
45960 |
23-Apr-1999 |
dt |
Make pmap_collect() an official pmap interface.
|
45821 |
19-Apr-1999 |
peter |
unifdef -DVM_STACK - it's been on for a while for x86 and was checked and appeared to be working for the Alpha some time ago.
|
45665 |
13-Apr-1999 |
peter |
Move the declaration of faultin() from the vm headers to proc.h, since it is now referenced from a macro there (PHOLD()).
|
45567 |
11-Apr-1999 |
eivind |
Staticize
|
45561 |
10-Apr-1999 |
dt |
Convert usage of vm_page_bits() to the new convention ("Inputs are required to range within a page").
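Under the new convention, vm_page_bits() might look roughly like the sketch below, assuming 4K pages and 512-byte DEV_BSIZE chunks (one valid/dirty bit per chunk); the real kernel function differs in detail:

```c
#include <assert.h>

#define PAGE_SIZE 4096
#define DEV_BSIZE 512   /* 8 chunks per page -> 8 valid/dirty bits */

/* Return the bitmask covering byte range [base, base+size) of a page.
 * Per the new convention, the range must lie within a single page. */
static int vm_page_bits(int base, int size) {
    assert(base >= 0 && size >= 0 && base + size <= PAGE_SIZE);
    if (size == 0)
        return 0;
    int first = base / DEV_BSIZE;                         /* inclusive */
    int last  = (base + size + DEV_BSIZE - 1) / DEV_BSIZE; /* exclusive */
    return ((1 << last) - 1) & ~((1 << first) - 1);
}
```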
|
45550 |
10-Apr-1999 |
eivind |
Lock vnode correctly for VOP_OPEN.
Discussed with: alc, dillon
|
45365 |
06-Apr-1999 |
peter |
Don't forcibly kill processes that are locked in-core via PHOLD - it was just checking P_NOSWAP before.
|
45363 |
06-Apr-1999 |
peter |
Only use p->p_lock (manage by PHOLD()/PRELE()) - P_NOSWAP/P_PHYSIO is no longer set.
|
45347 |
05-Apr-1999 |
julian |
Catch a case spotted by Tor where mmapped files could leave garbage in the unallocated parts of the last page when the file ended on a frag but not a page boundary. Delimited by tags PRE_MATT_MMAP_EOF and POST_MATT_MMAP_EOF, in files alpha/alpha/pmap.c i386/i386/pmap.c nfs/nfs_bio.c vm/pmap.h vm/vm_page.c vm/vm_page.h vm/vnode_pager.c miscfs/specfs/spec_vnops.c ufs/ufs/ufs_readwrite.c kern/vfs_bio.c
Submitted by: Matt Dillon <dillon@freebsd.org> Reviewed by: Alan Cox <alc@freebsd.org>
|
45293 |
04-Apr-1999 |
alc |
Two changes to vm_map_delete:
1. Don't bother checking object->ref_count == 1 in order to set OBJ_ONEMAPPING. It's a waste of time. If object->ref_count == 1, vm_map_entry_delete will "run-down" the object and its pages.
2. If object->ref_count == 1, ignore OBJ_ONEMAPPING. Wait for vm_map_entry_delete to "run-down" the object and its pages. Otherwise, we're calling two different procedures to delete the object's pages.
Note: "vmstat -s" will once again show a non-zero value for "pages freed by exiting processes".
|
45069 |
27-Mar-1999 |
alc |
Mainly, eliminate the comments about share maps. (We don't have share maps any more.) Also, eliminate an incorrect comment that says that we don't coalesce vm_map_entry's. (We do.)
|
45057 |
27-Mar-1999 |
eivind |
Correct a comment.
|
44928 |
21-Mar-1999 |
alc |
Two changes:
Remove more (redundant) map timestamp increments from properly synchronized routines. (Changed: vm_map_entry_link, vm_map_entry_unlink, and vm_map_pageable.)
Micro-optimize vm_map_entry_link and vm_map_entry_unlink, eliminating unnecessary dereferences. At the same time, converted them from macros to inline functions.
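A hypothetical miniature of the converted inlines, with the vm_map fields pared down to the list linkage and entry count. Per the commit, the timestamp is no longer bumped here:

```c
#include <assert.h>

/* Abridged stand-ins for the real structures. */
struct vm_map_entry {
    struct vm_map_entry *prev, *next;
};

struct vm_map {
    struct vm_map_entry header;   /* circular-list sentinel */
    int nentries;
};

static void vm_map_init_demo(struct vm_map *map) {
    map->header.prev = map->header.next = &map->header;
    map->nentries = 0;
}

/* Insert e after 'after'; one dereference per pointer, no macro games. */
static void vm_map_entry_link(struct vm_map *map,
                              struct vm_map_entry *after,
                              struct vm_map_entry *e) {
    map->nentries++;
    e->prev = after;
    e->next = after->next;
    e->next->prev = e;
    after->next = e;
}

static void vm_map_entry_unlink(struct vm_map *map, struct vm_map_entry *e) {
    e->prev->next = e->next;
    e->next->prev = e->prev;
    map->nentries--;
}

static struct vm_map demo_map;
static struct vm_map_entry demo_a, demo_b;
```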
|
44880 |
19-Mar-1999 |
alc |
Construct the free queue(s) in descending order (by physical address) so that the first 16MB of physical memory is allocated last rather than first. On large-memory machines, this avoids the exhaustion of low physical memory before isa_dmainit has run.
|
44793 |
16-Mar-1999 |
alc |
Correct a problem in kmem_malloc: A kmem_malloc allowing "wait" may block (VM_WAIT) holding the map lock. This is bad. For example, a subsequent kmem_malloc by an interrupt handler on the same map may find the lock held and panic in the lockmgr.
|
44773 |
15-Mar-1999 |
alc |
Two changes:
In general, vm_map_simplify_entry should be performed INSIDE the loop that traverses the map, not outside. (Changed: vm_map_inherit, vm_map_pageable.)
vm_fault_unwire doesn't acquire the map lock (or block holding it). Thus, vm_map_set/clear_recursive shouldn't be called. (Changed: vm_map_user_pageable, vm_map_pageable.)
|
44771 |
15-Mar-1999 |
julian |
Fix breakage in last commit Submitted by: Brian Feldman <green@unixhelp.org>
|
44754 |
14-Mar-1999 |
julian |
A bit of a hack, but allows the vn device to be a module again.
Submitted by: Matt Dillon <dillon@freebsd.org>
|
44739 |
14-Mar-1999 |
julian |
Submitted by: Matt Dillon <dillon@freebsd.org> The old VN device broke in -4.x when the definition of B_PAGING changed. This patch fixes this plus implements additional capabilities. The new VN device can be backed by a file ( as per normal ), or it can be directly backed by swap.
Due to dependencies in VM include files (on opt_xxx options) the new vn device cannot be a module yet. This will be fixed in a later commit. This commit is delimited by tags {PRE,POST}_MATT_VNDEV
|
44733 |
14-Mar-1999 |
alc |
Correct two optimization errors in vm_object_page_remove:
1. The size of vm_object::memq is vm_object::resident_page_count, not vm_object::size.
2. The "size > 4" test sometimes results in the traversal of a ~1000 page memq in order to locate ~10 pages.
|
44682 |
12-Mar-1999 |
alc |
Remove vm_page_frees from kmem_malloc that are performed by vm_map_delete/vm_object_page_remove anyway.
|
44675 |
12-Mar-1999 |
julian |
Stop the mfs from trying to swap out crucial bits of the mfs as this can lead to deadlock. Submitted by: Matt Dillon <dillon@freebsd.org>
|
44597 |
09-Mar-1999 |
alc |
Remove (redundant) map timestamp increments from some properly synchronized routines.
|
44569 |
08-Mar-1999 |
alc |
Remove an unused variable from vmspace_fork.
|
44565 |
07-Mar-1999 |
alc |
Change vm_map_growstack to acquire and hold a read lock (instead of a write lock) until it actually needs to modify the vm_map.
Note: it is legal to modify vm_map::hint without holding a write lock.
Submitted by: "Richard Seaman, Jr." <dick@tar.com> with minor changes by myself.
|
44513 |
06-Mar-1999 |
alc |
Upgrading a map's lock to exclusive status should increment the map's timestamp. In general, whenever an exclusive lock is acquired the timestamp should be incremented.
|
44438 |
02-Mar-1999 |
alc |
To avoid a conflict for the vm_map's lock with vm_fault, release the read lock around the subyte operations in mincore. After the lock is reacquired, use the map's timestamp to determine if we need to restart the scan.
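The timestamp-restart pattern can be sketched in user space; the locking and subyte() copyout are elided, and the mutation flag below stands in for a concurrent writer modifying the map while the lock is dropped:

```c
#include <assert.h>

/* Abridged map: only the generation counter matters for this sketch. */
struct vm_map_demo { unsigned timestamp; };

/* Hypothetical scan step; returns 1 if the scan had to restart because
 * the map changed while the read lock was released. */
static int scan_with_restart(struct vm_map_demo *map, int mutate_during_unlock) {
    int restarted = 0;
    for (;;) {
        unsigned saved = map->timestamp;   /* sampled under the lock */
        /* ...vm_map_unlock_read(map); subyte(...); vm_map_lock_read(map)... */
        if (mutate_during_unlock) {
            map->timestamp++;              /* another thread changed the map */
            mutate_during_unlock = 0;
        }
        if (map->timestamp != saved) {     /* map changed: rescan the range */
            restarted = 1;
            continue;
        }
        break;
    }
    return restarted;
}

static struct vm_map_demo mincore_map;
```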
|
44396 |
02-Mar-1999 |
alc |
Remove the last of the share map code: struct vm_map::is_main_map.
Reviewed by: Matthew Dillon <dillon@apollo.backplane.com>
|
44379 |
01-Mar-1999 |
alc |
mincore doesn't modify the vm_map. Therefore, it doesn't require an exclusive lock. A read lock will suffice.
|
44321 |
27-Feb-1999 |
alc |
Reviewed by: "John S. Dyson" <dyson@iquest.net> Submitted by: Matthew Dillon <dillon@apollo.backplane.com> To prevent a deadlock, if we are extremely low on memory, force synchronous operation by the VOP_PUTPAGES in vnode_pager_putpages.
|
44250 |
25-Feb-1999 |
alc |
Reviewed by: Matthew Dillon <dillon@apollo.backplane.com> Corrected the computation of cnt.v_ozfod in vm_fault: vm_fault was counting the number of unoptimized rather than optimized zero-fill faults.
|
44249 |
25-Feb-1999 |
dillon |
Comment swstrategy() routine.
|
44245 |
24-Feb-1999 |
dillon |
Remove unnecessary page protects on map_split and collapse operations. Fix bug where an object's OBJ_WRITEABLE/OBJ_MIGHTBEDIRTY flags do not get set under certain circumstances ( page rename case ).
Reviewed by: Alan Cox <alc@cs.rice.edu>, John Dyson
|
44206 |
22-Feb-1999 |
dillon |
Removed ENOMEM error on swap_pager_full condition which ignored the availability of physical memory. As per original bug report by Bruce.
Reviewed by: Alan Cox <alc@cs.rice.edu>
|
44179 |
21-Feb-1999 |
dillon |
Remove conditional sysctls
Leave swap_async_max sysctl intact, remove swap_cluster_max sysctl.
Reviewed by: Alan Cox <alc@cs.rice.edu>
|
44178 |
21-Feb-1999 |
dillon |
Reviewed by: Alan Cox <alc@cs.rice.edu>
Fix problem w/ low-swap/low-memory handling as reported by Bruce Evans.
|
44156 |
19-Feb-1999 |
luoqi |
Eliminate a possible numerical overflow.
|
44146 |
19-Feb-1999 |
luoqi |
Hide access to vmspace:vm_pmap with inline function vmspace_pmap(). This is the preparation step for moving pmap storage out of vmspace proper.
Reviewed by: Alan Cox <alc@cs.rice.edu> Matthew Dillon <dillon@apollo.backplane.com>
|
44135 |
19-Feb-1999 |
dillon |
Submitted by: Alan Cox <alc@cs.rice.edu>
Remove remaining share map garbage from vm_map_lookup() and clean out old #if 0 stuff.
|
44124 |
18-Feb-1999 |
dillon |
Limit the number of simultaneous asynchronous swap pager I/Os that can be in progress at any given moment.
Add two swap tuneables to sysctl:
vm.swap_async_max: 4 vm.swap_cluster_max: 16
Recommended values are a cluster size of 8 or 16 pages. async_max is about right for 1-4 swap devices. Reduce to 2 if swap is eating too much bandwidth, or even 1 if swap is both eating too much bandwidth and sitting on a slow network (10BaseT).
The defaults work well across a broad range of configurations and should normally be left alone.
|
44098 |
17-Feb-1999 |
dillon |
Submitted by: Luoqi Chen <luoqi@watermarkgroup.com>
Unlock vnode before messing with map to avoid deadlock between map and vnode ( e.g. with exec_map and underlying program binary vnode ). Solves a deadlock that most often occurs during a large -j# buildworld reported by three people.
|
44051 |
15-Feb-1999 |
dillon |
Minor reorganization of vm_page_alloc(). No functional changes have been made but the code has been reorganized and documented to make it more readable, reduce the size of the code, and optimize the branch path caching capabilities that most modern processors have.
|
44034 |
15-Feb-1999 |
dillon |
Fix a bug in the new madvise() code that would possibly (improperly) free swap space out from under a busy page. This is not legal because the swap may be reallocated and I/O issued while I/O is still in progress on the same swap page from the madvise()'d object. This bug could only occur under extreme paging conditions but might not cause an error until much later. As a side-benefit, madvise() is now even smaller.
|
43941 |
12-Feb-1999 |
dillon |
Minor optimization to madvise() MADV_FREE to make page as freeable as possible without actually unmapping it from the process.
As of now, I declare madvise() on OBJT_DEFAULT/OBJT_SWAP objects to be 'working and complete'.
|
43923 |
12-Feb-1999 |
dillon |
Fix a non-fatal bug in vm_map_insert() which improperly cleared OBJ_ONEMAPPING in the case where an object is extended and an additional vm_map_entry must be allocated.
In vm_object_madvise(), remove the call to vm_page_cache() in the MADV_FREE case in order to avoid a page fault on page reuse. However, we still mark the page as clean and destroy any swap backing store.
Submitted by: Alan Cox <alc@cs.rice.edu>
|
43795 |
09-Feb-1999 |
dillon |
Addendum to the vm_map coalesce optimization. Also, this was backed out because there was a consensus on -current in regards to leaving bss r+w+x instead of r+w. This is in order to maintain reasonable compatibility with existing JIT compilers (e.g. kaffe) and possibly other programs.
|
43777 |
08-Feb-1999 |
dillon |
Revamp vm_object_[q]collapse(). Despite the complexity of this patch, no major operational changes were made. The three core object->memq loops were moved into a single inline procedure and various operational characteristics of the collapse function were documented.
|
43761 |
08-Feb-1999 |
dillon |
General cleanup. Remove #if 0's and remove useless register qualifiers.
|
43752 |
08-Feb-1999 |
dillon |
Rip out PQ_ZERO queue. PQ_ZERO functionality is now combined with PQ_FREE. There is little operational difference other than the kernel being a few kilobytes smaller and the code being more readable.
* vm_page_select_free() has been *greatly* simplified. * The PQ_ZERO page queue and supporting structures have been removed * vm_page_zero_idle() revamped (see below)
PG_ZERO setting and clearing has been migrated from vm_page_alloc() to vm_page_free[_zero]() and will eventually be guaranteed to remain tracked throughout a page's life ( if it isn't already ).
When a page is freed, PG_ZERO pages are appended to the appropriate tailq in the PQ_FREE queue while non-PG_ZERO pages are prepended. When locating a new free page, PG_ZERO selection operates from within vm_page_list_find() ( get page from end of queue instead of beginning of queue ) and then only occurs in the nominal critical path case. If the nominal case misses, both normal and zero-page allocation devolves into the same _vm_page_list_find() select code without any specific zero-page optimizations.
Additionally, vm_page_zero_idle() has been revamped. Hysteresis has been added and zero-page tracking adjusted to conform with the other changes. Currently hysteresis is set at 1/3 (lo) and 1/2 (hi) the number of free pages. We may wish to increase both parameters as time permits. The hysteresis is designed to avoid silly zeroing in borderline allocation/free situations.
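The described hysteresis can be sketched as a pure function. The 1/3 and 1/2 watermarks come from the commit message; the function shape and 'running' state argument are illustrative:

```c
#include <assert.h>

/* Should the idle zero-page thread run?  Start zeroing when pre-zeroed
 * pages drop below 1/3 of the free count, stop once they reach 1/2,
 * and in between keep doing whatever we were doing (hysteresis, to
 * avoid silly zeroing in borderline allocation/free situations). */
static int zero_idle_wanted(int free_count, int zero_count, int running) {
    if (zero_count < free_count / 3)
        return 1;               /* below low water: start/keep zeroing */
    if (zero_count >= free_count / 2)
        return 0;               /* at/above high water: stop */
    return running;             /* in the dead band: no change */
}
```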
|
43751 |
08-Feb-1999 |
dillon |
Backed out vm_map coalesce optimization - it resulted in 22% more page faults for reasons unknown ( under investigation ). /usr/bin/time -l make in /usr/src/bin went from 67000 faults to 90000 faults.
|
43748 |
07-Feb-1999 |
dillon |
Remove MAP_ENTRY_IS_A_MAP 'share' maps. These maps were once used to attempt to optimize forks but were essentially given-up on due to problems and replaced with an explicit dup of the vm_map_entry structure. Prior to the removal, they were entirely unused.
|
43747 |
07-Feb-1999 |
dillon |
Remove L1 cache coloring optimization ( leave L2 cache coloring opt ).
Rewrite vm_page_list_find() and vm_page_select_free() - make inline out of nominal case.
|
43729 |
07-Feb-1999 |
dillon |
When shadowing objects, adjust the page coloring of the shadowing object such that pages in the combined/shadowed object are consistently colored.
Submitted by: "John S. Dyson" <dyson@iquest.net>
|
43700 |
06-Feb-1999 |
dillon |
Add hysteresis to the 'swap_pager_getswapspace: failed' console message. Also widen the hysteresis levels a little ( these really should be dynamically configured ).
|
43638 |
05-Feb-1999 |
dillon |
The elf loader sets the permissions on bss to VM_PROT_READ|VM_PROT_WRITE rather than VM_PROT_ALL. obreak, on the other hand, uses VM_PROT_ALL. This prevents vm_map_insert() from being able to coalesce the heap and creates an extra map entry. Since current architectures ignore VM_PROT_EXECUTE anyway, and since not having VM_PROT_EXECUTE on data/bss may provide protection in the future, obreak now uses read+write rather than all (r+w+x).
This is an optimization, not a bug fix.
Submitted by: Alan Cox <alc@cs.rice.edu>
|
43616 |
04-Feb-1999 |
dillon |
Fix bug in a KASSERT I introduced in vm_page_qcollapse() rev 1.139.
Since paging is in progress, the page scan in vm_page_qcollapse() must be protected by at least splbio() to prevent pages from being ripped out from under the scan.
|
43547 |
03-Feb-1999 |
dillon |
Submitted by: Alan Cox
The vm_map_insert()/vm_object_coalesce() optimization has been extended to include OBJT_SWAP objects as well as OBJT_DEFAULT objects. This is possible because it costs nothing to extend an OBJT_SWAP object with the new swapper. We can't do this with the old swapper. The old swapper used a linear array that would have had to have been reallocated, costing time as well as a potential low-memory deadlock.
|
43493 |
01-Feb-1999 |
dillon |
This patch eliminates a pointless test from appearing twice in vm_map_simplify_entry. Basically, once you've verified that the objects in the adjacent vm_map_entry's are the same, either NULL or the same vm_object, there's no point in checking that the objects have the same behavior.
Obtained from: Alan Cox <alc@cs.rice.edu>
|
43476 |
31-Jan-1999 |
julian |
Submitted by: Alan Cox <alc@cs.rice.edu> Checked by: "Richard Seaman, Jr." <dick@tar.com> Fix the following problem: As the code stands now, growing any stack, and not just the process's main stack, modifies vm->vm_ssize. This is inconsistent with the code earlier in the same procedure.
|
43311 |
28-Jan-1999 |
dillon |
Fix warnings in preparation for adding -Wall -Wcast-qual to the kernel compile
|
43287 |
27-Jan-1999 |
dillon |
Remove unintended trigraph sequences in comments for -Wall
|
43209 |
26-Jan-1999 |
julian |
Mostly remove the VM_STACK OPTION. This changes the definitions of a few items so that structures are the same whether or not the option itself is enabled. This allows people to enable and disable the option without recompilng the world.
As the author says:
|I ran into a problem pulling out the VM_STACK option. I was aware of this |when I first did the work, but then forgot about it. The VM_STACK stuff |has some code changes in the i386 branch. There need to be corresponding |changes in the alpha branch before it can come out completely.
what is done: | |1) Pull the VM_STACK option out of the header files it appears in. This |really shouldn't affect anything that executes with or without the rest |of the VM_STACK patches. The vm_map_entry will then always have one |extra element (avail_ssize). It just won't be used if the VM_STACK |option is not turned on. | |I've also pulled the option out of vm_map.c. This shouldn't harm anything, |since the routines that are enabled as a result are not called unless |the VM_STACK option is enabled elsewhere. | |2) Add what appears to be appropriate code the the alpha branch, still |protected behind the VM_STACK switch. I don't have an alpha machine, |so we would need to get some testers with alpha machines to try it out. | |Once there is some testing, we can consider making the change permanent |for both i386 and alpha. | [..] | |Once the alpha code is adequately tested, we can pull VM_STACK out |everywhere. |
Submitted by: "Richard Seaman, Jr." <dick@tar.com>
|
43208 |
26-Jan-1999 |
julian |
Enable Linux threads support by default. This takes the conditionals out of the code that has been tested by various people for a while. ps and friends (libkvm) will need a recompile as some proc structure changes are made.
Submitted by: "Richard Seaman, Jr." <dick@tar.com>
|
43145 |
24-Jan-1999 |
dillon |
Undo last commit - not a bug, just duplicate code. PG_MAPPED and PG_WRITEABLE are already cleared by vm_page_protect().
|
43138 |
24-Jan-1999 |
dillon |
Change all manual settings of vm_page_t->dirty = VM_PAGE_BITS_ALL to use the vm_page_dirty() inline.
The inline can thus do sanity checks ( or not ) over all cases.
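A user-space sketch of the inline and the kind of sanity check it centralizes (the PQ_CACHE check was actually added in a nearby commit); the queue constant and struct are stand-ins:

```c
#include <assert.h>

#define VM_PAGE_BITS_ALL 0xff
#define PQ_CACHE 3   /* illustrative queue index */

/* Abridged vm_page: just the fields this sketch needs. */
struct vm_page_demo {
    int queue;
    int dirty;
};

/* One central place to set dirty = VM_PAGE_BITS_ALL, so a sanity check
 * here covers every caller that used to assign the field by hand. */
static void vm_page_dirty(struct vm_page_demo *m) {
    assert(m->queue != PQ_CACHE);   /* dirty pages may not sit on PQ_CACHE */
    m->dirty = VM_PAGE_BITS_ALL;
}

static struct vm_page_demo active_page = { 0, 0 };
```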
|
43136 |
24-Jan-1999 |
dillon |
vm_map_split() used to dirty the page manually after calling vm_page_rename(), but never pulled the page off PQ_CACHE if it was on PQ_CACHE. Dirty pages in PQ_CACHE are not allowed and a KASSERT was added in -4.x to test for this... and got hit.
In -4.x, vm_page_rename() automatically dirties the page. This commit also has it deal with the PQ_CACHE case, deactivating the page in that case.
|
43134 |
24-Jan-1999 |
dillon |
Add vm_page_dirty() inline with PQ_CACHE sanity check
|
43129 |
24-Jan-1999 |
dillon |
vm_pager_put_pages() is passed an rcval array to hold per-page return values. The 'int' return value for the procedure was never used and not well defined in any case when there are mixed errors on pages, so it has been removed. vm_pager_put_pages() and associated vm_pager functions now return void.
|
43128 |
24-Jan-1999 |
dillon |
Clear PG_MAPPED as well as PG_WRITEABLE when a page is moved to the cache.
|
43127 |
24-Jan-1999 |
dillon |
Added warning printf ( needs INVARIANTS ) when busy cache page is found while trying to free memory.
|
43123 |
24-Jan-1999 |
dillon |
It is possible for a page in the cache to be busy. vm_pageout.c was not checking for this condition while it tried to free cache pages. Fixed.
|
43122 |
24-Jan-1999 |
dillon |
Add invariants to vm_page_busy() and vm_page_wakeup() to check for PG_BUSY stupidity.
|
43121 |
24-Jan-1999 |
dillon |
Clear PG_WRITEABLE in vm_page_cache(). This may or may not be a bug, but the bit should definitely be cleared.
|
43120 |
24-Jan-1999 |
dillon |
Deprecate vm_object_pmap_copy() - nobody uses it. Everyone uses vm_object_pmap_copy_1() now, apparently.
|
43119 |
24-Jan-1999 |
dillon |
Get rid of the unused old_m in vm_fault. Add INVARIANTS to test whether the page is still busy after all the hell vm_fault goes through; it is supposed to be, and printf() if it isn't. Don't panic, though.
|
43086 |
23-Jan-1999 |
dillon |
Reenable John Dyson's low-memory VM_WAIT code for page reactivations out of PQ_CACHE. Add comments explaining what it accomplishes and its limitations.
|
42979 |
21-Jan-1999 |
dillon |
Mainly changes to support the new swapper. The big adjustment is that swap blocks are now in PAGE_SIZE'd increments instead of DEV_BSIZE'd increments. We still convert to DEV_BSIZE'd increments for the backing store I/O, but everything else is in PAGE_SIZE increments.
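The unit conversion at the backing-store I/O boundary amounts to a multiply; a sketch, assuming 4K pages and 512-byte device blocks:

```c
#include <assert.h>

#define PAGE_SIZE 4096
#define DEV_BSIZE 512

/* Swap blocks are now PAGE_SIZE'd; convert to DEV_BSIZE'd block numbers
 * only when handing the request to the backing store. */
static long swapblk_to_devblk(long swapblk) {
    return swapblk * (PAGE_SIZE / DEV_BSIZE);   /* 8 device blocks/page */
}
```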
|
42978 |
21-Jan-1999 |
dillon |
Move many of the vm_pager_*() functions from vm_pager.c to inlines in vm_pager.h
|
42977 |
21-Jan-1999 |
dillon |
Move many of the vm_pager_*() functions from vm_pager.c to inlines in vm_pager.h
Added an argument to getpbuf() and relpbuf() to allow each subsystem to specify a different hard limit on the number of simultaneous physical buffers that said subsystem may allocate. Without this feature, one subsystem (e.g. the vfs clustering code) could hog *ALL* the pbufs, causing a deadlock in the pager in a low memory situation.
Same for trypbuf().
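The per-subsystem limit described above can be sketched in user space as a pair of counters: a shared pool count and a private count capped per caller. All names and values here are illustrative, not the kernel's actual getpbuf()/relpbuf() implementation.

```c
#include <assert.h>

/* Buffers remaining in the shared physical-buffer pool (illustrative). */
static int pbuf_free = 64;

/* Try to take one buffer, charged against the subsystem's private count.
 * A subsystem at its own cap fails even if the shared pool has buffers,
 * so no single caller can exhaust the pool and deadlock the pager. */
int try_get_pbuf(int *subsys_count, int subsys_max)
{
    if (*subsys_count >= subsys_max || pbuf_free == 0)
        return 0;                 /* at the cap: caller must wait or fail */
    (*subsys_count)++;
    pbuf_free--;
    return 1;
}

void rel_pbuf(int *subsys_count)
{
    (*subsys_count)--;
    pbuf_free++;
}
```

A subsystem such as the clustering code would pass its own counter and limit on every call, so its worst-case footprint in the pool is bounded.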
|
42976 |
21-Jan-1999 |
dillon |
Reorganized some of the low memory testing code to make it more useful.
Removed call to vm_object_collapse(), which can block. This was being called without the pageout code holding any sort of reference on the vm_object or vm_page_t structures being manipulated. Since this code can block, it was possible for other kernel code to shred the state the pageout code was assuming remained intact.
Fixed potential blocking condition in vm_pageout_page_free() ( which could cause a deadlock in a low-memory situation ).
Currently there is a hack in place to deal with clean filesystem meta-data polluting the inactive page queue. John doesn't like the hack, and neither do I.
Revamped and commented a portion of the pageout loop.
Added protection against potential memory deadlocks with OBJT_VNODE when using VOP_ISLOCKED(). The problem is that vp->v_data can be NULL which causes VOP_ISLOCKED() to return a less informed answer.
Remove vm_pager_sync() -- none of the pagers use it any more (the old swapper used to; the new one does not).
|
42975 |
21-Jan-1999 |
dillon |
The TAILQ hashq has been turned into a singly-linked-list link, reducing the size of vm_page_t.
SWAPBLK_NONE and SWAPBLK_MASK are defined here. These are actually more generalized than their names imply, but their placement is somewhat of a legacy issue from a prior test version of this code that put the swapblk in the vm_page_t structure. That test code was eventually thrown away. The legacy remains.
Added vm_page_flash() inline. Similar to vm_page_wakeup() except that it does not clear PG_BUSY ( one assumes that PG_BUSY is already clear ). Used by a number of routines to wakeup waiters.
Collapsed some of the code in inline calls to make other inline calls. GCC will optimize this well and it reduces duplication.
vm_page_free() and vm_page_free_zero() inlines added to convert to the proper vm_page_free_toq() call.
vm_page_sleep_busy() inline added, replacing vm_page_sleep() ( which has been removed ). This implements a much more optimizable page-waiting function.
|
42974 |
21-Jan-1999 |
dillon |
The hash table used to be a table of doubly-linked list headers ( two pointers per entry ). The table has been changed to a singly linked list of vm_page_t pointers. The table has been doubled in size, but the entries take only half the space, so the change in memory use is net zero.
The hash function has been changed, hopefully for the better. The combination of the larger hash table size and the changed function should keep the chain length down to a reasonable number (0-3, average 1).
vm_object->page_hint has been removed. This 'optimization' was not only never needed, but costs as much as a hash chain link to implement. While having page_hint in vm_object might result in better locality of reference, the cost is not worth the space in vm_object or the extra instructions in my view.
vm_page_alloc*() functions have been inlined and call a generalized non-inlined vm_page_alloc_toq() which combines the standard alloc and zero-page alloc functions together, reducing code size and the L1 cache footprint. Some reordering has been done... not much. The delinking code should be faster ( because unlinking a doubly-linked list requires four memory ops and unlinking a singly linked list only requires two ), and we get a hash consistency check for free.
vm_page_rename() now automatically sets the page's dirty bits.
vm_page_alloc() does not try to manually inline freeing a cache page. Instead, it now properly calls vm_page_free(m) ... vm_page_free() is really too complex to manually inline.
vm_await(), supporting asleep(), has been added.
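The delinking argument above (two memory operations for a singly-linked chain versus four pointer updates for a doubly-linked one) can be seen in a minimal sketch of removal from a singly-linked hash bucket. The struct and function names here are illustrative, not the kernel's.

```c
#include <assert.h>
#include <stddef.h>

struct page {
    struct page *hnext;   /* singly-linked hash chain link */
    int pindex;
};

/* Unlink 'm' from the chain headed at *bucket; returns 1 if found.
 * Walking with a pointer-to-pointer means the actual delink is a
 * single store into the predecessor's next field. */
int hash_remove(struct page **bucket, struct page *m)
{
    struct page **pp;

    for (pp = bucket; *pp != NULL; pp = &(*pp)->hnext) {
        if (*pp == m) {
            *pp = m->hnext;       /* one store delinks the page */
            m->hnext = NULL;
            return 1;
        }
    }
    return 0;
}
```

The trade-off is that removal requires walking the chain, which is why keeping the average chain length near 1 (via the larger table and better hash function) matters.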
|
42973 |
21-Jan-1999 |
dillon |
The vm_object structure is now somewhat smaller due to the removal of most of the swap-pager-specific fields, the removal of the id, and the removal of paging_offset.
A new inline, vm_object_pip_wakeupn() has been added to subtract an arbitrary number n from the paging_in_progress count and then wakeup waiters as necessary. n may be 0, resulting in a 'flash'.
|
42972 |
21-Jan-1999 |
dillon |
object->id was badly implemented. It has simply been removed.
object->paging_offset has been removed - it was used to optimize a single OBJT_SWAP collapse case yet introduced massive confusion throughout vm_object.c. The optimization was inconsequential except for the claim that it didn't have to allocate any memory. The optimization has been removed.
madvise() has been fixed. The old madvise() could be made to operate on shared objects which is a big no-no. The new one is much more careful in what it modifies. MADV_FREE was totally broken and has now been fixed.
vm_page_rename() now automatically dirties a page, so explicit dirtying of the page prior to calling vm_page_rename() has been removed.
|
42971 |
21-Jan-1999 |
dillon |
Objects associated with raw devices are no longer counted in the VM stats total because they may contain absurd numbers ( like the size of all of physical memory if you mmap() /dev/mem ).
|
42970 |
21-Jan-1999 |
dillon |
General cleanup related to the new pager. We no longer have to worry about conversions of objects to OBJT_SWAP, it is done automatically now.
Replaced manually inserted code with inline calls for busy waiting on pages, which also incidentally fixes a potential PG_BUSY race due to the code not running at splvm().
vm_objects no longer have a paging_offset field ( see vm/vm_object.c )
|
42969 |
21-Jan-1999 |
dillon |
Potential bug fix, do not just clear PG_BUSY... call vm_page_wakeup() instead to properly handle any waiters.
Added comments, added support for M_ASLEEP. Generally treat M_ flags as flags instead of constants to compare against.
|
42968 |
21-Jan-1999 |
dillon |
Removed low-memory blockages at fork. This is the wrong place to put this sort of test. We need to fix the low-memory handling in general.
|
42967 |
21-Jan-1999 |
dillon |
Mainly cleanup. Removed some inappropriate low-memory handling code and added lots of comments. Add tie-in to vm_pager ( and thus the new swapper ) to deallocate backing swap for dirtied pages on the fly.
|
42966 |
21-Jan-1999 |
dillon |
The default_pager's interaction with the swap_pager has been reorganized, and the swap_pager has been completely replaced.
The new swap pager uses the new blist radix-tree based bitmap allocator for low level swap allocation and deallocation. The new allocator is effectively O(5) while the old one was O(N), and the new allocator allocates all required memory at init time rather than allocating memory on the fly at run time.
Swap metadata is allocated in clusters and stored in a hash table, eliminating linearly allocated structures.
Many, many features have been rewritten or added. Swap space is now reallocated on the fly, providing a poor man's auto-defragmentation of swap space. Swap space that is no longer needed is freed on a timely basis, so no garbage collection is necessary.
Swap I/O is marked B_ASYNC and NFS has been fixed to do the right thing with it, so NFS-based paging now has around 10x the performance it had before ( previously NFS enforced synchronous I/O for paging ).
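The unit convention described at the top of this entry (swap managed in PAGE_SIZE increments, converted to DEV_BSIZE only for the backing-store I/O) amounts to a single scaling at the I/O boundary. This is a hedged sketch with illustrative constants (4 KiB pages, 512-byte device blocks) and a hypothetical function name.

```c
#include <assert.h>

#define PAGE_SIZE 4096
#define DEV_BSIZE 512

/* swap block number (in pages) -> device block number for the actual I/O */
long swapblk_to_devblk(long swapblk)
{
    return swapblk * (PAGE_SIZE / DEV_BSIZE);
}
```

Keeping all internal bookkeeping in page units makes the allocator and metadata simpler; only the driver-facing request needs the finer-grained device units.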
|
42957 |
21-Jan-1999 |
dillon |
This is a rather large commit that encompasses the new swapper, changes to the VM system to support the new swapper, VM bug fixes, several VM optimizations, and some additional revamping of the VM code. The specific bug fixes will be documented with additional forced commits. This commit is somewhat rough in regards to code cleanup issues.
Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>
|
42453 |
10-Jan-1999 |
eivind |
KNFize, by bde.
|
42408 |
08-Jan-1999 |
eivind |
Split DIAGNOSTIC -> DIAGNOSTIC, INVARIANTS, and INVARIANT_SUPPORT as discussed on -hackers.
Introduce 'KASSERT(assertion, ("panic message", args))' for simple check + panic.
Reviewed by: msmith
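The KASSERT form introduced here takes a parenthesized printf-style message as its second argument, which lets the macro forward the whole argument list to a variadic panic routine. Below is a user-space sketch of how such a macro can be defined; panic() is emulated with printf()+abort() for illustration, and is not the kernel's implementation.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define INVARIANTS                 /* enable the checks for this sketch */

/* Stand-in for the kernel's panic(): print the message and abort. */
#define panic(...) (printf(__VA_ARGS__), abort())

/* The second argument arrives pre-parenthesized, e.g.
 * KASSERT(x > 0, ("x is %d", x)), so "panic msg" expands to a call. */
#ifdef INVARIANTS
#define KASSERT(exp, msg) do { if (!(exp)) panic msg; } while (0)
#else
#define KASSERT(exp, msg) do { } while (0)
#endif
```

When INVARIANTS is not defined the macro compiles away entirely, so the checks cost nothing in production kernels.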
|
42379 |
07-Jan-1999 |
julian |
Changes to the LINUX_THREADS support to only allocate extra memory for shared signal handling when there is shared signal handling being used.
This removes the main objection to making the shared signal handling a standard ability in rfork() and friends and 'unconditionalising' this code. (i.e. the allocation of an extra 328 bytes per process).
Signal handling information remains in the U area until such a time as its reference count would be incremented to > 1. At that point a new struct is malloc'd and maintained in KVM so that it can be shared between the processes (threads) using it.
A function to check the reference count and move the struct back to the U area when it drops back to 1 is also supplied. Signal information is therefore now swappable for all processes that are not sharing that information with other processes. This should address the concerns raised by Garrett and others.
Submitted by: "Richard Seaman, Jr." <dick@tar.com>
|
42360 |
06-Jan-1999 |
julian |
Add (but don't activate) code for a special VM option to make downward growing stacks more general. Add (but don't activate) code to use the new stack facility when running threads, (specifically the linux threads support). This allows people to use both linux compiled linuxthreads, and also the native FreeBSD linux-threads port.
The code is conditional on VM_STACK. Not using this will produce the old heavily tested system.
Submitted by: Richard Seaman <dick@tar.com>
|
42248 |
02-Jan-1999 |
bde |
Ifdefed conditionally used simplock variables.
|
42153 |
29-Dec-1998 |
dt |
Don't free swap in swap_pager_getpages(): this code probably caused the "dying daemons" problem. (I thought this code was introduced in rev.1.80, but it just relaxed the condition.)
Also, kill related "suggest more swap space" warning (also introduced in 1.80). It was confusing, to say the least...
Requested by: msmith Not objected by: dg
|
42026 |
23-Dec-1998 |
dillon |
Update comments to routines in vm_page.c, most especially whether a routine can block or not as part of a general effort to carefully document blocking/non-blocking calls in the kernel.
|
41936 |
19-Dec-1998 |
julian |
Fix two bogons created by 'patch(1)' in my last commit.
|
41931 |
19-Dec-1998 |
julian |
Reviewed by: Luoqi Chen, Jordan Hubbard Submitted by: "Richard Seaman, Jr." <lists@tar.com> Obtained from: linux :-)
Code to allow Linux Threads to run under FreeBSD.
Not enabled by default. This code is dependent on the conditional COMPAT_LINUX_THREADS (suggested by Garrett). This is not yet a 'real' option but will be within some number of hours.
|
41620 |
09-Dec-1998 |
dt |
Don't disable mmap with large file offset.
|
41591 |
07-Dec-1998 |
archie |
The "easy" fixes for compiling the kernel -Wunused: remove unreferenced static and local variables, goto labels, and functions declared but not defined.
|
41514 |
04-Dec-1998 |
archie |
Examine all occurrences of sprintf(), strcat(), and str[n]cpy() for possible buffer overflow problems. Replaced most sprintf()'s with snprintf(); for other cases, added terminating NUL bytes where appropriate, replaced constants like "16" with sizeof(), etc.
These changes include several bug fixes, but most changes are for maintainability's sake. Any instance where it wasn't "immediately obvious" that a buffer overflow could not occur was made safer.
Reviewed by: Bruce Evans <bde@zeta.org.au> Reviewed by: Matthew Dillon <dillon@apollo.backplane.com> Reviewed by: Mike Spengler <mks@networkcs.com>
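The two patterns applied throughout this sweep are easy to show in isolation: bound every formatted write by the destination size, and never rely on strncpy() for NUL termination. The function names below are hypothetical, for illustration only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* before: sprintf(buf, "device%d", unit); -- overflows a small buf.
 * after: the write is bounded by the destination size. */
void describe_device(char *buf, size_t buflen, int unit)
{
    snprintf(buf, buflen, "device%d", unit);
}

/* strncpy() alone does not guarantee a terminating NUL when the source
 * fills the destination, so terminate explicitly. */
void safe_copy(char *dst, size_t dstlen, const char *src)
{
    strncpy(dst, src, dstlen - 1);
    dst[dstlen - 1] = '\0';
}
```

Using sizeof() on the destination (rather than a constant like "16") keeps the bound correct if the buffer's size ever changes.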
|
41503 |
04-Dec-1998 |
rvb |
In vnode_pager_input_old, set auio.uio_procp = curproc instead of auio.uio_procp = (struct proc *) 0
|
41322 |
25-Nov-1998 |
dg |
Add missing splvm protection around unqueue call. Without this, the page queues would eventually get corrupted.
|
41250 |
19-Nov-1998 |
bde |
Fixed a null pointer panic in spc_free(). swap_pager_putpages() almost always causes this panic for the curproc != pageproc case. This case apparently doesn't happen in normal operation, but it happens when vm_page_alloc_contig() is called when there is a memory hogging application that hasn't already been paged out.
PR: 8632 Reviewed by: info@opensound.com (Dev Mazumdar), dg Broken in: rev.1.89 (1998/02/23)
|
41093 |
11-Nov-1998 |
dg |
Closed a small race condition between wiring/unwiring pages that involved the page's wire_count.
|
41059 |
10-Nov-1998 |
peter |
add #include <sys/kernel.h> where it's needed by MALLOC_DEFINE()
|
41004 |
08-Nov-1998 |
dfr |
* Fix a couple of places in the device pager where an address was truncated to 32 bits. * Change the calling convention of the device mmap entry point to pass a vm_offset_t instead of an int for the offset allowing devices with a larger memory map than (1<<32) to be supported on the alpha (/dev/mem is one such).
These changes are required to allow the X server to mmap the various I/O regions used for device port and memory access on the alpha.
|
40931 |
05-Nov-1998 |
dg |
Implemented zero-copy TCP/IP extensions via sendfile(2) - send a file to a stream socket. sendfile(2) is similar to implementations in HP-UX, Linux, and other systems, but the API is more extensive and addresses many of the complaints that the Apache Group and others have had with those other implementations. Thanks to Marc Slemko of the Apache Group for helping me work out the best API for this. Anyway, this has the "net" result of speeding up sends of files over TCP/IP sockets by about 10X (that is to say, uses 1/10th of the CPU cycles) when compared to a traditional read/write loop.
|
40794 |
31-Oct-1998 |
peter |
Add John Dyson's SYSCTL descriptions, and an export of more stats to a sysctl hierarchy (vm.stats.*). SYSCTL descriptions are only present in source, they do not get compiled into the binaries taking up memory.
|
40790 |
31-Oct-1998 |
peter |
Use TAILQ macros for clean/dirty block list processing. Set b_xflags rather than abusing the list next pointer with a magic number.
|
40701 |
28-Oct-1998 |
dg |
Fixed wrong comments in and about vm_page_deactivate().
|
40700 |
28-Oct-1998 |
dg |
Added a second argument, "activate" to the vm_page_unwire() call so that the caller can select either inactive or active queue to put the page on.
|
40673 |
27-Oct-1998 |
dg |
Added needed splvm() protection around object page traversal in vm_object_terminate().
|
40650 |
25-Oct-1998 |
bde |
Don't follow null bdevsw pointers. The `major(dev) < nblkdev' test rotted when bdevsw[] became sparse. We still depend on magic to avoid having to check that (v_rdev) device numbers in vnodes are not NODEV.
Removed a redundant `major(dev) < nblkdev' test instead of updating it.
Don't follow a garbage bdevsw pointer for attempts to swap on empty regular files. This case currently can't happen. Swapping on regular files is ifdefed out in swapon() and isn't attempted for empty files in nfs_mountroot().
|
40648 |
25-Oct-1998 |
phk |
Nitpicking and dusting performed on a train. Removes trivial warnings about unused variables, labels and other lint.
|
40605 |
23-Oct-1998 |
dg |
Oops, revert part of last fix. vm_pager_dealloc() can't be called until after the pages are removed from the object...so fix the problem by not printing the diagnostic for wired fictitious pages (which is normal).
|
40604 |
23-Oct-1998 |
dg |
Fixed two bugs in recent commit: in vm_object_terminate, vm_pager_dealloc needs to be called prior to freeing remaining pages in the object so that the device pager has an opportunity to grab its "fake" pages. Also, in the case of wired pages, the page must be made busy prior to calling vm_page_remove. This is a difference from 2.2.x that I overlooked when I brought these changes forward.
|
40560 |
22-Oct-1998 |
dg |
Make the VM system handle the case where a terminating object contains legitimately wired pages. Currently we print a diagnostic when this happens, but this will be removed soon when it will be common for this to occur with zero-copy TCP/IP buffers.
|
40558 |
22-Oct-1998 |
dg |
Convert fake page allocs to use the zone allocator, thus eliminating the private pool management code in here.
|
40557 |
21-Oct-1998 |
dg |
Set m->object to NULL in dev_pager_getfake().
|
40548 |
21-Oct-1998 |
dg |
Nuked PG_TABLED flag. Replaced with m->object != NULL.
|
40546 |
21-Oct-1998 |
dg |
Add a diagnostic printf for freeing a wired page. This will eventually be turned into a panic, but I want to make sure that all cases of freeing pages with wire_count==1 (which is/was allowed) have first been fixed.
|
40286 |
13-Oct-1998 |
dg |
Fixed two potentially serious classes of bugs:
1) The vnode pager wasn't properly tracking the file size due to "size" being page rounded in some cases and not in others. This sometimes resulted in corrupted files. First noticed by Terry Lambert. Fixed by changing the "size" pager_alloc parameter to be a 64bit byte value (as opposed to a 32bit page index) and changing the pagers and their callers to deal with this properly. 2) Fixed a bogus type cast in round_page() and trunc_page() that caused some 64bit offsets and sizes to be scrambled. Removing the cast required adding casts at a few dozen callers. There may be problems with other bogus casts in close-by macros. A quick check seemed to indicate that those were okay, however.
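The second bug above is worth seeing concretely: masking through a 32-bit type silently discards the high bits of a 64-bit offset. The macros below are an illustrative reconstruction (4 KiB pages assumed), not the actual round_page()/trunc_page() definitions.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_MASK 0xfffULL     /* illustrative: 4 KiB pages */

/* Buggy form: the 32-bit cast scrambles any offset above 4 GiB. */
#define trunc_page_bad(x) ((uint32_t)(x) & ~(uint32_t)PAGE_MASK)

/* Fixed form: stay in the 64-bit offset type throughout. */
#define trunc_page_ok(x)  ((uint64_t)(x) & ~PAGE_MASK)
```

For an offset just above 4 GiB, the buggy macro returns an address near zero, which is exactly the kind of scrambling that corrupted large-file offsets.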
|
40087 |
09-Oct-1998 |
jdp |
Fix a panic on SMP systems, caused by sleeping while holding a simple-lock.
The reviewer raises the following caveat: "I believe these changes open a non-critical race condition when adding memory to the pool for the zone. I think what will happen is that you could have two threads that are simultaneously adding additional memory when the pool runs out. This appears to not be a problem, however, since the re-aquisition of the lock will protect the list pointers." The submitter agrees that the race is non-critical, and points out that it already existed for the non-SMP case. He suggests that perhaps a sleep lock (using the lock manager) should be used to close that race. This might be worth revisiting after 3.0 is released.
Reviewed by: dg (David Greenman) Submitted by: tegge (Tor Egge)
|
39873 |
01-Oct-1998 |
jdp |
Fix a bug in which a page index was used where a byte offset was expected. This bug caused builds of Modula-3 to fail in mysterious ways on SMP kernels. More precisely, such builds failed on systems with kern.fast_vfork equal to 0, the default and only supported value for SMP kernels.
PR: kern/7468 Submitted by: tegge (Tor Egge)
|
39770 |
29-Sep-1998 |
abial |
Make #define NO_SWAPPING a normal kernel config option.
Reviewed by: jkh
|
39739 |
28-Sep-1998 |
rvb |
John Dyson approved of this solution; make vnode_pager_input_old set m->valid
|
39700 |
28-Sep-1998 |
dg |
Be more selective about when we clear p->valid. Submitted by: John Dyson <toor@dyson.iquest.net>
|
39512 |
20-Sep-1998 |
bde |
Removed unused file.
|
38866 |
05-Sep-1998 |
bde |
Instantiate `nfs_mount_type' in a standard file so that it is present when nfs is an LKM. Declare it in a header file. Don't forget to use it in non-Lite2 code. Initialize it to -1 instead of to 0, since 0 will soon be the mount type number for the first vfs loaded.
NetBSD uses strcmp() to avoid this ugly global.
|
38799 |
04-Sep-1998 |
dfr |
Cosmetic changes to the PAGE_XXX macros to make them consistent with the other objects in vm.
|
38729 |
01-Sep-1998 |
wollman |
Separate wakeup conditions for page I/O count (pg_busy) and lock (PG_BUSY). This is not a complete solution to the deadlock, but the additional wakeups have helped in my observation.
Suggested by: John Dyson
|
38542 |
25-Aug-1998 |
luoqi |
Fix a rounding problem that causes vnode pager to fail to remove the last partially filled page during a truncation.
PR: kern/7422
|
38517 |
24-Aug-1998 |
dfr |
Change various syscalls to use size_t arguments instead of u_int.
Add some overflow checks to read/write (from bde).
Change all modifications to vm_page::flags, vm_page::busy, vm_object::flags and vm_object::paging_in_progress to use operations which are not interruptible.
Reviewed by: Bruce Evans <bde@zeta.org.au>
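The problem with a plain `m->flags |= PG_BUSY` is that it compiles to a read-modify-write sequence an interrupt can split on most architectures. A user-space sketch of the fix, using C11 atomics as a stand-in for the kernel's uninterruptible primitives (the flag value and function names are illustrative):

```c
#include <assert.h>
#include <stdatomic.h>

#define PG_BUSY 0x0010          /* illustrative flag bit */

/* Set a flag bit in one uninterruptible operation. */
void page_flag_set(atomic_uint *flags, unsigned int bit)
{
    atomic_fetch_or(flags, bit);
}

/* Clear a flag bit in one uninterruptible operation. */
void page_flag_clear(atomic_uint *flags, unsigned int bit)
{
    atomic_fetch_and(flags, ~bit);
}
```

As the r38135 entry below notes, the i386 escaped corruption only because the compiler happened to emit a single memory read-modify-write instruction; other architectures need the operation to be explicitly atomic.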
|
38479 |
22-Aug-1998 |
mckay |
Correct/clarify some comments.
|
38298 |
13-Aug-1998 |
dfr |
Protect all modifications to paging_in_progress with splvm().
|
38135 |
06-Aug-1998 |
dfr |
Protect all modifications to paging_in_progress with splvm(). The i386 managed to avoid corruption of this variable by luck (the compiler used a memory read-modify-write instruction which wasn't interruptible) but other architectures cannot.
With this change, I am now able to 'make buildworld' on the alpha (sfx: the crowd goes wild...)
|
37918 |
28-Jul-1998 |
bde |
Fixed two spl nesting bugs. They caused (at least) the entire pageout daemon to run at splvm() forever after swap_pager_putpages() is called from vm_pageout_scan().
Broken in: rev.1.189 (1998/02/23)
|
37874 |
26-Jul-1998 |
dfr |
Notify pmap when a page is freed on the alpha to allow it to clean up its emulated modified/referenced bits.
|
37843 |
22-Jul-1998 |
dg |
Improved pager input failure message.
|
37821 |
22-Jul-1998 |
phk |
There is a comment in vm_param.h which doesn't correspond to any code still left in there. The macros it describes disappeared sometime since 4.4BSD Lite.
PR: 7246 Reviewed by: phk Submitted by: Stefan Eggers <seggers@semyam.dinoco.de>
|
37653 |
15-Jul-1998 |
bde |
Cast pointers to [u]intptr_t instead of to [unsigned] long.
|
37649 |
15-Jul-1998 |
bde |
Cast pointers to uintptr_t/intptr_t instead of to u_long/long, respectively. Most of the longs should probably have been u_longs, but this change is just to prevent warnings about casts between pointers and integers of different sizes, not to fix poorly chosen types.
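The point of the uintptr_t cast is portability: uintptr_t is defined to round-trip a pointer losslessly, while long merely happens to be pointer-sized on some platforms. A minimal illustration (the function name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if the pointer survives a round trip through an integer.
 * With uintptr_t this is guaranteed by the standard; with long it is
 * only an accident of the ABI (and false on e.g. 64-bit Windows). */
int ptr_roundtrips(void *p)
{
    uintptr_t u = (uintptr_t)p;
    return (void *)u == p;
}
```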
|
37641 |
14-Jul-1998 |
bde |
Print pointers using %p instead of attempting to print them by casting them to long, etc. Fixed some nearby printf bogons (sign errors not warned about by gcc, and style bugs, but not truncation of vm_ooffset_t's).
|
37640 |
14-Jul-1998 |
bde |
Print pointers using %p instead of attempting to print them by casting them to long, etc. Fixed some nearby printf bogons (sign errors not warned about by gcc, and style bugs, but not truncation of vm_ooffset_t's).
Use slightly less bogus casts for passing pointers to ddb command functions.
|
37563 |
11-Jul-1998 |
bde |
Fixed printf format errors.
|
37562 |
11-Jul-1998 |
bde |
Fixed printf format errors.
|
37555 |
11-Jul-1998 |
bde |
Fixed printf format errors.
|
37546 |
10-Jul-1998 |
alex |
Removed no longer valid comment about swb_block being int instead of daddr_t.
PR: 7238 Submitted by: Stefan Eggers <seggers@semyam.dinoco.de>
|
37545 |
10-Jul-1998 |
alex |
Removed unnecessary test from if/else construct.
PR: 7233 Submitted by: Stefan Eggers <seggers@semyam.dinoco.de>
|
37395 |
05-Jul-1998 |
dfr |
Don't truncate the return value of mmap to sizeof(int).
|
37389 |
04-Jul-1998 |
julian |
There is no such thing any more as "struct bdevsw".
There is only cdevsw (which should be renamed in a later edit to deventry or something). cdevsw contains the union of what were in both bdevsw and cdevsw entries. The bdevsw[] table still exists and is a second pointer to the cdevsw entry of the device. Its major is in d_bmaj rather than d_maj. Some cleanup still to happen (e.g. dsopen now gets two pointers to the same cdevsw struct instead of one to a bdevsw and one to a cdevsw).
rawread()/rawwrite() went away as part of this though it's not strictly the same patch, just that it involves all the same lines in the drivers.
cdroms no longer have write() entries (they did have rawwrite (?)). tapes no longer have support for bdev operations.
Reviewed by: Eivind Eklund and Mike Smith Changes suggested by eivind.
|
37384 |
04-Jul-1998 |
julian |
VOP_STRATEGY grows an (struct vnode *) argument as the value in b_vp is often not really what you want. (and needs to be frobbed). more cleanups will follow this. Reviewed by: Bruce Evans <bde@freebsd.org>
|
37282 |
30-Jun-1998 |
jmg |
Document some VM paging options for cache sizes: PQ_NOOPT (no coloring), PQ_LARGECACHE (used for 512K/16K cache), PQ_HUGECACHE (used for 1024K/16K cache).
|
37153 |
25-Jun-1998 |
phk |
Remove bdevsw_add(), change the only two users to use bdevsw_add_generic(). Extend cdevsw to be superset of bdevsw. Remove non-functional bdev lkm support. Teach wcd what the open() args mean.
|
37101 |
21-Jun-1998 |
bde |
Removed unused includes.
|
37094 |
21-Jun-1998 |
bde |
Removed unused includes.
|
36735 |
07-Jun-1998 |
dfr |
This commit fixes various 64bit portability problems required for FreeBSD/alpha. The most significant item is to change the command argument to ioctl functions from int to u_long. This change brings us inline with various other BSD versions. Driver writers may like to use (__FreeBSD_version == 300003) to detect this change.
The prototype FreeBSD/alpha machdep will follow in a couple of days time.
|
36677 |
05-Jun-1998 |
dg |
Changed the log() of "Out of mbuf clusters - increase maxusers" to a printf() of "Out of mbuf clusters - adjust NMBCLUSTERS or increase maxusers" so that the message is more informative and so that it will appear in the kernel message buffer.
|
36583 |
02-Jun-1998 |
dyson |
Cleanup and remove some dead code from the initialization.
|
36582 |
02-Jun-1998 |
dyson |
Correct sleep priority.
|
36326 |
24-May-1998 |
dyson |
Support a 16K first level cache for 512K 2nd level. Also, add support for 1MB 2nd level cache.
|
36275 |
21-May-1998 |
dyson |
Make flushing dirty pages work correctly on filesystems that unexpectedly do not complete writes even with sync I/O requests. This should help the behavior of mmaped files when using softupdates (and perhaps in other circumstances also.)
|
36177 |
19-May-1998 |
peter |
Make the previous commit compile..
|
36164 |
18-May-1998 |
guido |
Plug hole reported on Bugtraq: do not allow mmap with WRITE privs for append-only and immutable files.
Obtained from: OpenBSD (partly)
|
36112 |
16-May-1998 |
dyson |
An important fix for proper inheritance of backing objects for object splits. Another excellent detective job by Tor. Submitted by: Tor Egge <Tor.Egge@idi.ntnu.no>
|
35694 |
04-May-1998 |
dyson |
Fix the shm panic. I mistakenly used the shadow_count to keep the object from being split, and instead added an OBJ_NOSPLIT.
|
35669 |
04-May-1998 |
dyson |
Work around some VM bugs, the worst being an overly aggressive swap space free calculation. More complete fixes will be forthcoming, in a week.
|
35615 |
02-May-1998 |
dyson |
Another minor cleanup of the split code. Make sure that pages are busied during the entire time, so that the waits for pages being unbusy don't make the objects inconsistent.
|
35612 |
02-May-1998 |
peter |
Seatbelts for vm_page_bits() in case a file offset is passed in rather than the page offset. If a large file offset was passed in, a large negative array index could be generated which could cause page faults etc at worst and file corruption at the least. (Pages are allocated within file space on page alignment boundaries, so a file offset being passed in here is harmless to DTRT. The case where this was happening has already been fixed though, this is in case it happens again).
Reviewed by: dyson
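The "seatbelt" idea above (tolerating a file offset where a within-page offset was expected) can be sketched as masking the offset down to its within-page part before computing the DEV_BSIZE-granular valid-bits mask. This is a hedged reconstruction with illustrative constants (4 KiB pages, 512-byte blocks) and a hypothetical function name, not the actual vm_page_bits().

```c
#include <assert.h>

#define PAGE_SIZE 4096
#define DEV_BSIZE 512

/* Return a bitmask with one bit per DEV_BSIZE chunk of the page that
 * the [base, base+size) range covers. */
int vm_page_bits_sketch(int base, int size)
{
    int first, last;

    base &= PAGE_SIZE - 1;         /* seatbelt: drop any page-aligned part
                                    * of a mistakenly passed file offset */
    if (size <= 0)
        return 0;
    if (base + size > PAGE_SIZE)
        size = PAGE_SIZE - base;   /* clamp to the page boundary */

    first = base / DEV_BSIZE;
    last = (base + size - 1) / DEV_BSIZE;
    return (1 << (last + 1)) - (1 << first);
}
```

Without the mask, a large file offset would produce a huge (or, after overflow, negative) chunk index, which is exactly the out-of-bounds array access the commit guards against.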
|
35571 |
01-May-1998 |
dyson |
Fix minor bug with new over used swap fix.
|
35499 |
29-Apr-1998 |
dyson |
Add a needed prototype, and fix a panic problem with the new memory code.
|
35497 |
29-Apr-1998 |
dyson |
Tighten up management of memory and swap space during map allocation, deallocation cycles. This should provide a measurable improvement on swap and memory allocation on loaded systems. It is unlikely a complete solution. Also, provide more map info with procfs. Chuck Cranor spurred on this improvement.
|
35485 |
28-Apr-1998 |
dyson |
Fix a pseudo-swap leak problem. This mitigates "leaks" due to freeing partial objects; previously, not freeing the entire object didn't free any of it. Simple fix to the map code. Reviewed by: dg
|
35447 |
25-Apr-1998 |
dyson |
Correct copyright.
|
35210 |
15-Apr-1998 |
bde |
Support compiling with `gcc -ansi'.
|
34961 |
30-Mar-1998 |
phk |
Eradicate the variable "time" from the kernel, using various measures. "time" wasn't an atomic variable, so splfoo() protection was needed around any access to it, unless you just wanted the seconds part.
Most uses of time.tv_sec now uses the new variable time_second instead.
gettime() changed to getmicrotime().
Remove a couple of unneeded splfoo() protections, the new getmicrotime() is atomic, (until Bruce sets a breakpoint in it).
A couple of places needed random data, so use read_random() instead of mucking about with time which isn't random.
Add a new nfs_curusec() function.
Mark a couple of bogosities involving the now disappeared time variable.
Update ffs_update() to avoid the weird "== &time" checks, by fixing the one remaining call that passed &time as an arg.
Change profiling in ncr.c to use ticks instead of time. Resolution is the same.
Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call hzto() which subtracts time" sequences.
Reviewed by: bde
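tvtohz() replaces the racy "splfoo(); add current time; call hzto(), which subtracts current time" dance with a direct conversion of a relative timeval to ticks. A user-space sketch with an illustrative clock rate and a local struct to stay self-contained (the rounding mirrors the usual convention of never returning a timeout shorter than requested):

```c
#include <assert.h>

#define HZ 100                      /* illustrative ticks per second */

struct timeval_s { long tv_sec; long tv_usec; };

/* Convert a relative time to a tick count: round the microseconds up,
 * then add one tick for the partially elapsed current tick. */
long tvtohz_sketch(const struct timeval_s *tv)
{
    return tv->tv_sec * HZ
        + (tv->tv_usec * HZ + 999999) / 1000000
        + 1;
}
```

Because no "current time" is read, there is nothing for an interrupt to change between the add and the subtract, so no spl protection is needed around the conversion.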
|
34924 |
28-Mar-1998 |
bde |
Moved some #includes from <sys/param.h> nearer to where they are actually used.
|
34611 |
16-Mar-1998 |
dyson |
Some VM improvements, including elimination of a lot of Sig-11 problems. Tor Egge and others have helped with various VM bugs lately, but don't blame him -- blame me!!!
pmap.c: 1) Create an object for kernel page table allocations. This fixes a bogus allocation method previously used for such, by grabbing pages from the kernel object, using bogus pindexes. (This was a code cleanup, and perhaps a minor system stability issue.)
pmap.c: 2) Pre-set the modify and accessed bits when prudent. This will decrease bus traffic under certain circumstances.
vfs_bio.c, vfs_cluster.c: 3) Rather than calculating the beginning virtual byte offset multiple times, stick the offset into the buffer header, so that the calculated offset can be reused. (Long long multiplies are often expensive, and this is a probably unmeasurable performance improvement, and code cleanup.)
vfs_bio.c: 4) Handle write recursion more intelligently (but not perfectly) so that it is less likely to cause a system panic, and is also much more robust.
vfs_bio.c: 5) getblk incorrectly wrote out blocks that are incorrectly sized. The problem is fixed; blocks are now written out ONLY when B_DELWRI is true.
vfs_bio.c: 6) Check that already constituted buffers have fully valid pages. If not, then make sure that the B_CACHE bit is not set. (This was a major source of Sig-11 type problems.)
vfs_bio.c: 7) Fix a potential system deadlock due to an incorrectly specified sleep priority while waiting for a buffer write operation. The change that I made opens the system up to serious problems, and we need to examine the issue of process sleep priorities.
vfs_cluster.c, vfs_bio.c: 8) Make clustered reads work more correctly (and more completely) when buffers are already constituted, but not fully valid. (This was another system reliability issue.)
vfs_subr.c, ffs_inode.c: 9) Create a vtruncbuf function, which is used by filesystems that can truncate files. The vinvalbuf forced a file sync type operation, while vtruncbuf only invalidates the buffers past the new end of file, and also invalidates the appropriate pages. (This was a system reliability and performance issue.)
10) Modify FFS to use vtruncbuf.
vm_object.c: 11) Make the object rundown mechanism for OBJT_VNODE type objects work more correctly. Included in that fix, create pager entries for the OBJT_DEAD pager type, so that paging requests that might slip in during race conditions are properly handled. (This was a system reliability issue.)
vm_page.c: 12) Make some of the page validation routines be a little less picky about arguments passed to them. Also, make page invalidation change the object generation count so that we handle generation counts a little more robustly.
vm_pageout.c: 13) Further reduce pageout daemon activity when the system doesn't need help from it. There should be no additional performance decrease even when the pageout daemon is running. (This was a significant performance issue.)
vnode_pager.c: 14) Teach the vnode pager to handle race conditions during vnode deallocations.
|
34525 |
12-Mar-1998 |
guido |
Fix for mmap of char devices bug as described in OpenBSD advisory of 1998/02/20 Reviewed by: John Dyson Submitted by: "Cy Schubert" <cschuber@uumail.gov.bc.ca>
|
34403 |
09-Mar-1998 |
msmith |
Complement diagnostic messages about missing per-FS VOP page operations, but don't make their absence fatal. Submitted by: terry
|
34321 |
08-Mar-1998 |
dyson |
Quell unneeded pageout daemon activity.
|
34320 |
08-Mar-1998 |
dyson |
Remove a very ill advised vm_page_protect. This was being called for a non-managed page. That is a big no-no.
|
34236 |
08-Mar-1998 |
dyson |
Remove some cruft left over from my megacommit. A page rotation optimization was a good idea, but can cause instability. That optimization is now removed.
|
34235 |
08-Mar-1998 |
dyson |
Several minor fixes: 1) When freeing pages, it is a good idea to protect them off. (This is probably gratuitous, but good form.) 2) Allow collapsing pages in the backing object that are PQ_CACHE. This will improve memory utilization. 3) Correct the collapse code so that pages that were on the cache queue are moved to the inactive queue. This is done when pages are marked dirty (so that those pages will be properly paged out instead of freed), so that cached pages will not be paradoxically marked dirty.
|
34206 |
07-Mar-1998 |
dyson |
This mega-commit is meant to fix numerous interrelated problems. There has been some bitrot and incorrect assumptions in the vfs_bio code. These problems have manifested themselves worse on NFS type filesystems, but can still affect local filesystems under certain circumstances. Most of the problems have involved mmap consistency, and as a side-effect broke the vfs.ioopt code. This code might have been committed separately, but almost everything is interrelated.
1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that are fully valid. 2) Rather than deactivating erroneously read initial (header) pages in kern_exec, we now free them. 3) Fix the rundown of non-VMIO buffers that are in an inconsistent (missing vp) state. 4) Fix the disassociation of pages from buffers in brelse. The previous code had rotted and was faulty in a couple of important circumstances. 5) Remove a gratuitous buffer wakeup in vfs_vmio_release. 6) Remove a crufty and currently unused cluster mechanism for VBLK files in vfs_bio_awrite. When the code is functional, I'll add back a cleaner version. 7) The page busy count wakeups associated with the buffer cache usage were incorrectly cleaned up in a previous commit by me. Revert to the original, correct version, but with a cleaner implementation. 8) The cluster read code now tries to keep data associated with buffers more aggressively (without breaking the heuristics) when it is presumed that the read data (buffers) will be soon needed. 9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The delay loop waiting is not useful for filesystem locks, due to the length of the time intervals. 10) Correct and clean-up spec_getpages. 11) Implement a fully functional nfs_getpages, nfs_putpages. 12) Fix nfs_write so that modifications are coherent with the NFS data on the server disk (at least as well as NFS seems to allow.) 13) Properly support MS_INVALIDATE on NFS. 14) Properly pass down MS_INVALIDATE to lower levels of the VM code from vm_map_clean. 15) Better support the notion of pages being busy but valid, so that fewer in-transit waits occur. (use p->busy more for pageouts instead of PG_BUSY.) Since the page is fully valid, it is still usable for reads. 16) It is possible (in error) for cached pages to be busy. Make the page allocation code handle that case correctly. (It should probably be a printf or panic, but I want the system to handle coding errors robustly. 
I'll probably add a printf.) 17) Correct the design and usage of vm_page_sleep. It didn't handle consistency problems very well, so make the design a little less lofty. After vm_page_sleep, if it ever blocked, it is still important to relookup the page (if the object generation count changed), and verify its status (always.) 18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up. 19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush. 20) Fix vm_pager_put_pages and its descendants to support an int flag instead of a boolean, so that we can pass down the invalidate bit.
|
34202 |
07-Mar-1998 |
dyson |
Make vm_fault much cleaner by removing the evil macro inlines, and put a lot of its context into a data structure. This allows significant shortening of its codepath, and will significantly decrease its cache footprint.
Also, add some stats to vmmeter. Note that you'll have to rebuild/recompile vmstat, systat, etc... Otherwise, you'll get "very interesting" paging stats.
|
34030 |
04-Mar-1998 |
dufault |
Reviewed by: msmith, bde long ago POSIX.4 headers and sysctl variables. Nothing should change unless POSIX4 is defined or _POSIX_VERSION is set to 199309.
|
33936 |
01-Mar-1998 |
dyson |
1) Use a more consistent page wait methodology. 2) Do not unnecessarily force page blocking when paging pages out. 3) Further improve swap pager performance and correctness, including fixing the paging in progress deadlock (except in severe I/O error conditions.) 4) Enable vfs_ioopt=1 as a default. 5) Fix and enable the page prezeroing in SMP mode.
All in all, SMP systems especially should show a significant improvement in "snappyness."
|
33847 |
26-Feb-1998 |
msmith |
In the author's words:
These diffs implement the first stage of a VOP_{GET|PUT}PAGES pushdown for local media FS's.
See ffs_putpages in /sys/ufs/ufs/ufs_readwrite.c for implementation details for generic *_{get|put}pages for local media FS's. Support is trivial to add for any FS that formerly relied on the default behaviour of the vnode_pager in EOPNOTSUPP cases (just copy the ffs_getpages() code for the FS in question's *_{get|put}pages).
Obviously, it would be better if each local media FS implemented a more optimal method, instead of calling an exported interface from the /sys/vm/vnode_pager.c, but this is a necessary first step in getting the FS's to a point where they can be supplied with better implementations on a case-by-case basis.
Obviously, the cd9660_putpages() can be rather trivial (since it is a read-only FS type 8-)).
A slight (temporary) modification is made to print a diagnostic message in the case where the underlying filesystem attempts to engage in the previous behaviour. Failure is likely to be ungraceful.
Submitted by: terry@freebsd.org (Terry Lambert)
|
33817 |
25-Feb-1998 |
dyson |
Fix page prezeroing for SMP, and fix some potential paging-in-progress hangs. The paging-in-progress diagnosis was a result of Tor Egge's excellent detective work. Submitted by: Partially from Tor Egge.
|
33784 |
24-Feb-1998 |
dyson |
Correct some severe VM tuning problems for small systems (<=16MB), and improve tuning on larger systems. (A couple of the VM tuning params for small systems were so badly chosen that the system could hang under load.)
The broken tuning was originally my fault.
|
33758 |
23-Feb-1998 |
dyson |
Significantly improve the efficiency of the swap pager, which appears to have declined due to code-rot over time. The swap pager rundown code has been cleaned up, and unneeded wakeups removed. Lots of splbio's are changed to splvm's. Also, set the dynamic tunables for the pageout daemon to be more sane for larger systems (thereby decreasing the daemon overhead.)
|
33757 |
23-Feb-1998 |
dyson |
Try to dynamically size the VM_KMEM_SIZE (but it can still be overridden exactly as before.) I had problems with the system properly handling the number of vnodes when there is a lot of system memory, and the default VM_KMEM_SIZE. Two new options "VM_KMEM_SIZE_SCALE" and "VM_KMEM_SIZE_MAX" have been added to support better auto-sizing for systems with greater than 128MB.
Add some accounting for vm_zone memory allocations, and provide properly for vm_zone allocations out of the kmem_map. Also move the vm_zone allocation stats to the VM OID tree from the KERN OID tree.
|
33676 |
20-Feb-1998 |
bde |
Removed unused #includes.
|
33622 |
19-Feb-1998 |
msmith |
Move the 'sw' device off block major #1, which is now occupied by 'wfd'.
|
33181 |
09-Feb-1998 |
eivind |
Staticize.
|
33173 |
08-Feb-1998 |
dyson |
Fix an argument to vn_lock. It appears that a lot of the vn_lock usage is a bit undisciplined, and should be checked carefully.
|
33134 |
06-Feb-1998 |
eivind |
Back out DIAGNOSTIC changes.
|
33109 |
05-Feb-1998 |
dyson |
1) Start using a cleaner and more consistent page allocator instead of the various ad-hoc schemes. 2) When bringing in UPAGES, the pmap code needs to do another vm_page_lookup. 3) When appropriate, set the PG_A or PG_M bits a-priori to both avoid some processor errata, and to minimize redundant processor updating of page tables. 4) Modify pmap_protect so that it can only remove permissions (as it originally supported.) The additional capability is not needed. 5) Streamline read-only to read-write page mappings. 6) For pmap_copy_page, don't enable write mapping for source page. 7) Correct and clean-up pmap_incore. 8) Cluster initial kern_exec paging. 9) Removal of some minor lint from kern_malloc. 10) Correct some ioopt code. 11) Remove some dead code from the MI swapout routine. 12) Correct vm_object_deallocate (to remove backing_object ref.) 13) Fix dead object handling, that had problems under heavy memory load. 14) Add minor vm_page_lookup improvements. 15) Some pages are not in objects, so make sure that vm_page.c can properly support such pages. 16) Add some more page deficit handling. 17) Some minor code readability improvements.
|
33108 |
04-Feb-1998 |
eivind |
Turn DIAGNOSTIC into a new-style option.
|
33058 |
03-Feb-1998 |
bde |
Added #include of <sys/queue.h> so that this file is more "self"-sufficent.
|
33034 |
03-Feb-1998 |
dyson |
This fix should help the panic problems in -current. There were some errors in "interval" management. Due to the clustering mechanism, the code is necessarily complex and error prone.
|
32995 |
01-Feb-1998 |
bde |
Forward declare more structs that are used in prototypes here - don't depend on <sys/types.h> forward declaring common ones.
|
32952 |
01-Feb-1998 |
dyson |
Fix a performance problem caused by an earlier commit.
|
32946 |
31-Jan-1998 |
dyson |
contigalloc doesn't place the allocated page(s) into an object, and now this breaks vm_page_wire (due to wired page accounting per object.)
This should fix a problem as described by Donald Maddox.
|
32937 |
31-Jan-1998 |
dyson |
Change the busy page mgmt, so that when pages are freed, they MUST be PG_BUSY. It is bogus to free a page that isn't busy, because it is in a state of being "unavailable" when being freed. The additional advantage is that the page_remove code has a better cross-check that the page should be busy and unavailable for other use. There were some minor problems with the collapse code, and this plugs those subtle "holes."
Also, the vfs_bio code wasn't checking correctly for PG_BUSY pages. I am going to develop a more consistent scheme for grabbing pages, busy or otherwise. For now, we are stuck with the current morass.
|
32751 |
25-Jan-1998 |
eivind |
Turn NSWAPDEV into a new-style option.
|
32726 |
24-Jan-1998 |
eivind |
Make all file-system (MFS, FFS, NFS, LFS, DEVFS) related option new-style.
This introduces an xxxFS_BOOT for each of the rootable filesystems. (Presently not required, but encouraged to allow a smooth move of option *FS to opt_dontuse.h later.)
LFS is temporarily disabled, and will be re-enabled tomorrow.
|
32724 |
24-Jan-1998 |
dyson |
Add better support for larger I/O clusters, including larger physical I/O. The support is not mature yet, and some of the underlying implementation needs help. However, support does exist for IDE devices now.
|
32702 |
22-Jan-1998 |
dyson |
VM level code cleanups.
1) Start using TSM. Struct procs continue to point to upages structure, after being freed. Struct vmspace continues to point to pte object and kva space for kstack. u_map is now superfluous. 2) vm_map's don't need to be reference counted. They always exist either in the kernel or in a vmspace. The vmspaces are managed by reference counts. 3) Remove the "wired" vm_map nonsense. 4) No need to keep a cache of kernel stack kva's. 5) Get rid of strange looking ++var, and change to var++. 6) Change more data structures to use our "zone" allocator. Added struct proc, struct vmspace and struct vnode. This saves a significant amount of kva space and physical memory. Additionally, this enables TSM for the zone managed memory. 7) Keep ioopt disabled for now. 8) Remove the now bogus "single use" map concept. 9) Use generation counts or id's for data structures residing in TSM, where it allows us to avoid unneeded restart overhead during traversals, where blocking might occur. 10) Account better for memory deficits, so the pageout daemon will be able to make enough memory available (experimental.) 11) Fix some vnode locking problems. (From Tor, I think.) 12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp. (experimental.) 13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c code. Use generation counts, get rid of unneeded collapse operations, and clean up the cluster code. 14) Make vm_zone more suitable for TSM.
This commit is partially as a result of discussions and contributions from other people, including DG, Tor Egge, PHK, and probably others that I have forgotten to attribute (so let me know, if I forgot.)
This is not the infamous, final cleanup of the vnode stuff, but a necessary step. Vnode mgmt should be correct, but things might still change, and there is still some missing stuff (like ioopt, and physical backing of non-merged cache files, debugging of layering concepts.)
|
32670 |
21-Jan-1998 |
dyson |
Allow gdb to work again.
|
32585 |
17-Jan-1998 |
dyson |
Tie up some loose ends in vnode/object management. Remove an unneeded config option in pmap. Fix a problem with faulting in pages. Clean-up some loose ends in swap pager memory management.
The system should be much more stable, but all subtle bugs aren't fixed yet.
|
32454 |
12-Jan-1998 |
dyson |
Fix some vnode management problems, and better mgmt of vnode free list. Fix the UIO optimization code. Fix an assumption in vm_map_insert regarding allocation of swap pagers. Fix an spl problem in the collapse handling in vm_object_deallocate. When pages are freed from vnode objects, and the criteria for putting the associated vnode onto the free list is reached, either put the vnode onto the list, or put it onto an interrupt safe version of the list, for further transfer onto the actual free list. Some minor syntax changes changing pre-decs, pre-incs to post versions. Remove a bogus timeout (that I added for debugging) from vn_lock.
PHK will likely still have problems with the vnode list management, and so do I, but it is better than it was.
|
32305 |
07-Jan-1998 |
dyson |
Turn off the VTEXT flag when an object is no longer referenced, so that an executable that is no longer running can be written to. Also, clear the OBJ_OPT flag more often, when appropriate.
|
32286 |
06-Jan-1998 |
dyson |
Make our v_usecount vnode reference count work identically to the original BSD code. The association between the vnode and the vm_object no longer includes reference counts. The major difference is that vm_object's are no longer freed gratuitously from the vnode, and so once an object is created for the vnode, it will last as long as the vnode does.
When a vnode object reference count is incremented, then the underlying vnode reference count is incremented also. The two "objects" are now more intimately related, and so the interactions are now much less complex.
Vnodes are now normally placed onto the free queue with an object still attached. The rundown of the object happens at vnode rundown time, and happens with exactly the same filesystem semantics as the original VFS code. There is absolutely no need for vnode_pager_uncache and other travesties like that anymore.
A side-effect of these changes is that SMP locking should be much simpler, the I/O copyin/copyout optimizations work, NFS should be more ponderable, and further work on layered filesystems should be less frustrating, because of the totally coherent management of the vnode objects and vnodes.
Please be careful with your system while running this code, but I would greatly appreciate feedback as soon a reasonably possible.
|
32132 |
31-Dec-1997 |
alex |
caddr_t --> void *
|
32072 |
29-Dec-1997 |
dyson |
Fix the decl of vfs_ioopt, allow LFS to compile again, fix a minor problem with the object cache removal.
|
32071 |
29-Dec-1997 |
dyson |
Lots of improvements, including restructring the caching and management of vnodes and objects. There are some metadata performance improvements that come along with this. There are also a few prototypes added when the need is noticed. Changes include:
1) Cleaning up vref, vget. 2) Removal of the object cache. 3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore. 4) Correct some missing LK_RETRY's in vn_lock. 5) Correct the page range in the code for msync.
Be gentle, and please give me feedback asap.
|
31991 |
25-Dec-1997 |
dyson |
The ioopt code is still buggy, but wasn't fully disabled.
|
31970 |
24-Dec-1997 |
dyson |
Support running with inadequate swap space. Additionally, the code will complain with a suggestion of increasing it.
|
31935 |
22-Dec-1997 |
dyson |
Improve my copyright.
|
31857 |
19-Dec-1997 |
dyson |
Change bogus usage of btoc to atop. The incorrect usage of btoc was pointed out by bde.
|
31853 |
19-Dec-1997 |
dyson |
Some performance improvements, and code cleanups (including changing our expensive OFF_TO_IDX to btoc whenever possible.)
|
31778 |
16-Dec-1997 |
eivind |
Make COMPAT_43 and COMPAT_SUNOS new-style options.
|
31729 |
15-Dec-1997 |
dyson |
Fix a recursive kernel_map lock problem in vm_zone allocator. PR: 5298
|
31712 |
14-Dec-1997 |
dyson |
Slight improvement to the vm_zone stats output. Also, some other superficial cleanups.
|
31709 |
14-Dec-1997 |
dyson |
After one of my analysis passes to evaluate methods for SMP TLB mgmt, I noticed some major enhancements available for UP situations. The number of UP TLB flushes is decreased very significantly with these changes. Since a TLB flush appears to cost minimally approx 80 cycles, this is a "nice" enhancement, equivalent to eliminating between 40 and 160 instructions per TLB flush.
Changes include making sure that kernel threads all use the same PTD, and eliminate unneeded PTD switches at context switch time.
|
31667 |
11-Dec-1997 |
dyson |
Fix the prototype for swapout_procs(); Submitted by: dima@best.net
|
31563 |
06-Dec-1997 |
dyson |
Support an optional, sysctl enabled feature of idle process swapout. This is apparently useful for large shell systems, or systems with long running idle processes. To enable the feature:
sysctl -w vm.swap_idle_enabled=1
Please note that some of the other vm sysctl variables have been renamed to be more accurate. Submitted by: Much of it from Matt Dillon <dillon@best.net>
|
31561 |
05-Dec-1997 |
bde |
Don't include <sys/lock.h> in headers when only `struct simplelock' is required. Fixed everything that depended on the pollution.
|
31550 |
05-Dec-1997 |
dyson |
Add new (very useful) tunable for pageout daemon. The flag changes the maximum pageout rate:
sysctl -w vm.vm_maxlaunder=n
1 < n < inf.
If paging heavily on large systems, it is likely that a performance improvement can be achieved by increasing the parameter. On a large system, the default is 32, but numbers as large as 128 can make a big difference. If paging is expensive, you might try decreasing the number to 1-8.
|
31542 |
04-Dec-1997 |
dyson |
Support applications that need to resist or deny use of swap space.
sysctl -w vm.defer_swap_pageouts=1 Causes the system to resist the use of swap space. In low memory conditions, performance will decrease. sysctl -w vm.disable_swap_pageouts=1 Causes the system to mostly disable the use of swap space. In low memory conditions, the system will likely start killing processes.
|
31493 |
02-Dec-1997 |
phk |
In all such uses of struct buf: 's/b_un.b_addr/b_data/g'
|
31393 |
24-Nov-1997 |
bde |
Removed all traces of P_IDLEPROC. It was tested but never set.
|
31392 |
24-Nov-1997 |
bde |
Don't #define max() to get a version that works with vm_ooffset's. Just use qmax().
This should be fixed more generally using overloaded functions.
|
31252 |
18-Nov-1997 |
bde |
Removed unused #include of <sys/malloc.h>. This file now uses only zalloc(). Many more cases like this are probably obscured by not including <vm/zone.h> explicitly (it is spammed into <sys/malloc.h>).
|
31175 |
14-Nov-1997 |
tegge |
Simplify map entries during user page wire and user page unwire operations in vm_map_user_pageable().
Check return value of vm_map_lock_upgrade() during a user page wire operation.
|
31017 |
07-Nov-1997 |
phk |
Rename some local variables to avoid shadowing other local variables.
Found by: -Wshadow
|
31016 |
07-Nov-1997 |
phk |
Remove a bunch of variables which were unused both in GENERIC and LINT.
Found by: -Wunused
|
30994 |
06-Nov-1997 |
phk |
Move the "retval" (3rd) parameter from all syscall functions and put it in struct proc instead.
This fixes a boatload of compiler warnings, and removes a lot of cruft from the sources.
I have not removed the /*ARGSUSED*/, they will require some looking at.
libkvm, ps and other userland struct proc frobbing programs will need to be recompiled.
|
30989 |
06-Nov-1997 |
dyson |
Fix the "missing page" problem. Also, improve the performance of page allocation in common cases.
|
30813 |
28-Oct-1997 |
bde |
Removed unused #includes.
|
30701 |
25-Oct-1997 |
dyson |
Support garbage collecting the pmap pv entries. The management doesn't happen until the system would have nearly failed anyway, so no significant overhead is added. This helps large systems with lots of processes.
|
30700 |
24-Oct-1997 |
dyson |
Decrease the initial allocation for the zone allocations.
|
30354 |
12-Oct-1997 |
phk |
Last major round (unless Bruce thinks of something :-) of malloc changes.
Distribute all but the most fundamental malloc types. This time I also remembered the trick to making things static: Put "static" in front of them.
A couple of finer points by: bde
|
30309 |
11-Oct-1997 |
phk |
Distribute and staticize a lot of the malloc M_* types.
Substantial input from: bde
|
30297 |
11-Oct-1997 |
peter |
Attempt to fix the previous fix to the contigmalloc1 prototype. struct malloc_type isn't defined in all cases (eg: from ddb), and the line wrapping was very badly mangled.
|
30286 |
10-Oct-1997 |
phk |
Fix contigmalloc() and contigmalloc1() arguments.
|
30139 |
06-Oct-1997 |
dyson |
Improve management of pages moving from the inactive to active queue. Additionally, add some much needed comments.
|
30137 |
06-Oct-1997 |
dyson |
Relax the vnode locking for read only operations.
|
29657 |
21-Sep-1997 |
peter |
Fix some style(9) and formatting problems. tabsize 4 formatting doesn't look too great with 'more' etc.
Approved by: dyson (with a minor grumble :-)
|
29653 |
21-Sep-1997 |
dyson |
Change the M_NAMEI allocations to use the zone allocator. This change plus the previous changes to use the zone allocator decrease the usage of malloc by half. The Zone allocator will be upgradeable to be able to use per CPU-pools, and has more intelligent usage of SPLs. Additionally, it has reasonable stats gathering capabilities, while making most calls inline.
|
29368 |
14-Sep-1997 |
peter |
Update select -> poll in drivers.
|
29324 |
13-Sep-1997 |
peter |
Print correct function name in panics
|
29316 |
12-Sep-1997 |
jlemon |
Do not consider VM_PROT_OVERRIDE_WRITE to be part of the protection entry when handling a fault. This is set by procfs whenever it wants to write to a page, as a means of overriding `r-x COW' entries, but causes failures in the `rwx' case.
Submitted by: bde
|
29208 |
07-Sep-1997 |
bde |
Removed yet more vestiges of config-time swap configuration and/or cleaned up nearby cruft.
|
28992 |
01-Sep-1997 |
bde |
Removed unused #includes.
|
28991 |
01-Sep-1997 |
bde |
Some staticized variables were still declared to be extern.
|
28990 |
01-Sep-1997 |
bde |
Print a device number in hex instead of decimal.
|
28954 |
31-Aug-1997 |
phk |
Change the 0xdeadb hack to a flag called VDOOMED. Introduce VFREE which indicates that vnode is on freelist. Rename vholdrele() to vdrop(). Create vfree() and vbusy() to add/delete vnode from freelist. Add vfree()/vbusy() to keep (v_holdcnt != 0 || v_usecount != 0) vnodes off the freelist. Generalize vhold()/v_holdcnt to mean "do not recycle". Fix reassignbuf()s lack of use of vhold(). Use vhold() instead of checking v_cache_src list. Remove vtouch(), the vnodes are always vget'ed soon enough after for it to have any measurable effect. Add sysctl debug.freevnodes to keep track of things. Move cache_purge() up in getnewvnodes to avoid race. Decrement v_usecount after VOP_INACTIVE(), put a vhold() on it during VOP_INACTIVE() Unmacroize vhold()/vdrop() Print out VDOOMED and VFREE flags (XXX: should use %b)
Reviewed by: dyson
|
28940 |
30-Aug-1997 |
peter |
Allow non-page-aligned file offset mmap's, provided that the system is allowed to choose the address, or that the MAP_FIXED address has the same remainder modulo PAGE_SIZE as the file offset. Apparently this is posix1003.1b specified behavior. SVR4 and the other *BSD's allow it too. It costs us nothing to support and means we don't get EINVAL on some mmap code that works perfectly elsewhere.
Obtained from: NetBSD
|
28751 |
25-Aug-1997 |
bde |
Fixed type mismatches for functions with args of type vm_prot_t and/or vm_inherit_t. These types are smaller than ints, so the prototypes should have used the promoted type (int) to match the old-style function definitions. They use just vm_prot_t and/or vm_inherit_t. This depends on gcc features to work. I fixed the definitions since this is easiest. The correct fix may be to change the small types to u_int, to optimize for time instead of space.
|
28558 |
22-Aug-1997 |
dyson |
This is a trial improvement for the vnode reference count while on the vnode free list problem. Also, the vnode age flag is no longer used by the vnode pager. (It is actually incorrect to use them.) Constructive feedback welcome -- just be kind.
|
28551 |
21-Aug-1997 |
bde |
#include <machine/limits.h> explicitly in the few places that it is required.
|
28349 |
18-Aug-1997 |
fsmp |
Added includes of smp.h for SMP. This eliminates a bazillion warnings about implicit s_lock & friends.
|
28345 |
18-Aug-1997 |
dyson |
Fix kern_lock so that it will work. Additionally, clean up some of the VM system's usage of the kernel lock (lockmgr) code. This is a first pass implementation, and is expected to evolve as needed. The API for the lock manager code has not changed, but the underlying implementation has changed significantly. This change should not materially affect our current SMP or UP code without non-standard parameters being used.
|
28028 |
10-Aug-1997 |
dyson |
The "cutesy" register parameter passing that I had mistakenly used breaks profiling. Since it doesn't really improve perf much, I have backed it out.
|
27947 |
07-Aug-1997 |
dyson |
More vm_zone cleanup. The sysctl now accounts for items better, and counts the number of allocations.
|
27930 |
06-Aug-1997 |
dyson |
Add exposure of some vm_zone allocation stats by sysctl. Also, change the initialization parameters of some zones in VM map. This contains only optimizations and not bugfixes.
|
27924 |
05-Aug-1997 |
dyson |
Fixed the commit botch that was causing crashes soon after system startup. Due to the error, the initialization of the zone for pv_entries was missing. The system should be usable again.
|
27923 |
05-Aug-1997 |
dyson |
Another attempt at cleaning up the new memory allocator.
|
27922 |
05-Aug-1997 |
dyson |
Fix some bugs, document vm_zone better. Add copyright to vm_zone.h. Use the new zone code in pmap.c so that we can get rid of the ugly ad-hoc allocations in pmap.c.
|
27905 |
05-Aug-1997 |
dyson |
Modify pmap to use our new memory allocator. Also, change the vm_map_entry allocations to be interrupt safe.
|
27901 |
05-Aug-1997 |
dyson |
A very simple zone allocator.
|
27899 |
05-Aug-1997 |
dyson |
Get rid of the ad-hoc memory allocator for vm_map_entries, in lieu of a simple, clean zone type allocator. This new allocator will also be used for machine dependent pmap PV entries.
|
27845 |
02-Aug-1997 |
bde |
Removed unused #includes.
|
27716 |
27-Jul-1997 |
dyson |
Add the ability for the pageout daemon to measure stats on memory usage before the system is out of memory. The daemon does a minimal amount of work that increases as the system becomes more likely to run out of memory and page in/out.
The default tuning is fairly low in background CPU usage, and sysctl variables have been added to enable flexable operation. This is an experimental feature that will likely be changed and improved over time.
|
27715 |
27-Jul-1997 |
dyson |
Fix a very subtle problem that causes unnecessary numbers of objects backing a single logical object. Submitted by: Alan Cox <alc@cs.rice.edu>
|
27464 |
17-Jul-1997 |
dyson |
Add support for 4MB pages. This includes the .text, .data, .data parts of the kernel, and also most of the dynamic parts of the kernel. Additionally, 4MB pages will be allocated for display buffers as appropriate (only.)
The 4MB support for SMP isn't complete, but doesn't interfere with operation either.
|
26851 |
23-Jun-1997 |
tegge |
Don't try upgrading an existing exclusive lock in vm_map_user_pageable. This should close PR kern/3180. Also remove a bogus unconditional call to vm_map_unlock_read in vm_map_lookup.
|
26811 |
22-Jun-1997 |
peter |
Kill some stale leftovers from the earlier attempts at SMP per-cpu pages
|
26780 |
22-Jun-1997 |
dyson |
Remove a window during the rundown of a file vnode. Also, the OBJ_DEAD flag wasn't being respected during vref(), et al. Note that this isn't the eventual fix for the locking problem. Fine grained SMP in the VM and VFS code will require (lots) more work.
|
26668 |
15-Jun-1997 |
dyson |
Correct the return code for the mlock system call. Also add the stubs for mlockall and munlockall.
|
26667 |
15-Jun-1997 |
dyson |
Fix a reference problem with maps. Only appears to manifest itself when sharing address spaces.
|
26258 |
29-May-1997 |
peter |
Update the #include "opt_smpxxx.h" includes - opt_smp.h isn't needed very much in the generic parts of the kernel now.
|
25930 |
19-May-1997 |
dfr |
Fix a few bugs with NFS and mmap caused by NFS' use of b_validoff and b_validend. The changes to vfs_bio.c are a bit ugly but hopefully can be tidied up later by a slight redesign.
PR: kern/2573, kern/2754, kern/3046 (possibly) Reviewed by: dyson
|
25352 |
01-May-1997 |
dyson |
Check the correct queue for waking up the pageout daemon. Specifically, the pageout daemon wasn't always being woken up appropriately when the (cache + free) queues were depleted. Submitted by: David S. Miller <davem@jenolan.rutgers.edu>
|
25164 |
26-Apr-1997 |
peter |
Man the liferafts! Here comes the long awaited SMP -> -current merge!
There are various options documented in i386/conf/LINT, there is more to come over the next few days.
The kernel should run pretty much "as before" without the options to activate SMP mode.
There are a handful of known "loose ends" that need to be fixed, but have been put off since the SMP kernel is in a moderately good condition at the moment.
This commit is the result of the tinkering and testing over the last 14 months by many people. A special thanks to Steve Passe for implementing the APIC code!
|
25074 |
21-Apr-1997 |
peter |
Send this to the Attic so there's no mixups over which kern_lock.c is in use in -current.
|
24917 |
14-Apr-1997 |
peter |
Unused variable (upobj is now purely handled within pmap)
|
24848 |
13-Apr-1997 |
dyson |
Fully implement vfork. Vfork is now much much faster than even our fork. (On my machine, fork is about 240usecs, vfork is 78usecs.)
Implement rfork(!RFPROC !RFMEM), which allows a thread to divorce its memory from the other threads of a group.
Implement rfork(!RFPROC RFCFDG), which closes all file descriptors, eliminating possible existing shares with other threads/processes.
Implement rfork(!RFPROC RFFDG), which divorces the file descriptors for a thread from the rest of the group.
Fix the case where a thread does an exec. It is almost nonsense for a thread to modify the other threads' address space by an exec, so we now automatically divorce the address space before modifying it.
|
24691 |
07-Apr-1997 |
peter |
The biggie: Get rid of the UPAGES from the top of the per-process address space. (!)
Have each process use the kernel stack and pcb in the kvm space. Since the stacks are at a different address, we cannot copy the stack at fork() and allow the child to return up through the function call tree to return to user mode; instead, create a new execution context and have the new process begin executing from cpu_switch() and go to user mode directly. In theory this should speed up fork a bit.
Context switch the tss_esp0 pointer in the common tss. This is a lot simpler than switching the gdt[GPROC0_SEL].sd.sd_base pointer to each process's tss, since the esp0 pointer is a 32 bit pointer, and the sd_base setting is split into three different bit sections at non-aligned boundaries and requires a lot of twiddling to reset.
The 8K of memory at the top of the process space is now empty, and unmapped (and unmappable, it's higher than VM_MAXUSER_ADDRESS).
Simplify the pmap code to manage process contexts; we no longer have to double map the UPAGES, which simplifies and should measurably speed up fork().
The following parts came from John Dyson:
Set PG_G on the UPAGES that are now in kernel context, and invalidate them when swapping them out.
Move the upages object (upobj) from the vmspace to the proc structure.
Now that the UPAGES (pcb and kernel stack) are out of user space, make rfork(..RFMEM..) do what was intended by sharing the vmspace entirely via reference counting rather than simply inheriting the mappings.
|
24678 |
06-Apr-1997 |
peter |
Commit a typo fix that's been sitting in my tree for ages, quite forgotten. The typo was detected once upon a time with the -Wunused compile option. The result was that a block of code for implementing madvise(.. MADV_SEQUENTIAL..) behavior was "dead" and unused, probably negating the effect of activating the option.
Reviewed by: dyson
|
24668 |
06-Apr-1997 |
dyson |
Make vm_map_protect be more complete about map simplification. This is useful when a process changes its page range protections very much. Submitted by: Alan Cox <alc@cs.rice.edu>
|
24667 |
06-Apr-1997 |
dyson |
Correction to the prototype for vm_fault.
|
24666 |
06-Apr-1997 |
dyson |
Fix the gdb executable modify problem. Thanks to the detective work by Alan Cox <alc@cs.rice.edu>, and his description of the problem.
The bug was primarily in procfs_mem, but the mistake likely happened due to the lack of vm system support for the operation. I added better support for selective marking of page dirty flags so that vm_map_pageable(wiring) will not cause this problem again.
The code in procfs_mem is now less bogus (but maybe still a little so.)
|
24478 |
01-Apr-1997 |
bde |
Removed potentially harmful garbage <vm/lock.h> and fixed bogus use of it. It was actually harmless because the use was null due to fortuitous include orders and identical (wrong) idempotency macros.
|
24437 |
31-Mar-1997 |
dg |
Changed the way that the exec image header is read to be filesystem- centric rather than VM-centric to fix a problem with errors not being detectable when the header is read. Killed exech_map as a result of these changes. There appears to be no performance difference with this change.
|
24131 |
23-Mar-1997 |
bde |
Don't #include <sys/fcntl.h> in <sys/file.h> if KERNEL is defined. Fixed everything that depended on getting fcntl.h stuff from the wrong place. Most things don't depend on file.h stuff at all.
|
24130 |
23-Mar-1997 |
dyson |
Fix a significant error in the accounting for pre-zeroed pages. This is a candidate for RELENG_2_2...
|
23502 |
08-Mar-1997 |
dyson |
When IN_RECURSE support was removed during the Lite/2 merge, read/write to/from mmap'ed regions broke. This commit fixes the breakage, and uses the new Lite/2 locking mechanisms.
|
23157 |
27-Feb-1997 |
bde |
Removed a wrong LK_INTERLOCK flag.
|
22975 |
22-Feb-1997 |
peter |
Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.
|
22878 |
18-Feb-1997 |
bde |
Removed vestiges of Mach lock types.
vm_map.h: Removed #include of <sys/proc.h>. curproc is only used in some macros and users of the macros already include <sys/proc.h>.
|
22670 |
13-Feb-1997 |
wollman |
Provide an alternative interface to contigmalloc() which allows a specific map to be used when allocating the kernel va (e.g., mb_map). The VM gurus may want to look this over.
|
22521 |
10-Feb-1997 |
dyson |
This is the kernel Lite/2 commit. There are some requisite userland changes, so don't expect to be able to run the kernel as-is (very well) without the appropriate Lite/2 userland changes.
The system boots and can mount UFS filesystems.
Untested: ext2fs, msdosfs, NFS Known problems: Incorrect Berkeley ID strings in some files. Mount_std mounts will not work until the getfsent library routine is changed.
Reviewed by: various people Submitted by: Jeffery Hsu <hsu@freebsd.org>
|
22156 |
31-Jan-1997 |
dyson |
Another fix to inheriting shared segments. Do the copy on write thing if needed. Submitted by: Alan Cox <alc@cs.rice.edu>
|
21987 |
24-Jan-1997 |
dg |
Added a check/panic for v_usecount being 0 (no vnode reference) in vnode_pager_alloc().
|
21940 |
22-Jan-1997 |
dyson |
Fix two problems where a NULL object is dereferenced. One problem was in the VM_INHERIT_SHARE case of vmspace_fork, and also in vm_map_madvise. Submitted by: Alan Cox <alc@cs.rice.edu>
|
21881 |
20-Jan-1997 |
dyson |
Make MADV_FREE work better. Specifically, it did not wait for the page to be unbusy, and it caused some algorithmic problems as a result. There were some other problems with it also, so this is a general cleanup of the code. Submitted by: Douglas Crosher <dtc@scrooge.ee.swin.oz.au> and myself.
|
21754 |
16-Jan-1997 |
dyson |
Change the map entry flags from bitfields to bitmasks. Allows for some code simplification.
|
21737 |
15-Jan-1997 |
dg |
Fix bug related to map entry allocations where a sleep might be attempted when allocating memory for network buffers at interrupt time. This is due to inadequate checking for the new mcl_map. Fixed by merging mb_map and mcl_map into a single mb_map.
Reviewed by: wollman
|
21733 |
15-Jan-1997 |
bde |
Removed redundant spl0()'s from kernel processes. They were work-arounds for a bug in fork().
|
21673 |
14-Jan-1997 |
jkh |
Make the long-awaited change from $Id$ to $FreeBSD$
This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long.
Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
|
21530 |
11-Jan-1997 |
dyson |
Slightly correct the code that moves pages from the active to the inactive queue. This is only a minor performance improvement, but will not affect perf on machines that don't have ref bits.
|
21529 |
11-Jan-1997 |
dyson |
Prepare better for multi-platform by eliminating another required pmap routine (pmap_is_referenced.) Upper level recoded to use pmap_ts_referenced.
|
21258 |
03-Jan-1997 |
dyson |
Undo the collapse breakage (swap space usage problem.)
|
21157 |
01-Jan-1997 |
dyson |
Guess what? We left a lot of the old collapse code that is not needed anymore with the "full" collapse fix that we added about 1yr ago!!! The code has been removed by optioning it out for now, so we can put it back in ASAP if any problems are found.
|
21134 |
31-Dec-1996 |
dyson |
A very significant improvement in the management of process maps and objects. Previously, "fancy" memory management techniques such as that used by the M3 RTS would have the tendency of chopping up a process's allocated memory into lots of little objects. Alan has come up with some improvements to mitigate the situation to the point where even the M3 RTS only has one object for bss and its managed memory (when running CVSUP.) (There are still cases where the situation isn't improved when the system pages -- but this is much much better for the vast majority of cases.) The system will now be able to much more effectively merge map entries.
Submitted by: Alan Cox <alc@cs.rice.edu>
|
21039 |
30-Dec-1996 |
dyson |
Let the VM system know that on certain arch's that VM_PROT_READ also implies VM_PROT_EXEC. We support it that way for now, since the break system call by default gives VM_PROT_ALL. Now we have a better chance of coalescing map entries when mixing mmap/break type operations. This was contributing to excessive numbers of map entries on the modula-3 runtime system. The problem is still not "solved", but the situation makes more sense.
Eventually, when we work on architectures where VM_PROT_READ is orthogonal to VM_PROT_EXEC, we will have to visit this issue carefully (esp. regarding security issues.)
|
21037 |
30-Dec-1996 |
dyson |
EEEK!!! useracc and kernacc didn't lock their respective maps. Additionally, eliminate the map->hint distortion associated with useracc. That may/may-not be the "right" thing to do -- but time will tell. Submitted by: Partially by Alan Cox <alc@cs.rice.edu>
|
20999 |
29-Dec-1996 |
dyson |
Superficial cleanup of comment.
|
20993 |
28-Dec-1996 |
dyson |
Eliminate the redundancy due to the similarity between the routines vm_map_simplify and vm_map_simplify_entry. Make vm_map_simplify_entry handle wired maps so that we can get rid of vm_map_simplify. Modify the callers of vm_map_simplify to properly use vm_map_simplify_entry. Submitted by: Alan Cox <alc@cs.rice.edu>
|
20991 |
28-Dec-1996 |
dyson |
The code unnecessarily created an object with no handle up-front, which has the negative effect of disabling some map optimizations. This patch defers the creation of the object until it needs to be at fault time. Submitted by: Alan Cox <alc@cs.rice.edu>
|
20821 |
22-Dec-1996 |
joerg |
Make DFLDSIZ and MAXDSIZ fully-supported options.
"Don't forget to do a ``make depend''" :-)
|
20449 |
14-Dec-1996 |
dyson |
Implement closer-to-POSIX mlock semantics. The major difference is that we do allow mlock to span unallocated regions (of course, not mlocking them.) We also allow mlocking of RO regions (which the old code couldn't.) The restriction there is that once a RO region is wired (mlocked), it cannot be debugged (or EVER written to.)
Under normal usage, the new mlock code will be a significant improvement over our old stuff.
|
20189 |
07-Dec-1996 |
dyson |
Expunge inlines...
|
20187 |
07-Dec-1996 |
dyson |
Fix a map entry leak problem found by DG. Also, de-inline a function vm_map_entry_dispose, because it won't help being inlined.
|
20182 |
07-Dec-1996 |
dyson |
Make vm_map_insert much more intelligent in the MAP_NOFAULT case so that map entries are coalesced when appropriate. Also, conditionalize some code that is currently not used in vm_map_insert. This mod has been added to eliminate unnecessary map entries in buffer map.
Additionally, there were some cases where map coalescing could be done when it shouldn't. That problem has been resolved.
|
20054 |
30-Nov-1996 |
dyson |
Implement a new totally dynamic (up to MAXPHYS) buffer kva allocation scheme. Additionally, add the capability for checking for unexpected kernel page faults. The maximum amount of kva space for buffers hasn't been decreased from where it is, but it will now be possible to do so.
This scheme manages the kva space similar to the buffers themselves. If there isn't enough kva space because of usage or fragmentation, buffers will be reclaimed until a buffer allocation is successful. This scheme should be very resistant to fragmentation problems until/if the LFS code is fixed and uses the bogus buffer locking scheme -- but a 'fixed' LFS is not likely to use such a scheme.
Now there should be NO problem allocating buffers up to MAXPHYS.
|
20007 |
28-Nov-1996 |
dyson |
Make the kernel smaller with at worst a neutral effect on perf by de-inlining some VM calls. (Actually, I measured a small improvement.)
|
19830 |
17-Nov-1996 |
dyson |
Improve the locality of reference for variables in vm_page and vm_kern by moving them from .bss to .data. With this change, there is a measurable perf improvement in fork/exec.
|
19415 |
05-Nov-1996 |
dyson |
Vastly improved contigmalloc routine. It does not solve the problem of allocating contiguous buffer memory in general, but makes it much more likely to work at boot-up time. The best chance for an LKM-type load of a sound driver is immediately after the mount of the root filesystem.
This appears to work for a 64K allocation on an 8MB system.
|
19259 |
29-Oct-1996 |
dyson |
Change mmap to use OBJT_DEFAULT instead of OBJT_SWAP by default for anonymous objects. The system will automatically change the type to SWAP if needed (for size or pageout reasons.)
|
19216 |
27-Oct-1996 |
phk |
The way we get a vnode for swapdev is not quite kosher. In particular, it breaks in the DEVFS_ROOT case. Replicate a bit too much of bdevvp() in here to circumvent the problem. The real problem is the magic that lives in bdevsw[1].
|
19142 |
24-Oct-1996 |
dyson |
Remove a bogus optimization in the mmap code. It is superfluous, and at best is the same speed as the unoptimized code. At worst, it slows down trivial programs.
|
18974 |
17-Oct-1996 |
dyson |
Make processes that have been woken up eligible for immediate swap-in.
|
18973 |
17-Oct-1996 |
dyson |
Clean up the rundown of the object backing a vnode. This should fix NFS problems associated with forcible dismounts.
|
18942 |
15-Oct-1996 |
bde |
Removed nested include of <sys/proc.h> from <vm/vm_object.h> and fixed the one place that depended on it. wakeup() is now prototyped in <sys/systm.h> so that it is normally visible.
Added nested include of <sys/queue.h> in <vm/vm_object.h>. The queue macros are a more fundamental prerequisite for <vm/vm_object.h> than the wakeup prototype and previously happened to be included by namespace pollution from <sys/proc.h> or elsewhere.
|
18937 |
15-Oct-1996 |
dyson |
Move much of the machine dependent code from vm_glue.c into pmap.c. Along with the improved organization, small proc fork performance is now about 5%-10% faster.
|
18908 |
13-Oct-1996 |
phk |
Remove a stale comment.
|
18893 |
12-Oct-1996 |
bde |
Removed __pure's and __pure2's. __pure is a no-op for recent versions of gcc by definition, and __pure2 is a no-op in effect (presumably the compiler can see when an inline function has no side effects).
|
18779 |
06-Oct-1996 |
dyson |
Make the default cache size optimization 256K; the old default was 64K. The change has essentially neutral effect on those machines with little or no cache, and has a positive effect on "normal" machines with 256K or more cache.
|
18768 |
06-Oct-1996 |
dyson |
Fix a problem in the page coloring code where the system would not always be able to use all of the free pages. This can manifest as a panic using DIAGNOSTIC, or as a panic on an indirect memory reference.
|
18542 |
28-Sep-1996 |
bde |
Fixed undeclared variables for the !(PQ_L2_SIZE > 1) case.
Removed redundant #include.
|
18526 |
28-Sep-1996 |
dyson |
Reviewed by: Submitted by: Obtained from:
|
18389 |
19-Sep-1996 |
dg |
Fixed bug with reversed trunc/round_page() in madvise... start must be truncated, end must be rounded.
|
18307 |
15-Sep-1996 |
bde |
Removed iprintf(). It was copied to db_iprintf() in ddb.
|
18298 |
14-Sep-1996 |
bde |
Attached vm ddb commands `show map', `show vmochk', `show object', `show vmopag', `show page' and `show pageq'. Moved all vm ddb stuff to the ends of the vm source files.
Changed printf() to db_printf(), `indent' to db_indent, and iprintf() to db_iprintf() in ddb commands. Moved db_indent and db_iprintf() from vm to ddb.
vm_page.c: Don't use __pure. Staticized.
db_output.c: Reduced page width from 80 to 79 to inhibit double spacing for long lines (there are still some problems if words are printed across column 79).
|
18205 |
10-Sep-1996 |
dyson |
The whole issue of not supporting VOP_LOCK for VBLK devices should be rethought. This fixes YET another problem with unmounting filesystems. The root cause is not fixed here, but at least the problem has gone away.
|
18178 |
08-Sep-1996 |
dyson |
Fixed the use of the wrong variable in vm_map_madvise.
|
18169 |
08-Sep-1996 |
dyson |
Addition of page coloring support. Various levels of coloring are afforded. The default level works with minimal overhead, but one can also enable full, efficient use of a 512K cache. (Parameters can be generated to support arbitrary cache sizes also.)
|
18163 |
08-Sep-1996 |
dyson |
Improve the scalability of certain pmap operations.
|
17761 |
21-Aug-1996 |
dyson |
Even though this looks like it, this is not a complex code change. The interface into the "VMIO" system has changed to be more consistent and robust. Essentially, it is now no longer necessary to call vn_open to get merged VM/Buffer cache operation, and exceptional conditions such as merged operation of VBLK devices are handled more simply and correctly.
This code corrects a potentially large set of problems including the problems with ktrace output and loaded systems, file create/deletes, etc.
Most of the changes to NFS are cosmetic and name changes, eliminating a layer of subroutine calls. The direct calls to vput/vrele have been re-instituted for better cross platform compatibility.
Reviewed by: davidg
|
17334 |
30-Jul-1996 |
dyson |
Backed out the recent changes/enhancements to the VM code. The problem with the 'shell scripts' was found, but there was a 'strange' problem found with a 486 laptop that we could not find. This commit backs the code back to 25-jul, and will be re-entered after the snapshot in smaller (more easily tested) chunks.
|
17313 |
28-Jul-1996 |
dg |
Slight performance tweak for previous commit.
|
17312 |
28-Jul-1996 |
dyson |
Undo part of the scalability commit. Many of the changes in vm_fault had some performance enhancements not ready for prime time. This commit backs out some of the changes.
|
17301 |
27-Jul-1996 |
dyson |
Allow sequentially created mmap'ed anonymous regions to coalesce. There is little or no reason to create a swap pager for small mmap's. The vm_map_insert code will automatically create a swap pager if the object becomes too large. This fix, per a request from phk.
|
17298 |
27-Jul-1996 |
dyson |
Clean up some lint.
|
17297 |
27-Jul-1996 |
dyson |
Remove experimental header file. My test-build must have picked it up in an unexpected place. Submitted by: jkh
|
17295 |
27-Jul-1996 |
dyson |
Missing (prototype) change from the previous commit.
|
17294 |
27-Jul-1996 |
dyson |
This commit is meant to solve a couple of VM system problems or performance issues.
1) The pmap module has had too many inlines, and so the object file is simply bigger than it needs to be. Some common code is also merged into subroutines.
2) Removal of some *evil* PHYS_TO_VM_PAGE macro calls. Unfortunately, a few have needed to be added also. The removal caused the need for more vm_page_lookups. I added lookup hints to minimize the need for the page table lookup operations.
3) Removal of some bogus performance improvements that mostly made the code more complex (tracking individual page table page updates unnecessarily). Those improvements actually hurt 386 performance (not that people who worry about perf use 386 processors anymore :-)).
4) Changed pv queue manipulations/structures to be TAILQ's.
5) The pv queue code has had some performance problems since day one. Some significant scalability issues are resolved by threading the pv entries from the pmap AND the physical address instead of just the physical address. This makes certain pmap operations run much faster. This does not affect most micro-benchmarks, but should help loaded system performance *significantly*. DG helped and came up with most of the solution for this one.
6) Most if not all pmap bit operations follow the pattern: pmap_test_bit(); pmap_clear_bit(). That made for twice the necessary pv list traversal. The pmap interface now supports only pmap_tc_bit type operations: pmap_[test/clear]_modified, pmap_[test/clear]_referenced. Additionally, the modified routine now takes a vm_page_t arg instead of a phys address. This eliminates a PHYS_TO_VM_PAGE operation.
7) Several rewrites of routines that contain redundant code to use common routines, so that there is a greater likelihood of keeping the cache footprint smaller.
|
17108 |
12-Jul-1996 |
bde |
Don't use NULL in non-pointer contexts.
|
17004 |
08-Jul-1996 |
dyson |
Back-off on the previous commit, specifically remove the look-ahead optimization on the active queue scan. I will do this correctly later.
|
17003 |
08-Jul-1996 |
dyson |
Fix a problem with the pageout daemon RSS limiting, where it degrades performance to LRU or worse when RSS limiting takes effect. Also, make an end condition in the active queue scan more efficient in the case where pages are removed from the active queue as a side effect of a pmap operation.
|
16993 |
07-Jul-1996 |
dg |
In all special cases for spl or page_alloc where kmem_map is checked for, mb_map (a submap of kmem_map) must also be checked. Thanks to wcarchive (err...sort of) for demonstrating this bug.
|
16892 |
02-Jul-1996 |
dyson |
Properly set the PG_MAPPED and PG_WRITEABLE flags. This fixes some potential problems with vm_map_remove/vm_map_delete.
|
16858 |
30-Jun-1996 |
dyson |
Make -current consistent with -stable regarding the time that a process sleeps before being swapped out. The time is increased from 4 secs to 10 secs. Originally I had decreased it from 20 to 4, but that is a bit severe. 20 is too long, though.
|
16834 |
29-Jun-1996 |
dg |
Make sure we have an object in the map entry before trying to trim pages from it.
|
16750 |
26-Jun-1996 |
dyson |
This commit does a couple of things: Re-enables the RSS limiting, and the routine is now tail-recursive, making it much more safe (eliminates the possibility of kernel stack overflow.) Also, the RSS limiting is a little more intelligent about finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features: 1) Enable/disable lru pageout algorithm. vm.pageout_algorithm = 0, default algorithm that works well, especially using X windows and heavy memory loading. Can have adverse effects, sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much better than clock, etc. Does not work as well as the default algorithm in general. Certain memory "malloc" type benchmarks work a little better with this setting.
Please give me feedback on the performance results associated with these.
2) Enable/disable swapping. vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping degrades performance.
The config option "NO_SWAPPING" is still operative, and takes precedence over the sysctl. If "NO_SWAPPING" is specified, the sysctl still exists, but "vm.swapping_enabled" is hard-wired to "0".
Each of these can be changed "on the fly."
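On a FreeBSD system, these tunables are driven through sysctl(8); a sketch (FreeBSD-specific, values illustrative):

```sh
# Show the current pageout algorithm and swapping settings.
sysctl vm.pageout_algorithm vm.swapping_enabled

# Try the near-LRU pageout algorithm.
sysctl vm.pageout_algorithm=1

# Disable swapping where it degrades performance
# (hard-wired to 0 if the kernel was built with NO_SWAPPING).
sysctl vm.swapping_enabled=0
```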
|
16679 |
25-Jun-1996 |
dyson |
Fix some serious problems with limits checking in the sbrk(2)/brk(2) code. Reviewed by: bde
|
16664 |
24-Jun-1996 |
dyson |
Remove RSS limiting until I rewrite the code to be non-recursive. The code can overrun the kernel stack under very stressful conditions.
|
16562 |
21-Jun-1996 |
dyson |
Improve algorithm for page hash queue. It was previously about as bad as it could be. This algorithm appears to improve fork performance (barely) measurably.
|
16415 |
17-Jun-1996 |
dyson |
Several bugfixes/improvements:
1) Make it much less likely to miss a wakeup in vm_page_free_wakeup.
2) Create a new entry point into pmap: pmap_ts_referenced, which eliminates the need to scan the pv lists twice in many cases. Perhaps there is a lot more to do here to minimize pv list manipulation.
3) Minor improvements to vm_pageout, including the use of pmap_ts_referenced.
4) Major changes and code improvement to pmap. This code has had several serious bugs in page table page manipulation. In order to simplify the problem, and hopefully solve it once and for all, page table pages are no longer "managed" with the pv list stuff. Page table pages are only (mapped and held/wired) or (free and unused) now. Page table pages are never inactive, active or cached. These changes have probably fixed the hold count problems, but if they haven't, then the code is simpler anyway for future bugfixing.
5) The pmap code has been sorely in need of re-organization, and I have taken a first (of probably many) steps. Please tell me if you have any ideas.
|
16409 |
16-Jun-1996 |
dyson |
Various bugfixes/cleanups from me and others:
1) Remove potential race conditions on waking up in vm_page_free_wakeup by making sure that it is at splvm().
2) Fix another bug in vm_map_simplify_entry.
3) Be more complete about converting from default to swap pager when an object grows to be large enough that there can be a problem with data structure allocation under low memory conditions.
4) Make some madvise code more efficient.
5) Added some comments.
|
16377 |
14-Jun-1996 |
dg |
Move a case of PG_MAPPED being set before a pmap_enter(). This will likely make no difference, but it will make it consistent with other uses of PG_MAPPED.
|
16324 |
12-Jun-1996 |
dyson |
Fix a very significant cnt.v_wire_count leak in vm_page.c, and some minor leaks in pmap.c. Bruce Evans made me aware of this problem.
|
16318 |
12-Jun-1996 |
dyson |
Fix some serious errors in vm_map_simplify_entries.
|
16274 |
10-Jun-1996 |
dyson |
Mostly superficial code improvements, add a diagnostic. The code improvements include significant simplification of the reservation of the swap pager control blocks for reads. Add a panic for an inconsistent swap pager control block count.
|
16268 |
10-Jun-1996 |
dyson |
Keep the vm_fault/vm_pageout from getting into an "infinite paging loop", by reserving "cached" pages before waking up the pageout daemon. This will reserve the faulted page, and keep the system from thrashing itself to death given this condition.
|
16197 |
08-Jun-1996 |
dyson |
Adjust the threshold for blocking on movement of pages from the cache queue in vm_fault.
Move the PG_BUSY in vm_fault to the correct place.
Remove redundant/unnecessary code in pmap.c.
Properly block on rundown of page table pages, if they are busy.
I think that the VM system is in pretty good shape now, and the following individuals (among others, in no particular order) have helped with this recent bunch of bugs, thanks! If I left anyone out, I apologize!
Stephen McKay, Stephen Hocking, Eric J. Chet, Dan O'Brien, James Raynard, Marc Fournier.
|
16122 |
05-Jun-1996 |
dyson |
Keep page-table pages from ever being sensed as dirty. This should fix some problems with the page-table page management code, since it can't deal with the notion of page-table pages being paged out or in transit. Also, clean up some stylistic issues per some suggestions from Stephen McKay.
|
16058 |
01-Jun-1996 |
dyson |
Disable madvise optimizations for device pager objects (some of the operations don't work with FICTITIOUS pages.) Also, close a window between PG_MANAGED and pmap_enter that can mess up the accounting of the managed flag. This problem could likely cause a hold_count error for page table pages.
|
16026 |
31-May-1996 |
dyson |
This commit is dual-purpose, to fix more of the pageout daemon queue corruption problems, and to apply Gary Palmer's code cleanups. David Greenman helped with these problems also. There is still a hang problem using X in small memory machines.
|
15980 |
29-May-1996 |
dyson |
Correct some unfortunately chosen constants, otherwise, not enough pages are calculated for deferred allocation of swap pager data structures. This is a follow-on to the previous commit to this file.
|
15979 |
29-May-1996 |
dyson |
After careful review by David Greenman and myself, David found a case where blocking can occur, thereby giving other processes a chance to modify the queue where a page resides. This could cause numerous process and system failures.
|
15978 |
29-May-1996 |
dyson |
Make sure that pageout deadlocks cannot occur. There is a problem that the data structures needed to support the swap pager can take enough space to fully deplete system memory, and cause a deadlock. This change keeps large objects from being filled with dirty pages without the appropriate swap pager data structures. Right now, default objects greater than 1/4 the size of available system memory are converted to swap objects, thereby eliminating the risk of deadlock.
|
15905 |
26-May-1996 |
dyson |
Fix a couple of problems in the pageout_scan routine. First, there is a condition when blocking can occur, and the daemon did not check properly for a page remaining on the expected queue. Additionally, the inactive target was being set much too large for small memory machines. It is now being calculated based upon the amount of user memory available on every pageout daemon run. Another problem was that if memory was very low, the pageout daemon could fail repeatedly to traverse the inactive queue.
|
15904 |
26-May-1996 |
dyson |
I think this covers (fixes) the last batch of freeing active/held/busy page problems. BY MISTAKE, the vm_page_unqueue (or equivalent) was removed from the vm_fault code. Really bad things appear to happen if a page is on a queue while it is being faulted.
|
15890 |
24-May-1996 |
dyson |
Add an assert to vm_page_cache. We should never cache a dirty page.
|
15889 |
24-May-1996 |
dyson |
Add apparently needed splvm protection to the active queue, and eliminate an unnecessary test for dirty pages if it is already known to be dirty.
|
15888 |
24-May-1996 |
dyson |
Eliminate inefficient check for dirty pages for pages in the PQ_CACHE queue. Also, modify the MADV_FREE policy (it probably still isn't the final version.)
|
15887 |
24-May-1996 |
dyson |
Make the conversion from the default pager to swap pager more robust in the face of low memory conditions.
|
15876 |
23-May-1996 |
dyson |
Eliminate a vm_page_free busy panic in kern_malloc.
|
15873 |
23-May-1996 |
dyson |
Initial support for MADV_FREE, support for pages whose contents we no longer care about. This gives us a lot of the advantage of freeing individual pages through munmap, but with almost none of the overhead.
|
15841 |
21-May-1996 |
dyson |
After reviewing the previous commit to vm_object, the page protection is never necessary, not just for PG_FICTITIOUS pages.
|
15836 |
21-May-1996 |
dyson |
Don't protect non-managed pages off during object rundown. This fixes a hang that occurs under certain circumstances when exiting X.
|
15819 |
19-May-1996 |
dyson |
Initial support for mincore and madvise. Both are almost fully supported, except madvise does not page in with MADV_WILLNEED, and MADV_DONTNEED doesn't force dirty pages out.
|
15811 |
18-May-1996 |
dyson |
One more file missing from the mega-commit. This inlines some very simple routines in vm_page.c, so that an unnecessary subroutine call is removed.
|
15810 |
18-May-1996 |
dyson |
File mistakenly left out of the previous mega-commit. This provides a global definition for 'exec_map'.
|
15809 |
18-May-1996 |
dyson |
This set of commits to the VM system does the following, and contain contributions or ideas from Stephen McKay <syssgm@devetir.qld.gov.au>, Alan Cox <alc@cs.rice.edu>, David Greenman <davidg@freebsd.org> and me:
More usage of the TAILQ macros. Additional minor fix to queue.h. Performance enhancements to the pageout daemon. Addition of a wait in the case that the pageout daemon has to run immediately. Slightly modify the pageout algorithm.
Significant revamp of the pmap/fork code:
1) PTE's and UPAGES are NO LONGER in the process's map.
2) PTE's and UPAGES reside in their own objects.
3) TOTAL elimination of recursive page table page faults.
4) The page directory now resides in the PTE object.
5) Implemented pmap_copy, thereby speeding up fork time.
6) Changed the pv entries so that the head is a pointer and not an entire entry.
7) Significant cleanup of pmap_protect and pmap_remove.
8) Removed significant amounts of machine-dependent fork code from vm_glue and pushed much of it into the machine-dependent pmap module.
9) More complete support for reusing already-zeroed pages (page table pages and page directories).
Performance and code cleanups in vm_map:
1) Improved and simplified allocation of map entries.
2) Improved vm_map_copy code.
3) Corrected some minor problems in the simplify code.
Implemented splvm (a combination of splbio and splimp); the VM code now seldom uses splhigh. Improved the speed of, and simplified, kmem_malloc. Minor mod to vm_fault to avoid using pre-zeroed pages for objects with backing objects, in addition to the already-existing condition of having a vnode. (If there is a backing object, there will likely be a COW; with a COW, it isn't necessary to start with a pre-zeroed page.) Minor reorganization of the source to perhaps improve locality of reference.
|
15722 |
10-May-1996 |
wollman |
Allocate mbufs from a separate submap so that NMBCLUSTERS works as expected.
|
15583 |
03-May-1996 |
phk |
Another sweep over the pmap/vm macros, this time with more focus on the usage. I'm not satisfied with the naming, but now at least there is less bogus stuff around.
|
15543 |
02-May-1996 |
phk |
Removed: CLBYTES, PD_SHIFT, PGSHIFT, NBPG, PGOFSET, CLSIZELOG2, CLSIZE, pdei(), ptei(), kvtopte(), ptetov(), ispt(), ptetoav(), &c &c.
New: NPDEPG
Major macro cleanup.
|
15534 |
02-May-1996 |
phk |
KGDB is dead. It may come back one day if somebody does it.
|
15459 |
29-Apr-1996 |
dyson |
Move the map entry allocations from the kmem_map to the kernel_map. As a side effect, correct the associated object offset.
|
15367 |
24-Apr-1996 |
dyson |
This fixes kmem_malloc/kmem_free (and malloc/free of objects of > 8K). A page index was calculated incorrectly in vm_kern, and vm_object_page_remove removed pages that should not have been.
|
15203 |
11-Apr-1996 |
bde |
Fixed a spl hog. The vmdaemon process ran entirely at splhigh. It sometimes disabled clock interrupts for 60 msec or more on a P133. Clock interrupts were lost ...
Reviewed by: dyson
|
15153 |
09-Apr-1996 |
dyson |
Reinstitute the map lock for processes being swapped out. This is needed because of the vm_fault that is used to bring the page table page for the kernel stack (UPAGES) back in. The consequence of the previous, incorrect change was a system hang.
|
15134 |
08-Apr-1996 |
dyson |
Map lock checks not needed anymore for swapping out. We don't use map operations for it anymore. Certain deadlocks should never happen anymore.
|
15117 |
07-Apr-1996 |
bde |
Removed never-used #includes of <machine/cpu.h>. Many were apparently copied from bad examples.
|
15018 |
03-Apr-1996 |
dyson |
Fixed a problem that the UPAGES of a process were being run down in a suboptimal manner. I had also noticed some panics that appeared to be at least superficially caused by this problem. Also, included are some minor mods to support more general handling of page table page faulting. More details in a future commit.
|
14900 |
29-Mar-1996 |
dg |
Revert to previous calculation of vm_object_cache_max: it simply works better in most real-world cases.
|
14882 |
28-Mar-1996 |
bde |
Undid last revision. It duplicated part of second last revision.
|
14879 |
28-Mar-1996 |
scrappy |
devfs_add_devsw() -> devfs_add_devswf() modifications
Reviewed by: julian@freebsd.org
|
14866 |
28-Mar-1996 |
dyson |
Add a function prototype for pmap_prefault.
|
14865 |
28-Mar-1996 |
dyson |
VM performance improvements, and reorder some operations in VM fault in anticipation of a fix in pmap that will allow the mlock system call to work without panicing the system.
|
14864 |
28-Mar-1996 |
dyson |
More map_simplify fixes from Alan Cox. This very significantly improves performance when the map has been chopped up. The map simplify operations really work now.
Reviewed by: dyson
Submitted by: Alan Cox <alc@cs.rice.edu>
|
14854 |
27-Mar-1996 |
bde |
Added drum device.
Submitted by: partly by "Marc G. Fournier" <scrappy@ki.net>
|
14693 |
19-Mar-1996 |
dyson |
Fix a reference count problem seen when unmounting filesystems that are backed by a VMIO device. We mark the underlying object non-persistent, and account for the reference count that the VM system maintains for the special device close. This should fix the removable-device problem.
|
14638 |
16-Mar-1996 |
dg |
Force device mappings to always be shared. It doesn't make sense for them to ever be COW, and we need the mappings to be shared for backward compatibility.
Reviewed by: dyson
|
14610 |
13-Mar-1996 |
dyson |
This commit is as a result of a comment by Alan Cox (alc@cs.rice.edu) regarding the "real" problem with maps that we have been having over the last few weeks. He noted that the first_free pointer was left dangling in certain circumstances -- and he was right!!! This should fix the map problems that we were having, and also give us the advantage of being able to simplify maps more aggressively.
|
14589 |
12-Mar-1996 |
dyson |
Fix the map corruption problem that appears as a u_map allocation error.
|
14574 |
12-Mar-1996 |
dyson |
Allow mmap'ed devices to work correctly across forks. The sanest solution appeared to be to allow the child to maintain the same mapping as the parent.
|
14531 |
11-Mar-1996 |
hsu |
For Lite2: proc LIST changes. Reviewed by: davidg & bde
|
14432 |
09-Mar-1996 |
dyson |
Delay forking a process until there are more pages available. It was possible to deadlock with the low threshold that we had used.
|
14431 |
09-Mar-1996 |
dyson |
Modify a threshold for waking up the pageout daemon. Also, add a consistency check for making sure that held pages aren't freed (DG).
|
14430 |
09-Mar-1996 |
dyson |
Add a missing initialization of the hold_count for device pager fictitious pages.
|
14429 |
09-Mar-1996 |
dyson |
Fix a calculation for a paging parameter.
|
14428 |
09-Mar-1996 |
dyson |
Fix two problems: The pmap_remove in vm_map_clean incorrectly unmapped the entire map entry. The new vm_map_simplify_entry code had an error (the offset of the combined map entry was not set correctly.) Submitted by: Alan Cox <alc@cs.rice.edu>
|
14427 |
09-Mar-1996 |
dyson |
Set the page valid bits in fewer places, as opposed to being scattered in various places.
|
14396 |
06-Mar-1996 |
dyson |
Fix a problem in the swap pager that caused some of the pages that were paged in under low-swap-space conditions to lose both their backing store and their dirty bits. This would cause pages to be demand-zeroed under certain low-VM-space conditions, with consequential sig-11's or sig-10's. This situation was made worse lately when the swap space reclaim threshold was increased.
|
14366 |
04-Mar-1996 |
dyson |
Fix a problem that pages in a mapped region were not always properly invalidated. Now we traverse the object shadow chain properly.
|
14364 |
03-Mar-1996 |
dyson |
In order to fix some concurrency problems with the swap pager early on in the FreeBSD development, I had made a global lock around the rlist code. This was bogus, and now the lock is maintained on a per resource list basis. This now allows the rlist code to be used for almost any non-interrupt level application.
|
14360 |
03-Mar-1996 |
peter |
Remove the #ifdef notyet from the prototype of vm_map_simplify. John re-enabled the function but missed the prototype, causing a warning.
|
14325 |
02-Mar-1996 |
peter |
Oops.. I nearly forgot the actual core of the length/rounding/etc fixes that Bruce asked for.
These still are not quite perfect, and in particular, it can get upset on extreme boundary cases (addr = 0xfff, len = 0xffffffff, which would end up mapping a single page rather than failing), but this is better code than I committed before.
(note, the VM system does not (apparently) support single mmap segment sizes above 0x80000000 anyway)
|
14316 |
02-Mar-1996 |
dyson |
1) Eliminate unnecessary bzero of UPAGES. 2) Eliminate unnecessary copying of pages during/after forks. 3) Add user map simplification.
|
14221 |
23-Feb-1996 |
peter |
kern_descrip.c: add fdshare()/fdcopy() kern_fork.c: add the tiny bit of code for rfork operation. kern/sysv_*: shmfork() takes one less arg, it was never used. sys/shm.h: drop "isvfork" arg from shmfork() prototype sys/param.h: declare rfork args.. (this is where OpenBSD put it..) sys/filedesc.h: protos for fdshare/fdcopy. vm/vm_mmap.c: add minherit code, add rounding to mmap() type args where it makes sense. vm/*: drop unused isvfork arg.
Note: this rfork() implementation copies the address space mappings, it does not connect the mappings together. I.e.: once the two processes have split, the pages may be shared, but the address space is not. If one does a mmap() etc., it does not appear in the other. This makes it not useful for pthreads, but it is useful in its own right for having light-weight threads in a static shared address space.
Obtained from: Original by Ron Minnich, extended by OpenBSD
|
14178 |
22-Feb-1996 |
dg |
Add a "NO_SWAPPING" option to disable swapping. This was originally done to help diagnose a problem on wcarchive (where the kernel stack was sometimes not present), but is useful in its own right since swapping actually reduces performance on some systems (such as wcarchive). Note: swapping in this context means making the U pages pageable and has nothing to do with generic VM paging, which is unaffected by this option.
Reviewed by: <dyson>
|
14036 |
11-Feb-1996 |
dyson |
Fixed a really bogus problem with msync ripping pages away from objects before they were written. Also, don't allow processes without write access to remove pages from vm_objects.
|
13909 |
04-Feb-1996 |
dyson |
Changed vm_fault_quick in vm_machdep.c to be global. Needed for new pipe code.
|
13790 |
31-Jan-1996 |
dg |
"out of space" -> "out of swap space".
|
13788 |
31-Jan-1996 |
dg |
Improved killproc() log message and made it and the other similar message tolerant of p_ucred being invalid. Starting using killproc() where appropriate.
|
13786 |
31-Jan-1996 |
dg |
Print a more descriptive message when the mb_map is filled (out of mbuf clusters), and tell the operator what to do about it (increase maxusers).
|
13765 |
30-Jan-1996 |
mpp |
Fix a bunch of spelling errors in the comment fields of a bunch of system include files.
|
13705 |
29-Jan-1996 |
dg |
Added a check/panic for vm_map_find failing to find space for the page tables/u-pages when forking. This is a "can't happen" case. :-)
|
13642 |
27-Jan-1996 |
bde |
Added a `boundary' arg to vm_page_alloc_contig(). Previously the only way to avoid crossing a 64K DMA boundary was to specify an alignment greater than the size even when the alignment didn't matter, and for sizes larger than a page this reduced the chance of finding enough contiguous pages. E.g., allocations of 8K not crossing a 64K boundary previously had to be allocated on 8K boundaries; now they can be allocated on any 4K boundary except (64 * n + 60)K.
Fixed bugs in vm_page_alloc_contig():
- the last page wasn't allocated for sizes smaller than a page.
- failures of kmem_alloc_pageable() weren't handled.
Mutated vm_page_alloc_contig() to create a more convenient interface named contigmalloc(). This is the same as the one in 1.1.5 except it has `low' and `high' args, and the `alignment' and `boundary' args are multipliers instead of masks.
|
13628 |
25-Jan-1996 |
phk |
Don't use %r, we haven't got it anymore.
Submitted by: bde
|
13490 |
19-Jan-1996 |
dyson |
Eliminated many redundant vm_map_lookup operations for vm_mmap.
Speedup for vfs_bio: addition of a routine bqrelse to greatly diminish overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do a lot of redundant calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally, fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE), thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause that happens every 30 seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO buffers.
|
13228 |
04-Jan-1996 |
wollman |
Convert DDB to new-style option.
|
13226 |
04-Jan-1996 |
wollman |
Convert SYSV IPC to new-style options. (I hope I got everything...) The LKMs will need an extra file, to come later.
|
13223 |
04-Jan-1996 |
dg |
Increased vm_object_cache_max by about 50% to yield better utilization of memory when lots of small files are cached.
Reviewed by: dyson
|
13122 |
30-Dec-1995 |
peter |
recording cvs-1.6 file death
|
12954 |
21-Dec-1995 |
julian |
i386/i386/conf.c is no longer needed: remove it from files.i386, redistribute a few last routines to better places, and shoot the file.
I haven't actually 'deleted' the file yet, to give people time to have done a config. I.e., they are likely to have done one in a week or so, so I'll remove it then. It's now empty, which makes the question of a USL copyright rather moot.
|
12914 |
17-Dec-1995 |
dyson |
Fix paging from ext2fs (and other fs w/block size < PAGE_SIZE). This should fix kern/900.
|
12905 |
17-Dec-1995 |
bde |
Cleaned up prototypes in pmap headers: removed ones for nonexistent functions; moved misplaced ones; restored most of KNFish formatting from 4.4lite version; removed bogus __BEGIN/END_DECLS.
|
12904 |
17-Dec-1995 |
bde |
Fixed 1TB filesize changes. Some pindexes had bogus names and types but worked because vm_pindex_t is indistinguishable from vm_offset_t.
|
12820 |
14-Dec-1995 |
phk |
Another mega commit to staticize things.
|
12819 |
14-Dec-1995 |
phk |
A Major staticize sweep. Generates a couple of warnings that I'll deal with later. A number of unused vars removed. A number of unused procs removed or #ifdefed.
|
12813 |
13-Dec-1995 |
julian |
devsw tables are now arrays of POINTERS to struct [cb]devsw. Seems to work here just fine, though I can't check every file that changed due to limited h/w; however, I've checked enough to be pretty happy with the code.
WARNING: struct lkm[mumble] has changed, so it might be an idea to recompile any lkm-related programs.
|
12808 |
13-Dec-1995 |
dyson |
There was a bug that the size for an msync'ed region was not rounded up. The effect of this was that msync with a size would generally sync 1 page less than it should. This problem was brought to my attention by Darrel Herbst <dherbst@gradin.cis.upenn.edu> and Ron Minnich <rminnich@sarnoff.com>.
|
12779 |
11-Dec-1995 |
dyson |
Some new anti-deadlock code ended up messing up the paging stats. A modified version of the code is now in place, and gausspage performance is back up to where it should be.
|
12778 |
11-Dec-1995 |
dyson |
Some DIAGNOSTIC code was enabled all of the time in error. The diagnostic code is now conditional on #ifdef DIAGNOSTIC again.
|
12767 |
11-Dec-1995 |
dyson |
Changes to support 1Tb filesizes. Pages are now named by an (object,index) pair instead of (object,offset) pair.
|
12737 |
10-Dec-1995 |
bde |
Replaced nxdump by nodump (if the dump function gets called, then the device must be configured, so ENXIO is a bogus errno).
Replaced zerosize by nopsize. zerosize was a temporary alias.
|
12726 |
10-Dec-1995 |
bde |
Restored used includes of <vm/vm_extern.h>.
|
12710 |
10-Dec-1995 |
bde |
Moved the declaration of boolean_t from <vm/vm_param.h> to <sys/types.h> (if KERNEL is defined). This allows removing bogus dependencies on vm stuff in several places (e.g., ddb) and stops <vm_param.h> from depending on <vm_param.h>
Added declaration of boolean_t to <vm/vm.h> (if KERNEL is not defined). It never belonged in <vm/vm_param.h>. Unfortunately, it is required for some vm headers that are included by applications.
Deleted declarations of TRUE and FALSE from <vm/vm_param.h>. They are defined in <sys/param.h> if KERNEL is defined and we'll soon find out if any applications depend on them being defined in a vm header.
|
12678 |
08-Dec-1995 |
phk |
Julian forgot to make the *devsw structures static.
|
12675 |
08-Dec-1995 |
julian |
Pass 3 of the great devsw changes most devsw referenced functions are now static, as they are in the same file as their devsw structure. I've also added DEVFS support for nearly every device in the system, however many of the devices have 'incorrect' names under DEVFS because I couldn't quickly work out the correct naming conventions. (but devfs won't be coming on line for a month or so anyhow so that doesn't matter)
If you "OWN" a device which would normally have an entry in /dev, then search for the devfs_add_devsw() entries and munge them to make them right. Check out similar devices to see what I might have done in them if you can't see what's going on. For a laugh, compare conf.c and conf.h before and after. :) I have not done DEVFS entries for any DISKSLICE devices yet, as that will be a much more complicated job (pass 5 :)
Pass 4 will be to make the devsw tables of type (cdevsw *) rather than (cdevsw). Seems to work here. Complaints to the usual places. :)
|
12662 |
07-Dec-1995 |
dg |
Untangled the vm.h include file spaghetti.
|
12642 |
05-Dec-1995 |
bde |
Moved the declaration of vm_object_t from <vm/vm.h> to <sys/types.h> (if KERNEL is defined). This allows removing the #includes of vm stuff in vnode_if.h, which will speed up the compilation of LINT by about 5%.
|
12623 |
04-Dec-1995 |
phk |
A major sweep over the sysctl stuff.
Move a lot of variables home to their own code (In good time before xmas :-)
Introduce the string description of format.
Add a couple more functions to poke into these marvels, while I try to decide what the correct interface should look like.
Next is adding vars on the fly, and sysctl looking at them too.
Removed a tiny bit of defunct and #ifdef'ed unused code in swapgeneric.
|
12610 |
03-Dec-1995 |
bde |
Fixed the type mismatch in the check for the bogus mmap function `nullop'. The test should never succeed and should go away. Temporarily print a warning if it does succeed.
|
12591 |
03-Dec-1995 |
bde |
Completed function declarations and/or added prototypes.
Staticized some functions.
__purified some functions. Some functions were bogusly declared as returning `const'. This hasn't done anything since gcc-2.5. For later versions of gcc, the equivalent is __attribute__((const)) at the end of function declarations.
|
12569 |
02-Dec-1995 |
bde |
Finished (?) cleaning up sysinit stuff.
|
12521 |
29-Nov-1995 |
julian |
If you're going to mechanically replicate something in 50 files it's best to not have a (compiles cleanly) typo in it! (sigh)
|
12517 |
29-Nov-1995 |
julian |
OK, that's it.. That's EVERY SINGLE driver that has an entry in conf.c.. my next trick will be to define cdevsw[] and bdevsw[] as empty arrays and remove all those DAMNED defines as well..
Each of these drivers has a SYSINIT linker set entry that comes in very early and asks the driver to add its own entry to the two devsw[] tables.
Some slight reworking of the commits from yesterday (added the SYSINIT stuff and some usually-wrong but token DEVFS entries to all these devices).
BTW, does anyone know where the 'ata' entries in conf.c actually reside? It seems we don't actually have an ataopen(), etc.
If you want to add a new device in conf.c please make sure I know so I can keep it up to date too..
as before, this is all dependent on #if defined(JREMOD) (and #ifdef DEVFS in parts)
|
12453 |
21-Nov-1995 |
bde |
Completed function declarations and/or added prototypes.
|
12423 |
20-Nov-1995 |
phk |
Remove unused vars & funcs, make things static, protoize a little bit.
|
12325 |
16-Nov-1995 |
bde |
Fixed recent staticizations. Some prototypes for static functions were left in headers and not staticized.
|
12300 |
14-Nov-1995 |
phk |
staticize.
|
12286 |
14-Nov-1995 |
phk |
Move all the VM sysctl stuff home where it belongs.
|
12259 |
13-Nov-1995 |
dg |
Fixed up a comment and removed some #if 0'd code.
|
12226 |
12-Nov-1995 |
dg |
Moved vm_map_lock call to inside the splhigh protection in vm_map_find(). This closes a probably rare but nonetheless real window that would result in a process hanging or the system panicing.
Reviewed by: dyson, davidg Submitted by: kato@eclogite.eps.nagoya-u.ac.jp (KATO Takenori)
|
12221 |
12-Nov-1995 |
bde |
Included <sys/sysproto.h> to get central declarations for syscall args structs and prototypes for syscalls.
Ifdefed duplicated decentralized declarations of args structs. It's convenient to have this visible but they are hard to maintain. Some are already different from the central declarations. 4.4lite2 puts them in comments in the function headers but I wanted to avoid the large changes for that.
|
12206 |
11-Nov-1995 |
bde |
Fixed type of obreak(). The args struct member name conflicted with the (better) machine generated one in <sys/sysproto.h>.
|
12128 |
06-Nov-1995 |
dg |
Initialize lock struct entries explicitly rather than calling bzero().
|
12118 |
06-Nov-1995 |
bde |
Replaced bogus macros for dummy devswitch entries by functions. These functions went away:
enosys (hasn't been used for some time) enxio enodev enoioctl (was used only once, actually for a vop)
if_tun.c: Continued cleaning up...
conf.h: Probably fixed the type of d_reset_t. It is hard to tell the correct type because there are no non-dummy device reset functions.
Removed last vestige of ambiguous sleep message strings.
|
12110 |
05-Nov-1995 |
dyson |
Greatly simplify the msync code. Eliminate complications in vm_pageout for msyncing. Remove a bug that manifests itself primarily on NFS (the dirty range on the buffers is not set on msync.)
|
12006 |
02-Nov-1995 |
dg |
Move page fixups (pmap_clear_modify, etc) that happen after paging input completes out of vm_fault and into the pagers. This get rid of some redundancy and improves the architecture.
Reviewed by: John Dyson <dyson>
|
11943 |
30-Oct-1995 |
bde |
Don't pass an extra trailing arg to some functions.
Added the prototypes that found this bug.
|
11709 |
23-Oct-1995 |
dyson |
Get rid of machine-dependent NBPG and replace with PAGE_SIZE.
|
11708 |
23-Oct-1995 |
dyson |
Remove of now unused PG_COPYONWRITE.
|
11705 |
23-Oct-1995 |
dyson |
First phase of removing the PG_COPYONWRITE flag, and an architectural cleanup of mapping files.
|
11701 |
23-Oct-1995 |
dyson |
Finalize GETPAGES layering scheme. Move the device GETPAGES interface into specfs code. No need at this point to modify the PUTPAGES stuff except in the layered-type (NULL/UNION) filesystems.
|
11621 |
21-Oct-1995 |
dyson |
Implement mincore system call.
|
11576 |
19-Oct-1995 |
dg |
Fix initialization of "bsize" in vnode_pager_haspage(). It must happen after the check for the mount point still existing or else the system will panic if someone forcibly unmounted the filesystem.
|
11526 |
16-Oct-1995 |
dyson |
Remove an unnecessary tsleep in the swapin code. This tsleep can defer swapping in processes and is just not the right thing to do.
|
11317 |
07-Oct-1995 |
dg |
Fix argument passing to the "freeer" routine. Added some prototypes. (bde) Moved extern declaration of swap_pager_full into swap_pager.h and out of the various files that reference it. (davidg)
Submitted by: bde & davidg
|
11260 |
06-Oct-1995 |
phk |
Avoid a 64bit divide.
|
11194 |
05-Oct-1995 |
bde |
Fix pollution of application namespace by declarations of kernel functions. The application header <sys/user.h> includes <vm/vm.h> which includes <vm/lock.h>...
vm.h: Don't include <machine/cpufunc.h>. It is already included by <sys/systm.h> in the kernel and isn't designed to be included by applications (the 2.1 version causes a syntax error in C++ and the current version has initializers that are invalid in strict C++).
lock.h: Only declare kernel functions if KERNEL is defined.
|
10989 |
24-Sep-1995 |
dyson |
Perform more checking for proper loading of the UPAGES when a process is swapped in. Also, remove unnecessary map locking/unlocking during selection of processes to be swapped out.
This code might afford proper panics as opposed to spontaneous reboots on certain systems. This should allow us to debug these problems better.
|
10988 |
24-Sep-1995 |
dyson |
Significantly simplify the fault clustering code. After some analysis by David Greenman, it has been determined that the more sophisticated code only made a very minor difference in fault performance. Therefore, this code eliminates some of the complication of the fault code, decreasing the amount of CPU used to scan shadow chains.
|
10984 |
24-Sep-1995 |
dg |
Check that the swap block is valid before including it in a cluster.
Submitted by: John Dyson
|
10835 |
17-Sep-1995 |
dg |
Check the return value from vm_map_pageable() when mapping the process's UPAGES and associated page table page. Panic on error. This is less than optimal and will be fixed in the future, but is better than the old behavior of panicking with a "kernel page directory invalid" in pmap_enter.
|
10728 |
14-Sep-1995 |
dyson |
Fixed a typo in vm_fault_additional_pages.
|
10702 |
12-Sep-1995 |
dyson |
Fix really bogus casting of a block number to a long. Also change the comparison from a "< 0" to "== -1" like it should be.
|
10670 |
11-Sep-1995 |
dyson |
Make sure that the prezero flag is cleared when needed.
|
10669 |
11-Sep-1995 |
dyson |
Fix an error that can cause attempted reading beyond the end of file.
|
10668 |
11-Sep-1995 |
dyson |
Code cleanup and minor performance improvement in the faultin cluster code.
|
10653 |
09-Sep-1995 |
dg |
Fixed init functions argument type - caddr_t -> void *. Fixed a couple of compiler warnings.
|
10579 |
06-Sep-1995 |
dyson |
Fixed a sign reversal problem -- might have caused some Sig-11s that people have been seeing.
|
10576 |
06-Sep-1995 |
dyson |
Minor performance improvements, additional prototype for additional exported symbol.
|
10556 |
04-Sep-1995 |
dyson |
Allow the fault code to use additional clustering info from both bmap and the swap pager. Improved fault clustering performance.
|
10551 |
04-Sep-1995 |
dyson |
Added VOP_GETPAGES/VOP_PUTPAGES and also the "backwards" block count for VOP_BMAP. Updated affected filesystems...
|
10548 |
03-Sep-1995 |
dyson |
Machine independent changes to support pre-zeroed free pages. This significantly improves demand-zero performance.
|
10544 |
03-Sep-1995 |
dyson |
Added prototype for new routine "vm_page_set_validclean" and initial declarations for the prezeroed pages mechanism.
|
10542 |
03-Sep-1995 |
dyson |
New subroutine "vm_page_set_validclean" for a vfs_bio improvement.
|
10358 |
28-Aug-1995 |
julian |
Reviewed by: julian, with quick glances by bruce and others
Submitted by: terry (Terry Lambert)
This is a composite of 3 patch sets submitted by terry. They are: new low-level init code that supports loadable modules better; some cleanups in the namei code to help terry in 16-bit character support; and some changes to the mount-root code to make it a little more modular.
NOTE: mounting root off cdrom or NFS MIGHT be broken as I haven't been able to test those cases..
Certainly mounting root off disk still works just fine. mfs should work but is untested (tomorrow's task).
The low level init stuff includes a total rewrite of init_main.c to make it possible for new modules to have an init phase by simply adding an entry to a TEXT_SET (or is it DATA_SET) list. thus a new module can be added to the kernel without editing any other files other than the 'files' file.
|
10345 |
26-Aug-1995 |
bde |
Change vm_object_print() to have the correct number and type of args for a ddb command.
|
10344 |
26-Aug-1995 |
bde |
Change vm_map_print() to have the correct number and type of args for a ddb command.
|
10080 |
16-Aug-1995 |
bde |
Make everything except the unsupported network sources compile cleanly with -Wnested-externs.
|
9759 |
29-Jul-1995 |
bde |
Eliminate sloppy common-style declarations. There should be none left for the LINT configuration.
|
9582 |
20-Jul-1995 |
dg |
#if 0'd one of the DIAGNOSTIC checks in vm_page_alloc(). It was too expensive for "normal" use.
|
9548 |
16-Jul-1995 |
dg |
1) Merged swpager structure into vm_object. 2) Changed swap_pager internal interfaces to cope w/#1. 3) Eliminated object->copy as we no longer have copy objects. 4) Minor stylistic changes.
|
9514 |
13-Jul-1995 |
dg |
Added a copyright to this file.
|
9513 |
13-Jul-1995 |
dg |
Oops, forgot to add the "default" pager files...
NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct proc or any VM system structure will have to be rebuilt!!!
Much needed overhaul of the VM system. Included in this first round of changes:
1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages, haspage, and sync operations are supported. The haspage interface now provides information about clusterability. All pager routines now take struct vm_object's instead of "pagers".
2) Improved data structures. In the previous paradigm, there is constant confusion caused by pagers being both a data structure ("allocate a pager") and a collection of routines. The idea of a pager structure has essentially been eliminated. Objects now have types, and this type is used to index the appropriate pager. In most cases, items in the pager structure were duplicated in the object data structure and thus were unnecessary. In the few cases that remained, an un_pager structure union was created in the object to contain these items.
3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now be removed. For instance, vm_object_enter(), vm_object_lookup(), vm_object_remove(), and the associated object hash list were some of the things that were removed.
4) simple_lock's removed. Discussion with several people reveals that the SMP locking primitives used in the VM system aren't likely the mechanism that we'll be adopting. Even if it were, the locking that was in the code was very inadequate and would have to be mostly re-done anyway. The locking in a uni-processor kernel was a no-op but went a long way toward making the code difficult to read and debug.
5) Places that attempted to kludge-up the fact that we don't have kernel thread support have been fixed to reflect the reality that we are really dealing with processes, not threads. The VM system didn't have complete thread support, so the comments and mis-named routines were just wrong. We now use tsleep and wakeup directly in the lock routines, for instance.
6) Where appropriate, the pagers have been improved, especially in the pager_alloc routines. Most of the pager_allocs have been rewritten and are now faster and easier to maintain.
7) The pagedaemon pageout clustering algorithm has been rewritten and now tries harder to output an even number of pages before and after the requested page. This is sort of the reverse of the ideal pagein algorithm and should provide better overall performance.
8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup have been removed. Some other unnecessary casts have also been removed.
9) Some almost useless debugging code removed.
10) Terminology of shadow objects vs. backing objects straightened out. The fact that the vm_object data structure escentially had this backwards really confused things. The use of "shadow" and "backing object" throughout the code is now internally consistent and correct in the Mach terminology.
11) Several minor bug fixes, including one in the vm daemon that caused 0 RSS objects to not get purged as intended.
12) A "default pager" has now been created which cleans up the transition of objects to the "swap" type. The previous checks throughout the code for swp->pg_data != NULL were really ugly. This change also provides the rudiments for future backing of "anonymous" memory by something other than the swap pager (via the vnode pager, for example), and it allows the decision about which of these pagers to use to be made dynamically (although will need some additional decision code to do this, of course).
13) (dyson) MAP_COPY has been deprecated and the corresponding "copy object" code has been removed. MAP_COPY was undocumented and non- standard. It was furthermore broken in several ways which caused its behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will continue to work correctly, but via the slightly different semantics of MAP_PRIVATE.
14) (dyson) Sharing maps have been removed. It's marginal usefulness in a threads design can be worked around in other ways. Both #12 and #13 were done to simplify the code and improve readability and maintain- ability. (As were most all of these changes)
TODO:
1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing this will reduce the vnode pager to a mere fraction of its current size.
2) Rewrite vm_fault and the swap/vnode pagers to use the clustering information provided by the new haspage pager interface. This will substantially reduce the overhead by eliminating a large number of VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be improved to provide both a "behind" and "ahead" indication of contiguousness.
3) Implement the extended features of pager_haspage in swap_pager_haspage(). It currently just says 0 pages ahead/behind.
4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps via a much more general mechanism that could also be used for disk striping of regular filesystems.
5) Do something to improve the architecture of vm_object_collapse(). The fact that it makes calls into the swap pager and knows too much about how the swap pager operates really bothers me. It also doesn't allow for collapsing of non-swap pager objects ("unnamed" objects backed by other pagers).
|
9507 |
13-Jul-1995 |
dg |
NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct proc or any VM system structure will have to be rebuilt!!!
Much needed overhaul of the VM system. Included in this first round of changes:
1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages, haspage, and sync operations are supported. The haspage interface now provides information about clusterability. All pager routines now take struct vm_object's instead of "pagers".
2) Improved data structures. In the previous paradigm, there was constant confusion caused by pagers being both a data structure ("allocate a pager") and a collection of routines. The idea of a pager structure has essentially been eliminated. Objects now have types, and this type is used to index the appropriate pager. In most cases, items in the pager structure were duplicated in the object data structure and thus were unnecessary. In the few cases that remained, an un_pager structure union was created in the object to contain these items.
3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now be removed. For instance, vm_object_enter(), vm_object_lookup(), vm_object_remove(), and the associated object hash list were some of the things that were removed.
4) simple_lock's removed. Discussion with several people revealed that the SMP locking primitives used in the VM system aren't likely the mechanism that we'll be adopting. Even if it were, the locking that was in the code was very inadequate and would have to be mostly re-done anyway. The locking in a uni-processor kernel was a no-op but went a long way toward making the code difficult to read and debug.
5) Places that attempted to kludge-up the fact that we don't have kernel thread support have been fixed to reflect the reality that we are really dealing with processes, not threads. The VM system didn't have complete thread support, so the comments and mis-named routines were just wrong. We now use tsleep and wakeup directly in the lock routines, for instance.
6) Where appropriate, the pagers have been improved, especially in the pager_alloc routines. Most of the pager_allocs have been rewritten and are now faster and easier to maintain.
7) The pagedaemon pageout clustering algorithm has been rewritten and now tries harder to output an even number of pages before and after the requested page. This is sort of the reverse of the ideal pagein algorithm and should provide better overall performance.
8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup have been removed. Some other unnecessary casts have also been removed.
9) Some almost useless debugging code removed.
10) Terminology of shadow objects vs. backing objects straightened out. The fact that the vm_object data structure essentially had this backwards really confused things. The use of "shadow" and "backing object" throughout the code is now internally consistent and correct in the Mach terminology.
11) Several minor bug fixes, including one in the vm daemon that caused 0 RSS objects to not get purged as intended.
12) A "default pager" has now been created which cleans up the transition of objects to the "swap" type. The previous checks throughout the code for swp->pg_data != NULL were really ugly. This change also provides the rudiments for future backing of "anonymous" memory by something other than the swap pager (via the vnode pager, for example), and it allows the decision about which of these pagers to use to be made dynamically (although will need some additional decision code to do this, of course).
13) (dyson) MAP_COPY has been deprecated and the corresponding "copy object" code has been removed. MAP_COPY was undocumented and non-standard. It was furthermore broken in several ways which caused its behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will continue to work correctly, but via the slightly different semantics of MAP_PRIVATE.
14) (dyson) Sharing maps have been removed. Their marginal usefulness in a threads design can be worked around in other ways. Both #12 and #13 were done to simplify the code and improve readability and maintainability. (As were most all of these changes.)
TODO:
1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing this will reduce the vnode pager to a mere fraction of its current size.
2) Rewrite vm_fault and the swap/vnode pagers to use the clustering information provided by the new haspage pager interface. This will substantially reduce the overhead by eliminating a large number of VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be improved to provide both a "behind" and "ahead" indication of contiguousness.
3) Implement the extended features of pager_haspage in swap_pager_haspage(). It currently just says 0 pages ahead/behind.
4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps via a much more general mechanism that could also be used for disk striping of regular filesystems.
5) Do something to improve the architecture of vm_object_collapse(). The fact that it makes calls into the swap pager and knows too much about how the swap pager operates really bothers me. It also doesn't allow for collapsing of non-swap pager objects ("unnamed" objects backed by other pagers).
|
9468 |
10-Jul-1995 |
dg |
swapout_threads() -> swapout_procs().
|
9467 |
10-Jul-1995 |
dg |
Increased global RSS limit to total RAM.
|
9456 |
09-Jul-1995 |
dg |
Moved call to VOP_GETATTR() out of vnode_pager_alloc() and into the places that call vnode_pager_alloc() so that a failure return can be dealt with. This fixes a panic seen on NFS clients when a file being opened is deleted on the server before the open completes.
|
9411 |
06-Jul-1995 |
dg |
Fixed an object allocation race condition that was causing a "object deallocated too many times" panic when using NFS.
Reviewed by: John Dyson
|
9356 |
28-Jun-1995 |
dg |
1) Converted v_vmdata to v_object. 2) Removed unnecessary vm_object_lookup()/pager_cache(object, TRUE) pairs after vnode_pager_alloc() calls - the object is already guaranteed to be persistent. 3) Removed some gratuitous casts.
|
9202 |
11-Jun-1995 |
rgrimes |
Merge RELENG_2_0_5 into HEAD
|
8876 |
30-May-1995 |
rgrimes |
Remove trailing whitespace.
|
8743 |
25-May-1995 |
dg |
Removed check for sw_dev == NODEV; this is a normal condition for swap over NFS, and the check was gratuitously panicking when it happened.
Reviewed by: John Dyson Submitted by: Pierre Beyssac via Poul-Henning Kamp
|
8692 |
21-May-1995 |
dg |
Changes to fix the following bugs:
1) Files weren't properly synced on filesystems other than UFS. In some cases, this led to lost data. Most likely would be noticed on NFS. The fix is to make the VM page sync/object_clean general rather than in each filesystem. 2) Mixing regular and mmaped file I/O on NFS was very broken. It caused chunks of files to end up as zeroes rather than the intended contents. The fix was to fix several race conditions and to kludge up the "b_dirtyoff" and "b_dirtyend" that NFS relies upon, paying attention to page modifications that occurred via the mmapping.
Reviewed by: David Greenman Submitted by: John Dyson
|
8624 |
19-May-1995 |
dg |
NFS diskless operation was broken because swapdev_vp wasn't initialized. These changes solve the problem in a general way by moving the initialization out of the individual fs_mountroot's and into swaponvp().
Submitted by: Poul-Henning Kamp
|
8588 |
18-May-1995 |
dg |
Fixed a bug that managed to slip in during Poul's dynamic swap partition changes. The check for nswap was bogus, but the code was so convoluted that it was difficult to tell. It's better now. :-)
Reviewed by: David Greenman (extensively), and John Dyson Submitted by: Poul-Henning Kamp, w/tweaks by me.
|
8585 |
18-May-1995 |
dg |
Accessing pages beyond the end of a mapped file results in internal inconsistencies in the VM system that eventually lead to a panic. These changes fix the behavior to conform to the behavior in SunOS, which is to deny faults to pages beyond the EOF (returning SIGBUS). Internally, this is implemented by requiring faults to be within the object size boundaries. These changes exposed another bug, namely that passing in an offset to mmap when trying to map an unnamed anonymous region also results in internal inconsistencies. In this case, the offset is forced to zero.
Reviewed by: John Dyson and others
|
8504 |
14-May-1995 |
dg |
Changed swap partition handling/allocation so that it doesn't require specific partitions be mentioned in the kernel config file ("swap on foo" is now obsolete).
From Poul-Henning:
The visible effect is this:
As default, unless options "NSWAPDEV=23" is in your config, you will have four swap-devices. You can swapon(2) any block device you feel like, it doesn't have to be in the kernel config.
There is a performance/resource win available by getting the NSWAPDEV right (but only if you have just one swap-device ??), but using that as default would be too restrictive.
The invisible effect is that:
Swap-handling disappears from the $arch part of the kernel. It gets a lot simpler (-145 lines) and cleaner.
Reviewed by: John Dyson, David Greenman Submitted by: Poul-Henning Kamp, with minor changes by me.
|
8464 |
12-May-1995 |
phk |
I'm about to jump on the swap-initialization, and having talked with davidg about it, I hereby kill two undocumented misfeatures: The code to skip a miniroot in the swapdev is not particularly useful, and if we need it we need it to be done properly, ie size the fs and skip all of it, not some hardcoded size, and subtract what we skip from the length in the first place. The SEQSWAP dies too. It's not the way to do it, it doesn't work, and nobody has expressed any great desire for it to work. The way to implement it correctly would be a second argument to swapon(2) to give priority/policy information. Low priority swapdevs can be made so by adding them at a far offset (0x80000000 kind of thing), with almost no modification to the strategy routine (in particular an offset per swapdev). But until the need is obvious, it will not be done.
|
8416 |
10-May-1995 |
dg |
Changed "handle" from type caddr_t to void *; "handle" is several different types of pointers, and "char *" is a bad choice for the type.
|
8319 |
07-May-1995 |
dyson |
Another error in the correction for trimming swap allocation for small objects. (This code needs to be revisited.)
|
8315 |
07-May-1995 |
dyson |
Fixed a calculation that would once in a while cause the swap_pager to emit spurious "page outside of object" messages. It is not a fatal condition anyway, so the message will be omitted for release. Also fixed the code that "clips" the allocation size, which was associated with the above problem.
|
8216 |
02-May-1995 |
dg |
Changed object hash list to be a list rather than a tailq. This saves space for the hash list buckets and is a little faster. The features of tailq aren't needed. Increased the size of the object hash table to improve performance. In the future, this will be changed so that the table is sized dynamically.
|
8059 |
25-Apr-1995 |
dg |
Fixed a "bswbuf" hang caused by the wakeup in relpbuf() waking up the wrong thing.
|
8010 |
23-Apr-1995 |
bde |
inline -> __inline.
Headers should always use `__inline' for inline functions to avoid syntax errors when modules that don't even use the offending functions are compiled with `gcc -ansi'.
|
7968 |
21-Apr-1995 |
dyson |
Fixed a problem in _vm_object_page_clean that could cause an infinite loop.
|
7935 |
19-Apr-1995 |
dg |
New flag: B_PAGING. Added as part of the vn driver hack.
|
7904 |
17-Apr-1995 |
dg |
Fixed a logic bug that caused the vmdaemon to not wake up when intended.
Submitted by: John Dyson
|
7888 |
16-Apr-1995 |
dg |
Removed obsolete/unused variable declarations. Killed externs and included appropriate include files.
|
7887 |
16-Apr-1995 |
dg |
Removed obsolete/unused variable declarations. Removed some extern declarations and included the correct include files.
|
7883 |
16-Apr-1995 |
dg |
Moved some zero-initialized variables into .bss. Made code intended to be called only from DDB #ifdef DDB. Removed some completely unused globals.
|
7879 |
16-Apr-1995 |
dg |
Removed gratuitous m->blah=0 assignments when initializing the vm_page structs in vm_page_startup(). The vm_page structs are already completely zeroed.
|
7873 |
16-Apr-1995 |
dg |
Make "print_page_info" #ifdef DDB.
|
7870 |
16-Apr-1995 |
dg |
Fixed a few bugs in vm_object_page_clean, mostly related to not syncing pages that are in FS buffers. This fixes the (believed to already have been fixed) problem with msync() not doing its job...in other words, the stuff that Andrew has continuously been complaining about.
Submitted by: John Dyson, w/minor changes by me.
|
7695 |
09-Apr-1995 |
dg |
Changes from John Dyson and myself:
Fixed remaining known bugs in the buffer IO and VM system.
vfs_bio.c: Fixed some race conditions and locking bugs. Improved performance by removing some (now) unnecessary code and fixing some broken logic. Fixed process accounting of # of FS outputs. Properly handle NFS interrupts (B_EINTR).
(various) Replaced calls to clrbuf() with calls to an optimized routine called vfs_bio_clrbuf().
(various FS sync) Sync out modified vnode_pager backed pages.
ffs_vnops.c: Do two passes: Sync out file data first, then indirect blocks.
vm_fault.c: Fixed deadly embrace caused by acquiring locks in the wrong order.
vnode_pager.c: Changed to use buffer I/O system for writing out modified pages. This should fix the problem with the modification date previously not getting updated. Also dramatically simplifies the code. Note that this is going to change in the future and be implemented via VOP_PUTPAGES().
vm_object.c: Fixed a pile of bugs related to cleaning (vnode) objects. The performance of vm_object_page_clean() is terrible when dealing with huge objects, but this will change when we implement a binary tree to keep the object pages sorted.
vm_pageout.c: Fixed broken clustering of pageouts. Fixed race conditions and other lockup style bugs in the scanning of pages. Improved performance.
|
7430 |
28-Mar-1995 |
bde |
Add and move declarations to fix all of the warnings from `gcc -Wimplicit' (except in netccitt, netiso and netns) that I didn't notice when I fixed "all" such warnings before.
|
7427 |
28-Mar-1995 |
dg |
Fixed typo...using wrong variable in page_shortage calculation.
|
7424 |
28-Mar-1995 |
dg |
Fixed "pages freed by daemon" statistic (again).
|
7411 |
27-Mar-1995 |
dg |
Explicitly set page dirty if this is a write fault - reduces calls to pmap_is_modified() later.
|
7400 |
26-Mar-1995 |
dg |
Removed some obsolete flags.
Submitted by: John Dyson
|
7366 |
25-Mar-1995 |
dg |
Fix logic bug I just introduced with the flags to msync().
|
7365 |
25-Mar-1995 |
dg |
Pass syncio flag to vm_object_clean(). It remains unimplemented, however.
|
7364 |
25-Mar-1995 |
dg |
Disallow both MS_ASYNC and MS_INVALIDATE flags being set at the same time in msync().
|
7360 |
25-Mar-1995 |
dg |
Added "flags" argument to msync, and implemented MS_ASYNC and MS_INVALIDATE. The MS_ASYNC flag doesn't currently work, and MS_INVALIDATE will only toss out the pages in the address space (not all pages in the shadow chain).
|
7352 |
25-Mar-1995 |
dg |
Implemented cnt.v_reactivated and moved vm_page_activate() routine to before vm_page_deactivate().
|
7350 |
25-Mar-1995 |
dg |
Removed (almost) meaningless "object cache lookups/hits" statistic. In our framework, these numbers will usually be nearly the same, and not because of any sort of high 'hit rate'.
|
7346 |
25-Mar-1995 |
dg |
Removed cnt.v_nzfod: In our current scheme of things it is not possible to accurately track this. It isn't an indicator of resource consumption anyway. Removed cnt.v_kernel_pages: We don't implement this and doing so accurately would be very difficult (and ambiguous - since process pages are often double mapped in the kernel and the process address spaces).
|
7263 |
23-Mar-1995 |
dg |
Fixed warning caused by returning a value in a void function (introduced in a recent commit by me). Relaxed checks before calling vm_object_remove; a non-internal object always has a pager.
|
7246 |
22-Mar-1995 |
dg |
Removed unused fifth argument to vm_object_page_clean(). Fixed bug with VTEXT not always getting cleared when it is supposed to. Added check to make sure that vm_object_remove() isn't called with a NULL pager or for a pager for an OBJ_INTERNAL object (neither of which will be on the hash list). Clear OBJ_CANPERSIST if we decide to terminate it because of no resident pages.
|
7243 |
22-Mar-1995 |
dg |
Fixed potential sleep/wakeup race condition with splhigh().
Submitted by: John Dyson
|
7240 |
22-Mar-1995 |
dg |
Added a check for wrong object size; print a warning, but deal with it correctly. The warning will tell us that there is a bug somewhere else in sizing the object correctly.
Submitted by: John Dyson
|
7239 |
22-Mar-1995 |
dg |
Fixed bug in vm_mmap() where the object that is created in some cases was the wrong size. This is the likely cause of panics reported by Lars Fredriksen and Paul Richards related to a -1 blkno when paging via the swap_pager.
Submitted by: John Dyson
|
7236 |
21-Mar-1995 |
dg |
Removed unused variable declaration missed in previous commit.
|
7235 |
21-Mar-1995 |
dg |
Removed do-nothing VOP_UPDATE() call.
|
7215 |
21-Mar-1995 |
dg |
Disallow non page-aligned file offsets in vm_mmap(). We don't support this in either the high or low level parts of the VM system. Just return EINVAL in this case, just like SunOS does.
|
7209 |
21-Mar-1995 |
dg |
Fixed bug in the size == 0 case of msync() caused by a bogus return value check..
|
7204 |
21-Mar-1995 |
dg |
Added a new boolean argument to vm_object_page_clean that causes it to only toss out clean pages if TRUE.
|
7187 |
20-Mar-1995 |
dg |
Don't gain/lose an object reference in vnode_pager_setsize(). It will cause vnode locking problems in vm_object_terminate(). Implement proper vnode locking in vm_object_terminate().
|
7185 |
20-Mar-1995 |
dg |
Fixed "objde1" hang. It was caused by a "&" where an "&&" belonged in the expression that decides if a wakeup should occur.
|
7180 |
20-Mar-1995 |
dg |
Removed an unnecessary call to vinvalbuf after the page clean.
|
7178 |
19-Mar-1995 |
dg |
Do proper vnode locking when doing paging I/O. Removed the asynchronous paging capability to facilitate this (we saw little or no measurable improvement with this anyway).
Submitted by: John Dyson
|
7170 |
19-Mar-1995 |
dg |
Removed redundant newlines that were in some panic strings.
|
7162 |
19-Mar-1995 |
dg |
Incorporated 4.4-lite vnode_pager_uncache() and vnode_pager_umount() routines (and merged local changes). The changed vnode_pager_uncache gets rid of the bogosity that you can call the routine without having the vnode locked. The changed vnode_pager_umount properly locks the vnode before calling vnode_pager_uncache.
|
7120 |
18-Mar-1995 |
dg |
In vm_page_alloc_contig: Removed a redundant semicolon and used 'm' instead of &pga[i] in one place.
|
7090 |
16-Mar-1995 |
bde |
Add and move declarations to fix all of the warnings from `gcc -Wimplicit' (except in netccitt, netiso and netns) and most of the warnings from `gcc -Wnested-externs'. Fix all the bugs found. There were no serious ones.
|
7066 |
15-Mar-1995 |
dg |
Special cased the handling of mb_map in the M_WAITOK case. kmem_malloc() now returns NULL and sets a global 'mb_map_full' when the map is full. m_clalloc() has further been taught to expect this and do the right thing. This should fix the "mb_map full" panics that several people have reported.
|
7029 |
12-Mar-1995 |
bde |
Move a kernel inline function inside `#ifdef KERNEL' so that including <vm/vm.h> doesn't cause warnings about nonexistent functions called by the inline function. Clean up the formatting of the function.
|
7017 |
12-Mar-1995 |
dg |
Fixed obsolete comment.
|
7016 |
12-Mar-1995 |
dg |
Deleted vm_object_setpager().
|
7015 |
12-Mar-1995 |
dg |
Deleted vm_object_setpager().
|
7014 |
12-Mar-1995 |
dg |
Explicitly set object->flags = OBJ_CANPERSIST.
|
7008 |
11-Mar-1995 |
dg |
Fix completely bogus comment.
|
7007 |
11-Mar-1995 |
dg |
Clear OBJ_INTERNAL flag for device pager objects and named anonymous objects.
|
6947 |
07-Mar-1995 |
dg |
Set VAGE flag when pager is destroyed. This usually happens when an object has fallen off the end of the cached list - this is likely the last reference to the vnode and it should be reused before non file vnodes that are already on the free list (VDIR mostly).
|
6944 |
07-Mar-1995 |
dg |
Fixed object reference count problem that occurred in the MAP_PRIVATE case after we rewrote vm_mmap(). Added some comments to make it easier to follow the reference counts.
|
6943 |
07-Mar-1995 |
dg |
Don't attempt to reverse collapse non OBJ_INTERNAL objects.
|
6897 |
04-Mar-1995 |
jkh |
Remove a gratuitous cast.
|
6816 |
01-Mar-1995 |
dg |
Various changes from John and myself that do the following:
New functions created: vm_object_pip_wakeup and pagedaemon_wakeup, which are used to reduce the actual number of wakeups. New function vm_page_protect, which is used in conjunction with some new page flags to reduce the number of calls to pmap_page_protect. Minor changes to reduce unnecessary spl nesting. Rewrote vm_page_alloc() to improve readability. Various other mostly cosmetic changes.
|
6806 |
01-Mar-1995 |
dg |
Slight change to include file order to accommodate upcoming changes.
|
6709 |
25-Feb-1995 |
bde |
Don't use __P(()) in a function definition.
|
6703 |
25-Feb-1995 |
dg |
Fixed severely broken printf (arguments out of order, no newline).
|
6673 |
23-Feb-1995 |
dg |
Removed redundant HOLDRELE()'s.
|
6626 |
22-Feb-1995 |
dg |
Changed return value from vnode_pager_addr to be in DEV_BSIZE units so that 9 bits aren't lost in the conversion. Changed all callers to expect this. This allows paging on large (>2GB) filesystems.
Submitted by: John Dyson
|
6625 |
22-Feb-1995 |
dg |
vm_page.c: Use request==VM_ALLOC_NORMAL rather than object!=kmem_object in deciding if the caller is "important" in vm_page_alloc(). Also established a new low threshold for non-interrupt allocations via cnt.v_interrupt_free_min.
vm_pageout.c: Various algorithmic cleanup. Some calculations simplified. Initialize cnt.v_interrupt_free_min to 2 pages.
Submitted by: John Dyson
|
6624 |
22-Feb-1995 |
dg |
Just return in the case of a page not on any queue in vm_page_unqueue(). Return VM_PAGE_BITS_ALL even if size > PAGE_SIZE in vm_page_bits().
Submitted by: John Dyson
|
6623 |
22-Feb-1995 |
dg |
Removed object locking code (it was a left over from an abortion that was done a month or so ago).
Submitted by: John Dyson
|
6622 |
22-Feb-1995 |
dg |
Removed bogus copy object collapse check (the idea is right, but the specific check was bogus). Removed old copy of vm_object_page_clean and took out the #if 1 around the remaining one.
Submitted by: John Dyson
|
6618 |
22-Feb-1995 |
dg |
Only do object paging_in_progress wakeups if someone is waiting on this condition.
Submitted by: John Dyson
|
6617 |
22-Feb-1995 |
dg |
Rewrote MAP_PRIVATE case of vm_mmap() - all of the COW portion of this routine was highly convoluted.
Submitted by: John Dyson
|
6601 |
21-Feb-1995 |
dg |
Panic if u_map allocation fails.
|
6587 |
21-Feb-1995 |
dg |
vm_extern.h: removed vm_allocate_with_pager. Removed vm_user.c...it's now completely deprecated.
|
6585 |
21-Feb-1995 |
dg |
Deprecated remaining use of vm_deallocate. Deprecated vm_allocate_with_ pager(). Almost completely rewrote vm_mmap(); when John gets done with the bottom half, it will be a complete rewrite. Deprecated most use of vm_object_setpager(). Removed side effect of setting object persist in vm_object_enter and moved this into the pager(s). A few other cosmetic changes.
|
6584 |
21-Feb-1995 |
dg |
Set page alloced for map entries as valid.
|
6582 |
20-Feb-1995 |
dg |
Removed vm_allocate(), vm_deallocate(), and vm_protect() functions. The only function remaining in this file is vm_allocate_with_pager(), and this will be going RSN. The file will be removed when this happens.
|
6580 |
20-Feb-1995 |
dg |
Moved ACT_MAX, ACT_ADVANCE, and ACT_DECLINE to vm_page.h.
|
6573 |
20-Feb-1995 |
dg |
vm_inherit function has been deprecated.
|
6572 |
20-Feb-1995 |
dg |
Stop using vm_allocate and vm_deallocate.
|
6571 |
20-Feb-1995 |
dg |
VM for the kernel stack and page tables doesn't need to be explicitly deallocated as it isn't inherited across the fork. Use vm_map_find not vm_allocate.
Submitted by: John Dyson
|
6567 |
20-Feb-1995 |
dg |
Panic if object is deallocated too many times. Slight change to reverse collapsing so that vm_object_deallocate doesn't have to be called recursively. Removed half of a previous fix - the renamed page during a collapse doesn't need to be marked dirty because the pager backing store pointers are copied - thus preserving the page's data. This assumes that pages without backing store are always dirty (except perhaps for when they are first zeroed, but this doesn't matter). Switch order of two lines of code so that the correct pager is removed from the hash list. The previous code bogusly passed a NULL pointer to vm_object_remove(). The call to vm_object_remove() should be unnecessary if named anonymous objects were being dealt with correctly. They are currently marked as OBJ_INTERNAL, which really screws up things (such as this).
|
6566 |
20-Feb-1995 |
dg |
Don't allow act_count to exceed ACT_MAX when bumping it up. Small optimization to vm_page_bits().
Submitted by: John Dyson
|
6565 |
20-Feb-1995 |
dg |
Fully initialize pages returned via vm_page_alloc_contig() so that the memory can be later freed.
|
6541 |
18-Feb-1995 |
dg |
1) Added protection against collapsing OBJ_DEAD objects. 2) bump reference counts by 2 instead of 1 so that an object deallocate doesn't try to recursively collapse the object. 3) mark pages renamed during the collapse as dirty so that their contents are preserved.
Submitted by: John and me.
|
6435 |
15-Feb-1995 |
dg |
Don't bother calling pmap_create() when creating the temporary map. The whole COW section of vm_mmap() should be rewritten; the current implementation is highly convoluted.
|
6357 |
14-Feb-1995 |
phk |
YF fix.
|
6356 |
14-Feb-1995 |
phk |
YF Fix.
|
6351 |
14-Feb-1995 |
dg |
Fixed problem with msync causing a panic.
Submitted by: John Dyson
|
6326 |
12-Feb-1995 |
dg |
Carefully choose the value for vm_object_cache_max. The previous calculation was rather bogus in most cases; the new value works very well for both large and small memory machines.
|
6278 |
09-Feb-1995 |
dg |
Killed MACHVMCOMPAT function prototypes as the functions don't exist.
|
6277 |
09-Feb-1995 |
dg |
Killed MACHVMCOMPAT code. It doesn't compile, and in its present state would require some work to make it not a serious security problem. It's non-standard and not very useful anyway.
|
6258 |
09-Feb-1995 |
dg |
Minor algorithmic adjustments that cut the CPU consumption of the pagedaemon in half while not reducing its effectiveness.
Submitted by: me & John
|
6151 |
03-Feb-1995 |
dg |
Fixed bmap run-length brokenness. Use bmap run-length extension when doing clustered paging.
Submitted by: John Dyson
|
6129 |
02-Feb-1995 |
dg |
swap_pager.c: Fixed long standing bug in freeing swap space during object collapses. Fixed 'out of space' messages from printing out too often. Modified to use new kmem_malloc() calling convention. Implemented an additional stat in the swap pager struct to count the amount of space allocated to that pager. This may be removed at some point in the future. Minimized unnecessary wakeups.
vm_fault.c: Don't try to collect fault stats on 'swapped' processes - there aren't any upages to store the stats in. Changed read-ahead policy (again!).
vm_glue.c: Be sure to gain a reference to the process's map before swapping. Be sure to lose it when done.
kern_malloc.c: Added the ability to specify if allocations are at interrupt time or are 'safe'; this affects what types of pages can be allocated.
vm_map.c: Fixed a variety of map lock problems; there's still a lurking bug that will eventually bite.
vm_object.c: Explicitly initialize the object fields rather than bzeroing the struct. Eliminated the 'rcollapse' code and folded its functionality into the "real" collapse routine. Moved an object_unlock() so that the backing_object is protected in the qcollapse routine. Make sure nobody fools with the backing_object when we're destroying it. Added some diagnostic code which can be called from the debugger that looks through all the internal objects and makes certain that they all belong to someone.
vm_page.c: Fixed a rather serious logic bug that would result in random system crashes. Changed pagedaemon wakeup policy (again!).
vm_pageout.c: Removed unnecessary page rotations on the inactive queue. Changed the number of pages to explicitly free to just free_reserved level.
Submitted by: John Dyson
|
5973 |
28-Jan-1995 |
dg |
Completed the fix for attempting to page out pages via the device_pager.
Submitted by: John Dyson
|
5915 |
26-Jan-1995 |
dg |
Use the VM_PAGE_BITS_ALL in a place it can be used. Comment out call to pmap_prefault() until stability problems can be thoroughly analyzed.
|
5903 |
25-Jan-1995 |
dg |
Don't attempt to clean device_pager backed objects at terminate time. There is similar bogusness in the pageout daemon that will be fixed soon. This fixes a panic pointed out to me by Bruce Evans that occurs when /dev/mem is used to map managed memory.
|
5841 |
24-Jan-1995 |
dg |
Added ability to detect sequential faults and DTRT. (swap_pager.c) Added hook for pmap_prefault() and use symbolic constant for new third argument to vm_page_alloc() (vm_fault.c, various) Changed the way that upages and page tables are held. (vm_glue.c) Fixed architectural flaw in allocating pages at interrupt time that was introduced with the merged cache changes. (vm_page.c, various) Adjusted some algorithms to achieve better paging performance and to accommodate the fix for the architectural flaw mentioned above. (vm_pageout.c) Fixed pbuf handling problem, changed policy on handling read-behind page. (vnode_pager.c)
Submitted by: John Dyson
|
5636 |
15-Jan-1995 |
dg |
Moved some splx's down a few lines in vm_page_insert and vm_page_remove to make the locking a bit more clear - this change is currently a NOP as the calls to those routines are already at splhigh().
|
5571 |
13-Jan-1995 |
dg |
Protect a qcollapse call with an object lock before calling. The locks need to be moved into the qcollapse and rcollapse routines, but I don't have time at the moment to make all the required changes...this will do for now.
|
5520 |
11-Jan-1995 |
dg |
Improve my previous change to use the same tests as are used in qcollapse.
|
5519 |
11-Jan-1995 |
dg |
Fixed a panic that Garrett reported to me...the OBJ_INTERNAL flag wasn't being cleared in some cases for vnode backed objects; we now do this in vnode_pager_alloc proper to guarantee it. Also be more careful in the rcollapse code about messing with busy/bmapped pages.
|
5465 |
10-Jan-1995 |
dg |
Kill VM_PAGE_INIT macro as it is only used once and makes the code more difficult to understand. Got rid of unused vm_page flags.
|
5464 |
10-Jan-1995 |
dg |
Fixed some formatting weirdness that I overlooked in the previous commit.
|
5455 |
09-Jan-1995 |
dg |
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. They represent the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme.
vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering.
vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption.
vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of kernel PTs.
vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore.
machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme.
machdep.c, kern_exec.c, vm_glue.c Implemented a separate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers.
Submitted by: John Dyson and David Greenman
|
5404 |
05-Jan-1995 |
dg |
Make sure that the object being collapsed doesn't go away on us...by gaining extra references to it.
Submitted by: John Dyson
|
5348 |
02-Jan-1995 |
ats |
Just a missing newline in a kernel printf added.
Submitted by: Ben Jackson
|
5283 |
30-Dec-1994 |
bde |
Clean up previous commits (format for 80 columns...).
|
5203 |
23-Dec-1994 |
dg |
Do vm_page_rename more conservatively in rcollapse and qcollapse, and change list walk so that it doesn't get stuck in an infinite loop.
Submitted by: John Dyson
|
5202 |
23-Dec-1994 |
dg |
Initialize b_vnbuf.le_next before returning a new buffer in getpbuf and trypbuf. Move a couple of splbio's to be slightly less conservative.
|
5186 |
22-Dec-1994 |
dg |
Fixed a benign off by one error.
|
5166 |
19-Dec-1994 |
dg |
Don't ever clear B_BUSY on a pbuf (or any other flag for that matter). This appears to be the cause of some buffer confusion that leads to a panic during heavy paging.
Submitted by: John Dyson
|
5151 |
18-Dec-1994 |
dg |
Fixed multiple bogons with the map entry handling.
|
5146 |
18-Dec-1994 |
dg |
Fixed bug where statically allocated map entries might be freed to the malloc pool...causing a panic.
Submitted by: John Dyson
|
5145 |
18-Dec-1994 |
dg |
Change swapping policy to be a bit more aggressive about finding a candidate for swapout. Increased default RSS limit to a minimum of 2MB.
|
5114 |
15-Dec-1994 |
dg |
Protect kmem_map modifications with splhigh() to work around a problem with the map being locked at interrupt time.
|
5033 |
11-Dec-1994 |
dg |
Don't put objects that have no parent on the reverse_shadow_list. Problem identified and explained by Gene Stark (thanks Gene!).
Submitted by: John Dyson
|
4810 |
25-Nov-1994 |
dg |
These changes fix a couple of lingering VM problems:
1. The pageout daemon used to block under certain circumstances, and we needed to add new functionality that would cause the pageout daemon to block more often. Now, the pageout daemon mostly just gets rid of pages and kills processes when the system is out of swap. The swapping, rss limiting and object cache trimming have been folded into a new daemon called "vmdaemon". This new daemon does things that need to be done for the VM system, but can block. For example, if the vmdaemon blocks for memory, the pageout daemon can take care of it. If the pageout daemon had blocked for memory, it was difficult to handle the situation correctly (and in some cases, was impossible).
2. The collapse problem has now been entirely fixed. It now appears to be impossible to accumulate unnecessary vm objects. The object collapsing now occurs when ref counts drop to one (where it is more likely to be more simple anyway because less pages would be out on disk.) The original fixes were incomplete in that pathological circumstances could still be contrived to cause uncontrolled growth of swap. Also, the old code still, under steady state conditions, used more swap space than necessary. When using the new code, users will generally notice a significant decrease in swap space usage, and theoretically, the system should be leaving fewer unused pages around competing for memory.
Submitted by: John Dyson
|
4797 |
24-Nov-1994 |
dg |
Don't try to page to a vnode that had its filesystem unmounted.
|
4768 |
22-Nov-1994 |
dg |
Preallocate the first swap block to work around a failure with swap starting at physical block 0. Note that this will show up in pstat -s and swapinfo as space "in use". In reality, the space is simply never made available.
|
4537 |
17-Nov-1994 |
dg |
Don't ever try to kill off process 1 - even if we are out of swap space and it's the candidate pig.
|
4534 |
17-Nov-1994 |
gibbs |
Remove a piece of commented out code that was left over from the early stages of debugging LFS:
  /*
   * if we can't bmap, use old VOP code
   */
! if (/* (vp->v_mount && vp->v_mount->mnt_stat.f_type == MOUNT_LFS) || */
!     VOP_BMAP(vp, foff, &dp, 0, 0)) {
        for (i = 0; i < count; i++) {
            if (i != reqpage) {
                vnode_pager_freepage(m[i]);
--- 804,810 ----
  /*
   * if we can't bmap, use old VOP code
   */
! if (VOP_BMAP(vp, foff, &dp, 0, 0)) {
Reviewed by: gibbs Submitted by: John Dyson
|
4461 |
14-Nov-1994 |
bde |
pmap.h: Disable the bogus declaration of pmap_bootstrap(). Since its arg list is machine-dependent, it must be declared in a machine-dependent header.
vm_page.h: Change `inline' to `__inline' and old-style function parameter lists for inlined functions to new-style.
`inline' and old-style function parameter lists should never be used in system headers, even in very machine-dependent ones, because they cause warnings from gcc -Wreally-all.
|
4447 |
14-Nov-1994 |
dg |
Set laundry flag when transitioning an inactive page from clean to dirty. This fixes a performance bug where pages would sometimes not be paged out when they could be.
Submitted by: John Dyson
|
4446 |
13-Nov-1994 |
dg |
Fixed bug where a read-behind to a negative offset would occur if the fault was at offset 0 in the object. This resulted in more overhead but was otherwise benign. Added incore() check in vnode_pager_has_page() to work around a problem with LFS...other than slightly higher overhead, this change has no effect on UFS.
|
4440 |
13-Nov-1994 |
dg |
Fixed bugs in accounting of swap space that resulted in the pager thinking it was out of space when it really wasn't.
Submitted by: John Dyson
|
4439 |
13-Nov-1994 |
dg |
Implemented swap locking via P_SWAPPING flag. It was possible for a process to be chosen for swap-in while it was being swapped-out. This was BAD.
Submitted by: John Dyson
|
4207 |
06-Nov-1994 |
dg |
Fixed return status from pagers. Ahem...the previous method would manufacture data when it couldn't get it legitimately. :-(
Submitted by: John Dyson
|
4203 |
06-Nov-1994 |
dg |
Added support for starting the experimental "vmdaemon" system process. Enabled via REL2_1.
Added support for doing object collapses "on the fly". Enabled via REL2_1a.
Improved object collapses so that they can happen in more cases. Improved sensing of modified pages to fix an apparent race condition and improve clustered pageout opportunities. Fixed an "oops" with not restarting page scan after a potential block in vm_pageout_clean() (not doing this can result in strange behavior in some cases).
Submitted by: John Dyson & David Greenman
|
3841 |
25-Oct-1994 |
dg |
Improved I/O error reporting.
|
3839 |
25-Oct-1994 |
dg |
#if 0'd out the object cache trimming code - there are multiple ways that the pageout daemon can deadlock otherwise.
Submitted by: John Dyson
|
3815 |
23-Oct-1994 |
dg |
Fixed object cache trimming policy so it actually works.
Submitted by: John Dyson
|
3814 |
23-Oct-1994 |
dg |
Adjusted reserved levels to fix a deadlock condition.
Submitted by: John Dyson
|
3807 |
23-Oct-1994 |
dg |
Changed a thread_sleep into an spl protected tsleep. A deadlock can occur otherwise. Minor efficiency improvement in vm_page_free().
Submitted by: John Dyson
|
3798 |
22-Oct-1994 |
phk |
Contrary to my last commit here: NFS-swap is enabled automatically.
|
3772 |
22-Oct-1994 |
dg |
Fixed a comment from the previous commit.
|
3766 |
22-Oct-1994 |
dg |
Various changes to allow operation without any swapspace configured. Note that this is intended for use only in floppy situations and is done at the sacrifice of performance in that case (in other words, this is not the best solution, but works okay for this exceptional situation).
Submitted by: John Dyson
|
3748 |
21-Oct-1994 |
phk |
ATTENTION!
From now on, >all< swapdevices must be activated with "swapon".
If you haven't got it, add this line to /etc/fstab: /dev/wd0b none swap sw 0 0
Reason: We want our GENERIC* kernels to have a large selection of swap-devices, but on the other hand, we don't want to use a wd0b as swap when we boot off a floppy. This way, we will never use an unexpected swapdevice. Nothing else has changed.
|
3745 |
21-Oct-1994 |
wollman |
Make my ALLDEVS kernel compile (basically, LINT minus a lot of options).
This involves fixing a few things I broke last time.
|
3692 |
18-Oct-1994 |
dg |
Fix the remaining vmmeter counters. They all now work correctly.
|
3660 |
17-Oct-1994 |
dg |
Put sanity check for negative hold count into #ifdef DIAGNOSTIC so that it doesn't consume an extra 3k of kernel text because of gcc's bogus inlining code.
|
3612 |
15-Oct-1994 |
dg |
1) Some of the counters in the vmmeter struct don't fit well into the Mach VM scheme of things, so I've changed them to be more appropriate. Page in/outs are now associated with the pager that did them. Nuked v_fault as the only fault of interest that wouldn't be already counted in v_trap is a VM fault, and this is counted separately. 2) Implemented most of the remaining counters and corrected the counting of some that were done wrong. They are all almost correct now...just a few minor ones left to fix.
|
3611 |
15-Oct-1994 |
dg |
Count vm faults as v_vm_fault, not v_fault.
|
3610 |
15-Oct-1994 |
dg |
Properly count object lookups and hits.
|
3591 |
14-Oct-1994 |
dg |
Got rid of redundant declaration warnings.
|
3587 |
14-Oct-1994 |
jkh |
Add missing )'s to previous midnight changes. :-)
|
3573 |
14-Oct-1994 |
dg |
Fixed bug where page modifications would be lost when swap space was almost depleted.
Reviewed by: John Dyson
|
3572 |
14-Oct-1994 |
dg |
Changed I/O error messages to be somewhat less cryptic. Removed a piece of unused code.
|
3567 |
13-Oct-1994 |
dg |
Fixed an object reference count problem that was caused by a call to vm_object_lookup() being outside of some parens. The bug was introduced via some recently added code.
Reviewed by: John Dyson
|
3451 |
09-Oct-1994 |
dg |
Got rid of map.h. It's a leftover from the rmap code, and we use rlists. Changed swapmap into swaplist.
|
3449 |
09-Oct-1994 |
phk |
Cosmetics: unused vars, ()'s, #include's &c &c to silence gcc. Reviewed by: davidg
|
3446 |
09-Oct-1994 |
dg |
Call resetpriority, not setpriority() ...oops.
Submitted by: John Dyson
|
3407 |
07-Oct-1994 |
phk |
Cosmetics. Unused vars and other warnings.
|
3374 |
05-Oct-1994 |
dg |
Stuff object into v_vmdata rather than pager. Not important which at the moment, but will be in the future. Other changes mostly cosmetic, but are made for future VMIO considerations.
Submitted by: John Dyson
|
3373 |
05-Oct-1994 |
dg |
Fixed minor bug caused by some missing parens that can result in slightly reduced paging performance by missing a clustering opportunity. Found by Poul-Henning Kamp with gcc -Wall.
|
3354 |
04-Oct-1994 |
dg |
John Dyson's work in progress. Not currently used.
|
3347 |
04-Oct-1994 |
dg |
Fixed bug related to proper sensing of page modification that we inadvertently introduced in pre-1.1.5. This could cause page modifications to go unnoticed during certain extreme low memory/high paging rate conditions.
Submitted by: John Dyson and David Greenman
|
3311 |
02-Oct-1994 |
phk |
GCC cleanup.
|
3154 |
27-Sep-1994 |
dg |
Previous commit should have read ...in vm_page_alloc_contig(). ...(this commit): moved initialization of 'start' to make it more clear that it is initialized properly (also in vm_page_alloc_contig).
|
3153 |
27-Sep-1994 |
dg |
Fixed another bug, and cleaned up the code.
|
3147 |
27-Sep-1994 |
dg |
Fixed multiple bugs in previous version of vm_page_alloc_contig.
|
3145 |
27-Sep-1994 |
dg |
1) New "vm_page_alloc_contig" routine by me. 2) Created a new vm_page flag "PG_FREE" to help track free pages. 3) Use PG_FREE flag to detect inconsistencies in a few places.
|
3103 |
25-Sep-1994 |
dg |
Removed unimplemented subr_rmap.c and unused references to it.
|
3083 |
25-Sep-1994 |
dg |
Disabled swap anti-fragmentation code. It reduces swap paging performance by 20% in my tests, and it appears to be the cause of a swap leak.
Submitted by: John Dyson
|
2692 |
12-Sep-1994 |
dg |
Fixed a bug I introduced when fixing the rss limit code. Changed swapout policy to be a bit more selective about what processes get swapped out.
Reviewed by: John Dyson
|
2689 |
12-Sep-1994 |
dg |
Eliminated a whole pile of ancient (we're talking 4.3BSD) VM system related #define constants. Corrected incorrect VM_MAX_KERNEL_ADDRESS.
Reviewed by: John Dyson
|
2688 |
12-Sep-1994 |
dg |
Don't deactivate pages in 0-refcount objects. Added a couple of missing paging stats. Fixed problem with free_reserved becoming depleted during certain swap_pager operations.
Submitted by: John Dyson, with a little help from me
|
2654 |
11-Sep-1994 |
dg |
Fixed a problem where having no swap on the boot device, but some on an alternate device (as specified via the kernel config file), causes the machine to panic.
|
2524 |
06-Sep-1994 |
dg |
Disabled a debugging printf.
|
2521 |
06-Sep-1994 |
dg |
Simple changes to paging algorithms...but boy do they make a difference. FreeBSD's paging performance has never been better. Wow.
Submitted by: John Dyson
|
2462 |
02-Sep-1994 |
dg |
Whoops, accidentally left out some pieces of the munmapfd patch.
|
2455 |
02-Sep-1994 |
dg |
Removed all vestiges of tlbflush(). Replaced them with calls to pmap_update(). Made pmap_update an inline assembly function.
|
2413 |
30-Aug-1994 |
dg |
Fixed bug caused by change of rlimit variables to quad_t's. The bug was in using min() to calculate the minimum of rss_cur,rss_max - since these are now quad_t's and min() takes u_ints...the comparison later for exceeding the rss limit was always true - resulting in rather serious page thrashing. Now using new qmin() function for this purpose.
Fixed another bug where PG_BUSY pages would sometimes be paged out (bad!). This was caused by the PG_BUSY flag not being included in a comparison.
|
2386 |
29-Aug-1994 |
dg |
Patches from John Dyson to improve swap code efficiency. Religiously add back pmap_clear_modify() in vnode_pager_input until we figure out why system performance isn't what we expect.
Submitted by: John Dyson (swap_pager) & David Greenman (vnode_pager)
|
2320 |
27-Aug-1994 |
dg |
1) Changed ddb into an option rather than a pseudo-device (use options DDB in your kernel config now). 2) Added ps ddb function from 1.1.5. Cleaned it up a bit and moved into its own file. 3) Added \r handling in db_printf. 4) Added missing memory usage stats to statclock(). 5) Added dummy function to pseudo_set so it will be emitted if there are no other pseudo declarations.
|
2177 |
21-Aug-1994 |
paul |
Made idempotent.
|
2112 |
18-Aug-1994 |
wollman |
Fix up some sloppy coding practices:
- Delete redundant declarations. - Add -Wredundant-declarations to Makefile.i386 so they don't come back. - Delete sloppy COMMON-style declarations of uninitialized data in header files. - Add a few prototypes. - Clean up warnings resulting from the above.
NB: ioconf.c will still generate a redundant-declaration warning, which is unavoidable unless somebody volunteers to make `config' smarter.
|
1997 |
10-Aug-1994 |
dg |
Fixed vm_page_deactivate to deal with getting called with a page that's not on any queue. This is an old patchkit days fix.
Reviewed by: John Dyson and David Greenman Submitted by: originally by Paul Mackerras
|
1974 |
09-Aug-1994 |
dg |
Removed an old, obsolete call to vmmeter(). This is called now in the schedcpu() routine in kern/kern_synch.c. This extra call to vmmeter() in vm_glue.c was what was totally messing up the load average calculations.
|
1896 |
07-Aug-1994 |
dg |
Made pmap_kenter "TLB safe". ...and then removed all the pmap_updates that are no longer needed because of this.
|
1895 |
07-Aug-1994 |
dg |
Provide support for upcoming merged VM/buffer cache, and fixed a few bugs that haven't appeared to manifest themselves (yet).
Submitted by: John Dyson
|
1890 |
06-Aug-1994 |
dg |
Fixed various prototype problems with the pmap functions and the subsequent problems that fixing them caused.
|
1887 |
06-Aug-1994 |
dg |
Incorporated post 1.1.5 work from John Dyson. This includes performance improvements via the new routines pmap_qenter/pmap_qremove and pmap_kenter/pmap_kremove. These routines allow fast mapping of pages for those architectures that have "normal" MMUs. Also included is a fix to the pageout daemon to properly check a queue end condition.
Submitted by: John Dyson
|
1885 |
06-Aug-1994 |
dg |
Enabled page table preloading of cached objects.
Submitted by: John Dyson
|
1835 |
04-Aug-1994 |
dg |
Added some code that was accidentally left out early in the 1.x -> 2.0 VM system conversion. Submitted by: John Dyson
|
1827 |
04-Aug-1994 |
dg |
Integrated VM system improvements/fixes from FreeBSD-1.1.5.
|
1817 |
02-Aug-1994 |
dg |
Added $Id$
|
1810 |
01-Aug-1994 |
dg |
Removed all code related to the pagescan daemon, and changed 'act_count' adjustments to compensate for a world without the pagescan daemon.
|
1687 |
06-Jun-1994 |
dg |
Don't move the page's position in the active queue if it is busy or held. John has noticed some stability problems when doing this.
|
1549 |
25-May-1994 |
rgrimes |
The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.
Reviewed by: Rodney W. Grimes Submitted by: John Dyson and David Greenman
|
1542 |
24-May-1994 |
rgrimes |
This commit was generated by cvs2svn to compensate for changes in r1541, which included commits to RCS files with non-trunk default branches.
|