History log of /freebsd-11-stable/sys/vm/vm_object.h
Revision Date Author Comments
# 331722 29-Mar-2018 eadler

Revert r330897:

This was intended to be a non-functional change. It wasn't. The commit
message was thus wrong. In addition it broke arm, and merged crypto
related code.

Revert with prejudice.

This revert skips files touched in r316370 since that commit was since
MFCed. This revert also skips files that require $FreeBSD$ property
changes.

Thank you to those who helped me get out of this mess including but not
limited to gonzo, kevans, rgrimes.

Requested by: gjb (re)


# 330897 14-Mar-2018 eadler

Partial merge of the SPDX changes

These changes are incomplete but are making it difficult
to determine what other changes can/should be merged.

No objections from: pfg


# 323803 20-Sep-2017 kib

MFC r322571 (by br):
Add OBJ_PG_DTOR flag to VM object.


# 323802 20-Sep-2017 kib

MFC r321330 (by br):
Style.


# 323800 20-Sep-2017 kib

MFC r323558:
Use existing tag name for the vm_object's memq.


# 323537 13-Sep-2017 kib

MFC r322913:
Replace global swhash in swap pager with per-object trie to track swap
blocks assigned to the object pages.

MFC r322970 (by alc):
Do not call vm_pager_page_unswapped() on the fast fault path.

MFC r322971 (by alc):
Update a couple vm_object lock assertions in the swap pager.

MFC r323224:
In swp_pager_meta_build(), handle a race with other thread allocating
swapblk for our index while we dropped the object lock.

MFC r323226:
Do not leak empty swblk.


# 318716 23-May-2017 markj

MFC r308474, r308691, r309203, r309365, r309703, r309898, r310720,
r308489, r308706:
Add PQ_LAUNDRY and remove PG_CACHED pages.


# 315563 19-Mar-2017 kib

MFC r313690:
Consistently handle negative or wrapping offsets in the mmap(2) syscalls.

MFC r315158:
Fix two missed places where vm_object offset to index calculation
should use unsigned shift.
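
The unsigned-shift point can be illustrated with a minimal userspace sketch. The names toy_off_to_idx, toy_pindex_t and TOY_PAGE_SHIFT, and the 4 KB page size, are assumptions made for illustration and are not the kernel's definitions. Right-shifting a negative signed offset is implementation-defined and typically sign-extends; converting to an unsigned type first gives a well-defined index.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT	12	/* assumed 4 KB pages */

typedef uint64_t toy_pindex_t;	/* hypothetical stand-in for vm_pindex_t */

/*
 * Convert a byte offset to a page index.  The offset is converted to an
 * unsigned type before shifting, so an offset with the top bit set yields
 * a large but well-defined index instead of a sign-extended one.
 */
static toy_pindex_t
toy_off_to_idx(int64_t off)
{
	return ((toy_pindex_t)((uint64_t)off >> TOY_PAGE_SHIFT));
}

/* Usage: ordinary offsets and a "negative" (wrapped) offset. */
static void
toy_off_to_idx_example(void)
{
	assert(toy_off_to_idx(0) == 0);
	assert(toy_off_to_idx(8192) == 2);
	/* The upper 12 bits of the result are zero: no sign extension. */
	assert(toy_off_to_idx(-4096) == (UINT64_MAX >> TOY_PAGE_SHIFT));
}
```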


# 313383 07-Feb-2017 kib

MFC r313249:
Style, use tab after #define.


# 312073 13-Jan-2017 kib

MFC r309710:
Add a new populate() pager method and extend device pager ops vector
with cdev_pg_populate() to provide device drivers access to it.

MFC r310849:
Fix two similar bugs in the populate vm_fault() code.


# 302408 07-Jul-2016 gjb

Copy head@r302406 to stable/11 as part of the 11.0-RELEASE cycle.
Prune svn:mergeinfo from the new branch, as nothing has been merged
here.

Additional commits post-branch will follow.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


/freebsd-11-stable/MAINTAINERS
/freebsd-11-stable/cddl
/freebsd-11-stable/cddl/contrib/opensolaris
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/dtrace/test/tst/common/print
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zfs
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs
/freebsd-11-stable/contrib/amd
/freebsd-11-stable/contrib/apr
/freebsd-11-stable/contrib/apr-util
/freebsd-11-stable/contrib/atf
/freebsd-11-stable/contrib/binutils
/freebsd-11-stable/contrib/bmake
/freebsd-11-stable/contrib/byacc
/freebsd-11-stable/contrib/bzip2
/freebsd-11-stable/contrib/com_err
/freebsd-11-stable/contrib/compiler-rt
/freebsd-11-stable/contrib/dialog
/freebsd-11-stable/contrib/dma
/freebsd-11-stable/contrib/dtc
/freebsd-11-stable/contrib/ee
/freebsd-11-stable/contrib/elftoolchain
/freebsd-11-stable/contrib/elftoolchain/ar
/freebsd-11-stable/contrib/elftoolchain/brandelf
/freebsd-11-stable/contrib/elftoolchain/elfdump
/freebsd-11-stable/contrib/expat
/freebsd-11-stable/contrib/file
/freebsd-11-stable/contrib/gcc
/freebsd-11-stable/contrib/gcclibs/libgomp
/freebsd-11-stable/contrib/gdb
/freebsd-11-stable/contrib/gdtoa
/freebsd-11-stable/contrib/groff
/freebsd-11-stable/contrib/ipfilter
/freebsd-11-stable/contrib/ldns
/freebsd-11-stable/contrib/ldns-host
/freebsd-11-stable/contrib/less
/freebsd-11-stable/contrib/libarchive
/freebsd-11-stable/contrib/libarchive/cpio
/freebsd-11-stable/contrib/libarchive/libarchive
/freebsd-11-stable/contrib/libarchive/libarchive_fe
/freebsd-11-stable/contrib/libarchive/tar
/freebsd-11-stable/contrib/libc++
/freebsd-11-stable/contrib/libc-vis
/freebsd-11-stable/contrib/libcxxrt
/freebsd-11-stable/contrib/libexecinfo
/freebsd-11-stable/contrib/libpcap
/freebsd-11-stable/contrib/libstdc++
/freebsd-11-stable/contrib/libucl
/freebsd-11-stable/contrib/libxo
/freebsd-11-stable/contrib/llvm
/freebsd-11-stable/contrib/llvm/projects/libunwind
/freebsd-11-stable/contrib/llvm/tools/clang
/freebsd-11-stable/contrib/llvm/tools/lldb
/freebsd-11-stable/contrib/llvm/tools/llvm-dwarfdump
/freebsd-11-stable/contrib/llvm/tools/llvm-lto
/freebsd-11-stable/contrib/mdocml
/freebsd-11-stable/contrib/mtree
/freebsd-11-stable/contrib/ncurses
/freebsd-11-stable/contrib/netcat
/freebsd-11-stable/contrib/ntp
/freebsd-11-stable/contrib/nvi
/freebsd-11-stable/contrib/one-true-awk
/freebsd-11-stable/contrib/openbsm
/freebsd-11-stable/contrib/openpam
/freebsd-11-stable/contrib/openresolv
/freebsd-11-stable/contrib/pf
/freebsd-11-stable/contrib/sendmail
/freebsd-11-stable/contrib/serf
/freebsd-11-stable/contrib/sqlite3
/freebsd-11-stable/contrib/subversion
/freebsd-11-stable/contrib/tcpdump
/freebsd-11-stable/contrib/tcsh
/freebsd-11-stable/contrib/tnftp
/freebsd-11-stable/contrib/top
/freebsd-11-stable/contrib/top/install-sh
/freebsd-11-stable/contrib/tzcode/stdtime
/freebsd-11-stable/contrib/tzcode/zic
/freebsd-11-stable/contrib/tzdata
/freebsd-11-stable/contrib/unbound
/freebsd-11-stable/contrib/vis
/freebsd-11-stable/contrib/wpa
/freebsd-11-stable/contrib/xz
/freebsd-11-stable/crypto/heimdal
/freebsd-11-stable/crypto/openssh
/freebsd-11-stable/crypto/openssl
/freebsd-11-stable/gnu/lib
/freebsd-11-stable/gnu/usr.bin/binutils
/freebsd-11-stable/gnu/usr.bin/cc/cc_tools
/freebsd-11-stable/gnu/usr.bin/gdb
/freebsd-11-stable/lib/libc/locale/ascii.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris
/freebsd-11-stable/sys/contrib/dev/acpica
/freebsd-11-stable/sys/contrib/ipfilter
/freebsd-11-stable/sys/contrib/libfdt
/freebsd-11-stable/sys/contrib/octeon-sdk
/freebsd-11-stable/sys/contrib/x86emu
/freebsd-11-stable/sys/contrib/xz-embedded
/freebsd-11-stable/usr.sbin/bhyve/atkbdc.h
/freebsd-11-stable/usr.sbin/bhyve/bhyvegc.c
/freebsd-11-stable/usr.sbin/bhyve/bhyvegc.h
/freebsd-11-stable/usr.sbin/bhyve/console.c
/freebsd-11-stable/usr.sbin/bhyve/console.h
/freebsd-11-stable/usr.sbin/bhyve/pci_fbuf.c
/freebsd-11-stable/usr.sbin/bhyve/pci_xhci.c
/freebsd-11-stable/usr.sbin/bhyve/pci_xhci.h
/freebsd-11-stable/usr.sbin/bhyve/ps2kbd.c
/freebsd-11-stable/usr.sbin/bhyve/ps2kbd.h
/freebsd-11-stable/usr.sbin/bhyve/ps2mouse.c
/freebsd-11-stable/usr.sbin/bhyve/ps2mouse.h
/freebsd-11-stable/usr.sbin/bhyve/rfb.c
/freebsd-11-stable/usr.sbin/bhyve/rfb.h
/freebsd-11-stable/usr.sbin/bhyve/sockstream.c
/freebsd-11-stable/usr.sbin/bhyve/sockstream.h
/freebsd-11-stable/usr.sbin/bhyve/usb_emul.c
/freebsd-11-stable/usr.sbin/bhyve/usb_emul.h
/freebsd-11-stable/usr.sbin/bhyve/usb_mouse.c
/freebsd-11-stable/usr.sbin/bhyve/vga.c
/freebsd-11-stable/usr.sbin/bhyve/vga.h


# 302063 21-Jun-2016 kib

The vmtotal sysctl handler marks active vm objects to calculate
statistics. Marking is done by setting the OBJ_ACTIVE flag. The
flags change is locked, but the problem is that many parts of the system
assume that vm object initialization ensures that no other code could
change the object, and is thus performed lockless. The end result is
corrupted flags in vm objects, most visible is spurious OBJ_DEAD flag,
causing random hangs.

Avoid the active object marking, instead provide equally inexact but
immutable is_object_alive() definition for the object mapped state.

Avoid iterating over the processes mappings altogether by using
arguably improved definition of the paging thread as one which sleeps
on the v_free_count.

PR: 204764
Diagnosed by: pho
Tested by: pho (previous version)
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (gjb)


# 300043 17-May-2016 kib

Add implementation of robust mutexes, hopefully close enough to the
intention of the POSIX IEEE Std 1003.1TM-2008/Cor 1-2013.

A robust mutex is guaranteed to be cleared by the system upon either
thread or process owner termination while the mutex is held. The next
mutex locker is then notified about inconsistent mutex state and can
execute (or abandon) corrective actions.
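
From an application's point of view, the behaviour described above can be sketched with the standard POSIX robust mutex calls. This is a minimal illustration only, assuming a platform that implements pthread_mutexattr_setrobust(); the toy_ names are hypothetical and error handling is reduced to assertions.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t toy_lock;

/* Thread body: take the lock and terminate without releasing it. */
static void *
toy_die_holding_lock(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&toy_lock);
	return (NULL);
}

/* Returns the result of the first lock attempt after the owner exited. */
static int
toy_robust_demo(void)
{
	pthread_mutexattr_t attr;
	pthread_t td;
	int error;

	error = pthread_mutexattr_init(&attr);
	assert(error == 0);
	error = pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
	assert(error == 0);
	error = pthread_mutex_init(&toy_lock, &attr);
	assert(error == 0);
	pthread_mutexattr_destroy(&attr);

	error = pthread_create(&td, NULL, toy_die_holding_lock, NULL);
	assert(error == 0);
	pthread_join(td, NULL);

	/* The next locker is told that the previous owner died. */
	error = pthread_mutex_lock(&toy_lock);
	if (error == EOWNERDEAD) {
		/* We hold the lock: repair protected state, then mark it usable. */
		pthread_mutex_consistent(&toy_lock);
	}
	pthread_mutex_unlock(&toy_lock);
	pthread_mutex_destroy(&toy_lock);
	return (error);
}
```

A caller that cannot repair the protected state may instead unlock without calling pthread_mutex_consistent(), which leaves the mutex permanently unusable; that corresponds to the "abandon" case.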

The patch mostly consists of small changes here and there, adding
necessary checks for the inconsistent and abandoned conditions into
existing paths. Additionally, the thread exit handler was extended to
iterate over the userspace-maintained list of owned robust mutexes,
unlocking and marking as terminated each of them.

The list of owned robust mutexes cannot be maintained atomically
synchronous with the mutex lock state (it is possible in kernel, but
is too expensive). Instead, for the duration of lock or unlock
operation, the current mutex is remembered in a special slot that is
also checked by the kernel at thread termination.

The kernel must be aware of the per-thread location of the heads of
the robust mutex lists and of the current active mutex slot. When a thread
touches a robust mutex for the first time, a new umtx op syscall is
issued which informs the kernel of the location of the list heads.

The umtx sleep queues for PP and PI mutexes are split between
non-robust and robust.

Somewhat unrelated changes in the patch:
1. Style.
2. The fix for proper tdfind() call use in umtxq_sleep_pi() for shared
pi mutexes.
3. Removal of the userspace struct pthread_mutex m_owner field.
4. The sysctl kern.ipc.umtx_vnode_persistent is added, which controls
the lifetime of the shared mutex associated with a vnode's page.

Reviewed by: jilles (previous version, supposedly the objection was fixed)
Discussed with: brooks, Martin Simmons <martin@lispworks.com> (some aspects)
Tested by: pho
Sponsored by: The FreeBSD Foundation


# 296162 28-Feb-2016 kib

Implement process-shared locks support for libthr.so.3, without
breaking the ABI. Special value is stored in the lock pointer to
indicate shared lock, and offline page in the shared memory is
allocated to store the actual lock.

Reviewed by: vangyzen (previous version)
Discussed with: deischen, emaste, jhb, rwatson,
Martin Simmons <martin@lispworks.com>
Tested by: pho
Sponsored by: The FreeBSD Foundation


# 292373 16-Dec-2015 glebius

A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().

o With new KPI consumers can request contiguous ranges of pages, and
unlike before, all pages will be kept busied on return, like it was
done before with the 'reqpage' only. Now the reqpage goes away. With
new interface it is easier to implement code protected from race
conditions.

Such arrayed requests for now should be preceded by a call to
vm_pager_haspage() to make sure that request is possible. This
could be improved later, making vm_pager_haspage() obsolete.

Strengthening the promises on the busy state of the array of pages
allows us to remove such hacks as swp_pager_free_nrpage() and
vm_pager_free_nonreq().

o New KPI accepts two integer pointers that may optionally point at
values for read ahead and read behind, that a pager may do, if it
can. These pages are completely owned by pager, and not controlled
by the caller.

This shifts the UFS-specific readahead logic from vm_fault.c, which
should be file system agnostic, into vnode_pager.c. It also removes
one VOP_BMAP() request per hard fault.

Discussed with: kib, alc, jeff, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix


# 288431 30-Sep-2015 markj

As a step towards the elimination of PG_CACHED pages, rework the handling
of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to
the head of the inactive queue instead of being cached.

This affects the implementation of POSIX_FADV_NOREUSE as well, since it
works by applying POSIX_FADV_DONTNEED to file ranges after they have been
read or written. At that point the corresponding buffers may still be
dirty, so the previous implementation would coalesce successive ranges and
apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the
dirty buffers would eventually be cached. To preserve this behaviour in an
efficient manner, this change adds a new buf flag, B_NOREUSE, which causes
the pages backing a VMIO buf to be placed at the head of the inactive queue
when the buf is released. POSIX_FADV_NOREUSE then works by setting this
flag in bufs that underlie the specified range.
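
The "read, then apply POSIX_FADV_DONTNEED to the same range" pattern described above has a direct userspace analogue, sketched below. This is not the kernel implementation; toy_read_noreuse and the scratch file path are hypothetical, and the sketch assumes a writable /tmp.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Read up to len bytes at offset off, then advise the kernel that the
 * range will not be needed again.  Returns the number of bytes read,
 * or -1 on failure.
 */
static ssize_t
toy_read_noreuse(int fd, void *buf, size_t len, off_t off)
{
	ssize_t n;

	n = pread(fd, buf, len, off);
	if (n > 0) {
		/* posix_fadvise() returns an error number, not -1 and errno. */
		if (posix_fadvise(fd, off, (off_t)n, POSIX_FADV_DONTNEED) != 0)
			return (-1);
	}
	return (n);
}

/* Usage: write a small scratch file, then read it back with the helper. */
static ssize_t
toy_read_noreuse_example(void)
{
	char path[] = "/tmp/toy_fadvise.XXXXXX";
	char buf[16];
	ssize_t n, written;
	int fd;

	fd = mkstemp(path);
	assert(fd >= 0);
	written = write(fd, "hello", 5);
	assert(written == 5);
	n = toy_read_noreuse(fd, buf, sizeof(buf), 0);
	close(fd);
	unlink(path);
	return (n);
}
```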

Reviewed by: alc, kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3726


# 285878 25-Jul-2015 kib

Revert r173708's modifications to vm_object_page_remove().

Assume that a vnode is mapped shared and mlocked(), and then the vnode
is truncated, or truncated and then again extended past the mapping
point EOF. Truncation removes the pages past the truncation point,
and if pages are later created at this range, they are not properly
mapped into the mlocked region, and their wiring count is wrong.

The revert leaves the invalidated but wired pages on the object queue,
which means that the pages are found by vm_object_unwire() when the
mapped range is munlock()ed, and reused by the buffer cache when the
vnode is extended again.

The changes in r173708 were required since then vm_map_unwire() looked
at the page tables to find the page to unwire. This is no longer
needed with the vm_object_unwire() introduction, which follows the
objects shadow chain.

Also eliminate OBJPR_NOTWIRED flag for vm_object_page_remove(), which
is now redundant, since we do not remove wired pages.

Reported by: trasz, Dmitry Sivachenko <trtrmitya@gmail.com>
Suggested and reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 283924 02-Jun-2015 vangyzen

Provide vnode in memory map info for files on tmpfs

When providing memory map information to userland, populate the vnode pointer
for tmpfs files. Set the memory mapping to appear as a vnode type, to match
FreeBSD 9 behavior.

This fixes the use of tmpfs files with the dtrace pid provider,
procstat -v, procfs, linprocfs, pmc (pmcstat), and ptrace (PT_VM_ENTRY).

Submitted by: Eric Badger <eric@badgerio.us> (initial revision)
Obtained from: Dell Inc.
PR: 198431
MFC after: 2 weeks
Reviewed by: jhb
Approved by: kib (mentor)


# 280327 21-Mar-2015 alc

Introduce vm_object_color() and use it in mmap(2) to set the color of
named objects to zero before the virtual address is selected. Previously,
the color setting was delayed until after the virtual address was
selected. In rtld, this delay effectively prevented the mapping of a
shared library's code section using superpages. Now, for example, we see
the first 1 MB of libc's code on armv6 mapped by a superpage after we've
gotten through the initial cold misses that bring the first 1 MB of code
into memory. (With the page clustering that we perform on read faults,
this happens quickly.)

Differential Revision: https://reviews.freebsd.org/D2013
Reviewed by: jhb, kib
Tested by: Svatopluk Kraus (armv6)
MFC after: 6 weeks


# 277828 28-Jan-2015 kib

Update mtime for tmpfs files modified through memory mapping. Similar
to UFS, perform updates during syncer scans, which in particular means
that tmpfs now performs scan on sync. Also, this means that a mtime
update may be delayed up to 30 seconds after the write.

The vm_object's OBJ_TMPFS_DIRTY flag for the tmpfs swap object is similar
to the OBJ_MIGHTBEDIRTY flag for the vnode object: it indicates that the
object could have been dirtied. Adapt the fast page fault handler and
vm_object_set_writeable_dirty() to handle OBJ_TMPFS_NODE the same as
OBJT_VNODE.

Reported by: Ronald Klop <ronald-lists@klop.ws>
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 271417 11-Sep-2014 alc

Update a stale comment.


# 269642 06-Aug-2014 kib

Add wrappers to assert that vm object is unlocked and for try upgrade.

Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 269134 26-Jul-2014 alc

When unwiring a region of an address space, do not assume that the
underlying physical pages are mapped by the pmap. If, for example, the
application has performed an mprotect(..., PROT_NONE) on any part of the
wired region, then those pages will no longer be mapped by the pmap.
So, using the pmap to lookup the wired pages in order to unwire them
doesn't always work, and when it doesn't work wired pages are leaked.

To avoid the leak, introduce and use a new function vm_object_unwire()
that locates the wired pages by traversing the object and its backing
objects.

At the same time, switch from using pmap_change_wiring() to the recently
introduced function pmap_unwire() for unwiring the region's mappings.
pmap_unwire() is faster, because it operates on a range of virtual addresses
rather than a single virtual page at a time. Moreover, by operating on
a range, it is superpage friendly. It doesn't waste time performing
unnecessary demotions.

Reported by: markj
Reviewed by: kib
Tested by: pho, jmg (arm)
Sponsored by: EMC / Isilon Storage Division


# 268615 14-Jul-2014 kib

The OBJ_TMPFS flag of vm_object means that there is unreclaimed tmpfs
vnode for the tmpfs node owning this object. The flag is currently
used for two purposes. First, it allows to correctly handle VV_TEXT
for tmpfs vnode when the ref count on the object is decremented to 1,
similar to vnode_pager_dealloc() for regular filesystems. Second, it
prevents some operations, which are done on OBJT_SWAP vm objects
backing user anonymous memory, but are incorrect for the object owned
by tmpfs node.

The second kind of use of the OBJ_TMPFS flag is incorrect, since the
vnode might be reclaimed, which clears the flag, but vm object
operations must still be disallowed.

Introduce one more flag, OBJ_TMPFS_NODE, which is permanently set on
the object for VREG tmpfs node, and used instead of OBJ_TMPFS to test
whether vm object collapse and similar actions should be disabled.
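
The distinction between the two flags can be sketched as follows. The bit values and the toy_ helpers are illustrative assumptions, not the definitions from vm_object.h; the point is only that the transient flag goes away on vnode reclamation while the permanent one keeps collapse disabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only. */
#define TOY_OBJ_TMPFS		0x1u	/* an unreclaimed tmpfs vnode exists */
#define TOY_OBJ_TMPFS_NODE	0x2u	/* object is owned by a VREG tmpfs node */

struct toy_object {
	unsigned int flags;
};

/* Reclaiming the vnode clears only the transient flag. */
static void
toy_vnode_reclaim(struct toy_object *obj)
{
	/* Only objects owned by a tmpfs node have such a vnode. */
	assert((obj->flags & TOY_OBJ_TMPFS_NODE) != 0);
	obj->flags &= ~TOY_OBJ_TMPFS;
}

/* Collapse stays disabled for as long as a tmpfs node owns the object. */
static bool
toy_collapse_allowed(const struct toy_object *obj)
{
	return ((obj->flags & TOY_OBJ_TMPFS_NODE) == 0);
}
```

Testing TOY_OBJ_TMPFS in toy_collapse_allowed() instead would wrongly permit collapse once the vnode had been reclaimed, which is the bug the commit describes.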

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 254141 09-Aug-2013 attilio

On all architectures, avoid preallocating the physical memory
for nodes used in vm_radix.
On architectures supporting direct mapping, also avoid preallocating
the KVA for such nodes.

To do so, allow the operations derived from vm_radix_insert()
to fail, and handle all of the resulting failures.

vm_radix-wise introduce a new function called vm_radix_replace(),
which can replace a leaf node, already present, with a new one,
and take into account the possibility, during vm_radix_insert()
allocation, that the operations on the radix trie can recurse.
This means that if operations in vm_radix_insert() recursed
vm_radix_insert() will start from scratch again.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc (older version)
Reviewed by: jeff
Tested by: pho, scottl


# 253189 11-Jul-2013 kib

Never remove user-wired pages from an object when doing
msync(MS_INVALIDATE). The vm_fault_copy_entry() requires that object
range which corresponds to the user-wired vm_map_entry, is always
fully populated.

Add OBJPR_NOTWIRED flag for vm_object_page_remove() to request the
preserving behaviour, use it when calling vm_object_page_remove() from
vm_object_sync().

Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 250884 21-May-2013 attilio

o Relax locking assertions for vm_page_find_least()
o Relax locking assertions for pmap_enter_object() and add them also
to architectures that currently don't have any
o Introduce VM_OBJECT_LOCK_DOWNGRADE() which is basically a downgrade
operation on the per-object rwlock
o Use all the mechanisms above to make vm_map_pmap_enter() work
most of the time with read locks only.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc


# 250030 28-Apr-2013 kib

Rework the handling of the tmpfs node backing swap object and tmpfs
vnode v_object to avoid double-buffering. Use the same object both as
the backing store for tmpfs node and as the v_object.

Besides reducing memory use by up to 2x when mapping files from
tmpfs, it also makes tmpfs read and write operations copy half as
many bytes.

VM subsystem was already slightly adapted to tolerate OBJT_SWAP object
as v_object. Now the vm_object_deallocate() is modified to not
reinstantiate OBJ_ONEMAPPING flag and help the VFS to correctly handle
VV_TEXT flag on the last dereference of the tmpfs backing object.

Reviewed by: alc
Tested by: pho, bf
MFC after: 1 month


# 248449 17-Mar-2013 attilio

Sync back vmcontention branch into HEAD:
Replace the per-object resident and cached pages splay tree with a
path-compressed multi-digit radix trie.
Along with this, switch also the x86-specific handling of idle page
tables to using the radix trie.

This change is supposed to do the following:
- Allowing the acquisition of read locking for lookup operations of the
resident/cached pages collections as the per-vm_page_t splay iterators
are now removed.
- Increase the scalability of the operations on the page collections.

The radix trie relies on the consumers' locking to ensure atomicity of
its operations. In order to avoid deadlocks the bisection nodes are
pre-allocated in the UMA zone. This can be done safely because the
algorithm needs at most one new node per insert, which means the
maximum number of the desired nodes is the number of available physical
frames themselves. However, a new bisection node is not needed on
every insert.

The radix trie implements path-compression because UFS indirect blocks
can lead to several objects with a very sparse trie, increasing the number
of levels to usually scan. It also helps in the nodes pre-fetching by
introducing the single node per-insert property.

This code is not generalized (yet) because of the possible loss of
performance from making many of the sizes in play configurable.
However, efforts to make this code more general, and thus reusable by
other consumers, may well follow.

The only KPI change is the removal of the function vm_page_splay() which
is now reaped.
The only KBI change, instead, is the removal of the left/right iterators
from struct vm_page, which are now reaped.

Further technical notes, broken into pieces, can be retrieved from the
svn branch:
http://svn.freebsd.org/base/user/attilio/vmcontention/

Sponsored by: EMC / Isilon storage division
In collaboration with: alc, jeff
Tested by: flo, pho, jhb, davide
Tested by: ian (arm)
Tested by: andreast (powerpc)


# 248084 09-Mar-2013 attilio

Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical, but a few notes follow:
* The KPI changes as follows:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
For this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI is heavily broken by this commit. Third-party ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho


# 248082 09-Mar-2013 attilio

Merge from vmc-playground:
Introduce a new KPI that verifies if the page cache is empty for a
specified vm_object. This KPI does not make assumptions about the
locking in order to be used also for building assertions at init and
destroy time.
It is mostly used to hide implementation details of the page cache.

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: alc (vm_radix based version)
Tested by: flo, pho, jhb, davide


# 247400 27-Feb-2013 attilio

Merge from vmobj-rwlock:
VM_OBJECT_LOCKED() macro is only used to implement a custom version
of lock assertions right now (which likely spread out thanks to
copy and paste).
Remove it and implement actual assertions.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho


# 247360 26-Feb-2013 attilio

Merge from vmc-playground branch:
Replace the sub-optimal uma_zone_set_obj() primitive with more modern
uma_zone_reserve_kva(). The new primitive reserves beforehand
the necessary KVA space to cater the zone allocations and allocates pages
with ALLOC_NOOBJ. More specifically:
- uma_zone_reserve_kva() does not need an object to cater the backend
allocator.
- uma_zone_reserve_kva() can cater M_WAITOK requests, in order to
serve zones which need to do uma_prealloc() too.
- When possible, uma_zone_reserve_kva() uses directly the direct-mapping
by uma_small_alloc() rather than relying on the KVA / offset
combination.

The removal of the object attribute allows 2 further changes:
1) _vm_object_allocate() becomes static within vm_object.c
2) VM_OBJECT_LOCK_INIT() is removed. This function is replaced by
direct calls to mtx_init() as there is no need to export it anymore
and the calls are no longer homogeneous: there are now small
differences between arguments passed to mtx_init().

Sponsored by: EMC / Isilon storage division
Reviewed by: alc (which also offered almost all the comments)
Tested by: pho, jhb, davide


# 247323 26-Feb-2013 attilio

Wrap the sleeps synchronized by the vm_object lock into the specific
macro VM_OBJECT_SLEEP().
This hides some implementation details like the usage of the msleep()
primitive and the necessity to access the lock address directly.
For this reason VM_OBJECT_MTX() macro is now retired.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho


# 245226 09-Jan-2013 ken

Fix a bug in the device pager code that can trigger an assertion
in devfs if a particular race condition is hit in the device pager
code.

This was a side effect of change 227530 which changed the device
pager interface to call a new destructor routine for the cdev.
That destructor routine, old_dev_pager_dtor(), takes a VM object
handle.

The object handle is cast to a struct cdev *, and passed into
dev_rel().

That works in most cases, except the case in cdev_pager_allocate()
where there is a race condition between two threads allocating an
object backed by the same device. The loser of the race
deallocates its object at the end of the function.

The problem is that before inserting the object into the
dev_pager_object_list, the object's handle is changed from the
struct cdev pointer to the object's own address. This is to avoid
conflicts with the winner of the race, which already inserted an
object in the list with a handle that is a pointer to the same cdev
structure.

The object is then passed to vm_object_deallocate(), and eventually
makes its way down to old_dev_pager_dtor(). That function passes
the handle pointer (which is actually a VM object, not a struct
cdev as usual) into dev_rel(). dev_rel() decrements the reference
count in the assumed struct cdev (which happens to be 0), and
that triggers the assertion in dev_rel() that the reference count
is greater than or equal to 0.

The fix is to add a cdev pointer to the VM object, and use that
pointer when calling the cdev_pg_dtor() routine.

vm_object.h: Add a struct cdev pointer to the VM object
structure.

device_pager.c: In cdev_pager_allocate(), populate the new cdev
pointer.

In dev_pager_dealloc(), use the new cdev pointer
when calling the object's cdev_pg_dtor() routine.
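
The shape of the fix can be sketched in isolation: once the object carries a dedicated device pointer, the destructor no longer has to guess what the handle points to. The toy_ structures and functions below are hypothetical simplifications, not the device pager code.

```c
#include <assert.h>
#include <stddef.h>

struct toy_cdev {
	int refcount;
};

struct toy_object {
	void *handle;		/* lookup key; may be retargeted */
	struct toy_cdev *cdev;	/* always the backing device */
};

/* Drop one device reference and check that the count stays non-negative. */
static void
toy_dev_rel(struct toy_cdev *dev)
{
	dev->refcount--;
	assert(dev->refcount >= 0);
}

/* Destructor path: use the dedicated pointer, never the handle. */
static void
toy_pager_dealloc(struct toy_object *obj)
{
	toy_dev_rel(obj->cdev);
	obj->cdev = NULL;
}
```

In the race described above, the losing object's handle is rewritten to point at the object itself, but its cdev field still names the device, so the reference is dropped on the right structure.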

Reviewed by: kib
Sponsored by: Spectra Logic Corporation
MFC after: 1 week


# 244043 08-Dec-2012 alc

In the past four years, we've added two new vm object types. Each time,
similar changes had to be made in various places throughout the machine-
independent virtual memory layer to support the new vm object type.
However, in most of these places, it's actually not the type of the vm
object that matters to us but instead certain attributes of its pages.
For example, OBJT_DEVICE, OBJT_MGTDEVICE, and OBJT_SG objects contain
fictitious pages. In other words, in most of these places, we were
testing the vm object's type to determine if it contained fictitious (or
unmanaged) pages.

To both simplify the code in these places and make the addition of future
vm object types easier, this change introduces two new vm object flags
that describe attributes of the vm object's pages, specifically, whether
they are fictitious or unmanaged.
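
A rough sketch of the idea: page attributes are recorded once, when the object is created, and later code tests a flag rather than enumerating object types. The flag names, bit values and the choice of which types count as unmanaged are assumptions made for illustration; the commit message only states that the three listed types contain fictitious pages.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_objtype {
	TOY_OBJT_DEFAULT,
	TOY_OBJT_DEVICE,
	TOY_OBJT_MGTDEVICE,
	TOY_OBJT_SG
};

#define TOY_OBJ_FICTITIOUS	0x1u	/* pages are fictitious */
#define TOY_OBJ_UNMANAGED	0x2u	/* pages are unmanaged */

struct toy_object {
	enum toy_objtype type;
	unsigned int flags;
};

/* The type is consulted once, at allocation time. */
static struct toy_object
toy_object_allocate(enum toy_objtype type)
{
	struct toy_object obj = { type, 0 };

	switch (type) {
	case TOY_OBJT_DEVICE:
	case TOY_OBJT_SG:
		/* Assumed for illustration: fictitious and unmanaged. */
		obj.flags = TOY_OBJ_FICTITIOUS | TOY_OBJ_UNMANAGED;
		break;
	case TOY_OBJT_MGTDEVICE:
		/* Assumed for illustration: fictitious but managed. */
		obj.flags = TOY_OBJ_FICTITIOUS;
		break;
	default:
		break;
	}
	return (obj);
}

/* Callers ask about the attribute instead of listing types. */
static bool
toy_has_fictitious_pages(const struct toy_object *obj)
{
	return ((obj->flags & TOY_OBJ_FICTITIOUS) != 0);
}
```

Adding a new object type then means touching only the allocation switch, not every attribute test.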

Reviewed and tested by: kib


# 237451 22-Jun-2012 attilio

- Add a comment explaining the locking of the cached pages pool held
by vm_objects.
- Add flags for the per-object lock and free pages queue mutex lock.
Use the newly added flags to mark the cache root within the vm_object
structure.

Please note that other vm_object members should be marked with correct
locking but they are left for other commits.

In collaboration with: alc

MFC after: 3 days


# 233191 19-Mar-2012 jhb

Fix madvise(MADV_WILLNEED) to properly handle individual mappings larger
than 4GB. Specifically, the inlined version of 'ptoa' of the 'int'
count of pages overflowed on 64-bit platforms. While here, change
vm_object_madvise() to accept two vm_pindex_t parameters (start and end)
rather than a (start, count) tuple to match other VM APIs as suggested
by alc@.
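
The overflow can be reproduced in miniature. The original bug involved a signed 'int', where overflow is undefined behaviour; the sketch below uses an unsigned 32-bit count instead so that the wraparound is well defined and observable. TOY_PAGE_SHIFT and the toy_ptoa_* helpers are illustrative assumptions, with 4 KB pages.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT	12	/* assumed 4 KB pages */

/* Narrow arithmetic: the shift is done in 32 bits and wraps at 4 GB. */
static uint64_t
toy_ptoa_narrow(uint32_t npages)
{
	return ((uint32_t)(npages << TOY_PAGE_SHIFT));
}

/* Widen first, then shift: correct for mappings of 4 GB and larger. */
static uint64_t
toy_ptoa_wide(uint32_t npages)
{
	return ((uint64_t)npages << TOY_PAGE_SHIFT);
}

/* Usage: 2^20 pages of 4 KB each is exactly 4 GB. */
static void
toy_ptoa_example(void)
{
	assert(toy_ptoa_narrow(UINT32_C(1) << 20) == 0);
	assert(toy_ptoa_wide(UINT32_C(1) << 20) == UINT64_C(4294967296));
}
```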


# 233100 17-Mar-2012 kib

In vm_object_page_clean(), do not clean OBJ_MIGHTBEDIRTY object flag
if the filesystem performed short write and we are skipping the page
due to this.

Propagate write error from the pager back to the callers of
vm_pageout_flush(). Report the failure to write a page from the
requested range as the FALSE return value from vm_object_page_clean(),
and propagate it back to msync(2) to return EIO to usermode.

While there, convert the clearobjflags variable in the
vm_object_page_clean() and arguments of the helper functions to
boolean.

PR: kern/165927
Reviewed by: alc
MFC after: 2 weeks


# 232071 23-Feb-2012 kib

Account the writeable shared mappings backed by file in the vnode
v_writecount. Keep the amount of the virtual address space used by
the mappings in the new vm_object un_pager.vnp.writemappings
counter. The vnode v_writecount is incremented when writemappings gets
non-zero value, and decremented when writemappings is returned to
zero.
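
The transition rule in this paragraph (bump v_writecount only when writemappings leaves or returns to zero) can be sketched as below. The structures and helpers are toy stand-ins, not the kernel's vnode or vm_object; only the two field names are borrowed from the text.

```c
#include <assert.h>
#include <stdint.h>

struct toy_vnode {
	int v_writecount;
};

struct toy_object {
	struct toy_vnode *vp;
	uint64_t writemappings;	/* bytes of writeable shared mappings */
};

/* Account a new writeable shared mapping of size bytes. */
static void
toy_writemappings_add(struct toy_object *obj, uint64_t size)
{
	if (size == 0)
		return;
	if (obj->writemappings == 0)
		obj->vp->v_writecount++;	/* zero to non-zero */
	obj->writemappings += size;
}

/* Release size bytes of previously accounted mappings. */
static void
toy_writemappings_release(struct toy_object *obj, uint64_t size)
{
	assert(size <= obj->writemappings);
	if (size == 0)
		return;
	obj->writemappings -= size;
	if (obj->writemappings == 0)
		obj->vp->v_writecount--;	/* back to zero */
}
```

However many mappings exist, the vnode sees at most one reference from this object, which is what lets v_writecount block a read-only remount or an execve(2) of the file.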

Writeable shared vnode-backed mappings are accounted for in vm_mmap(),
and vm_map_insert() is instructed to set MAP_ENTRY_VN_WRITECNT flag on
the created map entry. During deferred map entry deallocation,
vm_map_process_deferred() checks for MAP_ENTRY_VN_WRITECNT and
decrements writemappings for the vm object.

Now, the writeable mount cannot be demoted to read-only while
writeable shared mappings of the vnodes from the mount point
exist. Also, execve(2) fails for such files with ETXTBSY, as it
should be.

Noted by: tegge
Reviewed by: tegge (long time ago, early version), alc
Tested by: pho
MFC after: 3 weeks


# 229934 10-Jan-2012 kib

Change the type of the paging_in_progress refcounter from u_short to
u_int. With the auto-sized buffer cache on the modern machines, UFS
metadata can generate more than 65535 pages belonging to the buffers
undergoing i/o, overflowing the counter.
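
The wraparound is easy to demonstrate with fixed-width counters. The sketch assumes u_short is 16 bits and u_int is 32 bits, as on the platforms in question; the toy_count helpers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Count n in-flight pages with a 16-bit counter (the old width). */
static unsigned int
toy_count16(unsigned int n)
{
	uint16_t pip = 0;
	unsigned int i;

	for (i = 0; i < n; i++)
		pip++;		/* wraps to 0 after 65535 */
	return (pip);
}

/* Count n in-flight pages with a 32-bit counter (the new width). */
static unsigned int
toy_count32(unsigned int n)
{
	uint32_t pip = 0;
	unsigned int i;

	for (i = 0; i < n; i++)
		pip++;
	return (pip);
}

/* Usage: the 65536th page makes the narrow counter read as idle. */
static void
toy_count_example(void)
{
	assert(toy_count16(65535) == 65535);
	assert(toy_count16(65536) == 0);
	assert(toy_count32(65536) == 65536);
}
```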

Reported and tested by: jimharris
Reviewed by: alc
MFC after: 1 week
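
The overflow can be illustrated with a small sketch. The sketch_pip_*
functions are hypothetical and only model the counter's type; the
example assumes a 16-bit unsigned short and an unsigned int of at
least 32 bits, as on the platforms FreeBSD supports.

```c
#include <assert.h>

/* Old counter type (u_short): the sum is truncated back to 16 bits. */
unsigned short
sketch_pip_add_old(unsigned short count, unsigned int n)
{
	return ((unsigned short)(count + n));
}

/* New counter type (u_int): the same sum fits without wrapping. */
unsigned int
sketch_pip_add_new(unsigned int count, unsigned int n)
{
	return (count + n);
}
```

Under those assumptions, adding 65536 references to an old-style
counter that starts at zero leaves it at zero again, as if no paging
were in progress, while the wider counter holds the full value.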


# 227530 15-Nov-2011 kib

Update the device pager interface, while keeping the compatibility
layer for old KPI and KBI. New interface should be used together with
d_mmap_single cdevsw method.

Device pager can be allocated with the cdev_pager_allocate(9)
function, which takes struct cdev_pager_ops, containing
constructor/destructor and page fault handler methods supplied by
driver.

Constructor and destructor, called at the pager allocation and
deallocation time, allow the driver to handle per-object private data.

The pager handler is called to handle page fault on the vm map entry
backed by the driver pager. Driver shall return either the vm_page_t
which should be mapped, or error code (which does not cause kernel
panic anymore). The page handler interface has a placeholder to
specify the access mode causing the fault, but currently PROT_READ is
always passed there.

Sponsored by: The FreeBSD Foundation
Reviewed by: alc
MFC after: 1 month


# 227070 04-Nov-2011 jhb

Add the posix_fadvise(2) system call. It is somewhat similar to
madvise(2) except that it operates on a file descriptor instead of a
memory region. It is currently only supported on regular files.

Just as with madvise(2), the advice given to posix_fadvise(2) can be
divided into two types. The first type provides hints about data access
patterns and is used in the file read and write routines to modify the
I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are
thus filesystem independent. Note that to ease implementation (and
since this API is only advisory anyway), only a single non-normal
range is allowed per file descriptor.

The second type of hints are used to hint to the OS that data will or
will not be used. These hints are implemented via a new VOP_ADVISE().
A default implementation is provided which does nothing for the WILLNEED
request and attempts to move any clean pages to the cache page queue for
the DONTNEED request. This latter case required two other changes.
First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests
vinvalbuf() to only flush clean buffers for the vnode from the buffer
cache and to not remove any backing pages from the vnode. This is
used to ensure clean pages are not wired into the buffer cache before
attempting to move them to the cache page queue. The second change adds
a new vm_object_page_cache() method. This method is somewhat similar to
vm_object_page_remove() except that instead of freeing each page in the
specified range, it attempts to move clean pages to the cache queue if
possible.

To preserve the ABI of struct file, the f_cdevpriv pointer is now reused
in a union to point to the currently active advice region if one is
present for regular files.

Reviewed by: jilles, kib, arch@
Approved by: re (kib)
MFC after: 1 month


# 223677 29-Jun-2011 alc

Add a new option, OBJPR_NOTMAPPED, to vm_object_page_remove(). Passing this
option to vm_object_page_remove() asserts that the specified range of pages
is not mapped, or more precisely that none of these pages have any managed
mappings. Thus, vm_object_page_remove() need not call pmap_remove_all() on
the pages.

This change not only saves time by eliminating pointless calls to
pmap_remove_all(), but it also eliminates an inconsistency in the use of
pmap_remove_all() versus related functions, like pmap_remove_write(). It
eliminates harmless but pointless calls to pmap_remove_all() that were being
performed on PG_UNMANAGED pages.

Update all of the existing assertions on pmap_remove_all() to reflect this
change.

Reviewed by: kib


# 218345 05-Feb-2011 alc

Unless "cnt" exceeds MAX_COMMIT_COUNT, nfsrv_commit() and nfsvno_fsync() are
incorrectly calling vm_object_page_clean(). They are passing the length of
the range rather than the ending offset of the range.

Perform the OFF_TO_IDX() conversion in vm_object_page_clean() rather than the
callers.

Reviewed by: kib
MFC after: 3 weeks
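
The difference between passing a length and passing an ending offset
can be sketched as follows. This is an illustration only: the sketch_*
names are hypothetical, a 4 KB page size is assumed, and the shift
stands in for the OFF_TO_IDX() conversion that this commit moves into
vm_object_page_clean().

```c
#include <assert.h>
#include <stdint.h>

#define	SKETCH_PAGE_SHIFT	12	/* assumption: 4 KB pages */
#define	SKETCH_PAGE_SIZE	((uint64_t)1 << SKETCH_PAGE_SHIFT)

/* Page index range [first_idx, end_idx) derived from byte offsets. */
struct sketch_range {
	uint64_t	first_idx;
	uint64_t	end_idx;	/* exclusive */
};

/*
 * Convert byte offsets to page indices inside the callee, so that
 * callers pass plain byte offsets.  "end" is an ending offset, not a
 * length; it is rounded up to the next page boundary.
 */
struct sketch_range
sketch_clean_range(uint64_t start, uint64_t end)
{
	struct sketch_range r;

	r.first_idx = start >> SKETCH_PAGE_SHIFT;
	r.end_idx = (end + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
	return (r);
}
```

With start = 0x10000 and a length of 0x2000, passing the ending offset
0x12000 covers page indices 16 and 17. Passing the length 0x2000 as
the end instead gives an end index of 2, which lies below the first
index of 16, so the intended pages are not covered.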


# 216810 29-Dec-2010 kib

Remove OBJ_CLEANING flag. The vfs_setdirty_locked_object() is the only
consumer of the flag, and it used the flag because OBJ_MIGHTBEDIRTY
was cleared early in vm_object_page_clean, before the cleaning pass
was done. This is no longer true after r216799.

Moreover, since OBJ_CLEANING is a flag, and not the counter, it could
be reset too prematurely when parallel vm_object_page_clean() are
performed.

Reviewed by: alc (as a part of the bigger patch)
MFC after: 1 month (after r216799 is merged)


# 216731 27-Dec-2010 alc

Move vm_object_print()'s prototype to the expected place.


# 216128 02-Dec-2010 trasz

Replace pointer to "struct uidinfo" with pointer to "struct ucred"
in "struct vm_object". This is required to make it possible to account
for per-jail swap usage.

Reviewed by: kib@
Tested by: pho@
Sponsored by: FreeBSD Foundation


# 200770 21-Dec-2009 kib

VI_OBJDIRTY vnode flag mirrors the state of OBJ_MIGHTBEDIRTY vm object
flag. Besides providing the redundant information, the need to update both
vnode and object flags causes more acquisition of vnode interlock.
OBJ_MIGHTBEDIRTY is only checked for vnode-backed vm objects.

Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for
vnode-backed vm objects.

Suggested and reviewed by: alc
Tested by: pho
MFC after: 3 weeks


# 195840 24-Jul-2009 jhb

Add a new type of VM object: OBJT_SG. An OBJT_SG object is very similar to
a device pager (OBJT_DEVICE) object in that it uses fictitious pages to
provide aliases to other memory addresses. The primary difference is that
it uses an sglist(9) to determine the physical addresses for a given offset
into the object instead of invoking the d_mmap() method in a device driver.

Reviewed by: alc
Approved by: re (kensmith)
MFC after: 2 weeks


# 195649 12-Jul-2009 alc

Add support to the virtual memory system for configuring machine-
dependent memory attributes:

Rename vm_cache_mode_t to vm_memattr_t. The new name reflects the
fact that there are machine-dependent memory attributes that have
nothing to do with controlling the cache's behavior.

Introduce vm_object_set_memattr() for setting the default memory
attributes that will be given to an object's pages.

Introduce and use pmap_page_{get,set}_memattr() for getting and
setting a page's machine-dependent memory attributes. Add full
support for these functions on amd64 and i386 and stubs for them on
the other architectures. The function pmap_page_set_memattr() is also
responsible for any other machine-dependent aspects of changing a
page's memory attributes, such as flushing the cache or updating the
direct map. The uses include kmem_alloc_contig(), vm_page_alloc(),
and the device pager:

kmem_alloc_contig() can now be used to allocate kernel memory with
non-default memory attributes on amd64 and i386.

vm_page_alloc() and the device pager will set the memory attributes
for the real or fictitious page according to the object's default
memory attributes.

Update the various pmap functions on amd64 and i386 that map pages to
incorporate each page's memory attributes in the mapping.

Notes: (1) Inherent to this design are safety features that prevent
the specification of inconsistent memory attributes by different
mappings on amd64 and i386. In addition, the device pager provides a
warning when a device driver creates a fictitious page with memory
attributes that are inconsistent with the real page that the
fictitious page is an alias for. (2) Storing the machine-dependent
memory attributes for amd64 and i386 as a dedicated "int" in "struct
md_page" represents a compromise between space efficiency and the ease
of MFCing these changes to RELENG_7.

In collaboration with: jhb

Approved by: re (kib)


# 194766 23-Jun-2009 kib

Implement global and per-uid accounting of the anonymous memory. Add
rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved
for the uid.

The accounting information (charge) is associated with either map entry,
or vm object backing the entry, assuming the object is the first one
in the shadow chain and entry does not require COW. Charge is moved
from entry to object on allocation of the object, e.g. during the mmap,
assuming the object is allocated, or on the first page fault on the
entry. It moves back to the entry on forks due to COW setup.

The per-entry granularity of accounting makes the charge process fair
for processes that change uid during lifetime, and decrements charge
for proper uid when region is unmapped.

The interface of vm_pager_allocate(9) is extended by adding struct ucred *,
that is used to charge appropriate uid when allocation is performed by
kernel, e.g. md(4).

Several syscalls, among them fork(2), may now return ENOMEM when
global or per-uid limits are enforced.

In collaboration with: pho
Reviewed by: alc
Approved by: re (kensmith)


# 194209 14-Jun-2009 alc

Long, long ago in r27464 special case code for mapping device-backed
memory with 4MB pages was added to pmap_object_init_pt(). This code
assumes that the pages of an OBJT_DEVICE object are always physically
contiguous. Unfortunately, this is not always the case. For example,
jhb@ informs me that the recently introduced /dev/ksyms driver creates
an OBJT_DEVICE object that violates this assumption. Thus, this
revision modifies pmap_object_init_pt() to abort the mapping if the
OBJT_DEVICE object's pages are not physically contiguous. This
revision also changes some inconsistent if not buggy behavior. For
example, the i386 version aborts if the first 4MB virtual page that
would be mapped is already valid. However, it incorrectly replaces
any subsequent 4MB virtual page mappings that it encounters,
potentially leaking a page table page. The amd64 version has a bug of
my own creation. It potentially busies the wrong page and always an
insufficient number of pages if it blocks allocating a page table page.

To my knowledge, there have been no reports of these bugs, hence,
their persistence. I suspect that the existing restrictions that
pmap_object_init_pt() placed on the OBJT_DEVICE objects that it would
choose to map, for example, that the first page must be aligned on a 2
or 4MB physical boundary and that the size of the mapping must be a
multiple of the large page size, were enough to avoid triggering the
bug for drivers like ksyms. However, one side effect of testing the
OBJT_DEVICE object's pages for physical contiguity is that a dubious
difference between pmap_object_init_pt() and the standard path for
mapping device pages, i.e., vm_fault(), has been eliminated.
Previously, pmap_object_init_pt() would only instantiate the first
PG_FICTITIOUS page being mapped because it never examined the rest.
Now, however, pmap_object_init_pt() uses the new function
vm_object_populate() to instantiate them all (in order to support
testing their physical contiguity). These pages need to be
instantiated for the mechanism that I have prototyped for
automatically maintaining the consistency of the PAT settings across
multiple mappings, particularly, amd64's direct mapping, to work.
(Translation: This change is also being made to support jhb@'s work on
the Nvidia feature requests.)

Discussed with: jhb@


# 188348 08-Feb-2009 alc

Eliminate OBJ_NEEDGIANT. After r188331, OBJ_NEEDGIANT's only use is by a
redundant assertion in vm_fault().

Reviewed by: kib


# 179159 20-May-2008 ups

Allow VM object creation in ufs_lookup. (If vfs.vmiodirenable is set)
Directory IO without a VM object will store data in 'malloced' buffers
severely limiting caching of the data. Without this change VM objects for
directories are only created on an open() of the directory.
TODO: Inline test if VM object already exists to avoid locking/function call
overhead.

Tested by: kris@
Reviewed by: jeff@
Reported by: David Filo


# 174940 27-Dec-2007 alc

Add a list of reservations to the vm object structure.

Recycle the vm object's "pg_color" field to represent the color of the
first virtual page address at which the object is mapped instead of the
color of the object's first physical page. Since an object may not be
mapped, introduce a flag "OBJ_COLORED" that indicates whether "pg_color"
is valid.


# 172317 25-Sep-2007 alc

Change the management of cached pages (PQ_CACHE) in two fundamental
ways:

(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.

This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.

Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.

(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.

Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.

Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)


# 160540 21-Jul-2006 alc

Eliminate OBJ_WRITEABLE. It hasn't been used in a long time.


# 154694 22-Jan-2006 alc

Make vm_object_vndeallocate() static. The external calls to it were
eliminated in ufs/ffs/ffs_vnops.c's revision 1.125.


# 145826 03-May-2005 jeff

- Add a new object flag "OBJ_NEEDSGIANT". We set this flag if the
underlying vnode requires Giant.
- In vm_fault only acquire Giant if the underlying object has NEEDSGIANT
set.
- In vm_object_shadow inherit the NEEDSGIANT flag from the backing object.


# 144501 01-Apr-2005 jhb

- Change the vm_mmap() function to accept an objtype_t parameter specifying
the type of object represented by the handle argument.
- Allow vm_mmap() to map device memory via cdev objects in addition to
vnodes and anonymous memory. Note that mmaping a cdev directly does not
currently perform any MAC checks like mapping a vnode does.
- Unbreak the DRM getbufs ioctl by having it call vm_mmap() directly on the
cdev the ioctl is acting on rather than trying to find a suitable vnode
to map from.

Reviewed by: alc, arch@


# 139825 07-Jan-2005 imp

/* -> /*- for license, minor formatting changes


# 138538 08-Dec-2004 alc

With the removal of kern/uipc_jumbo.c and sys/jumbo.h,
vm_object_allocate_wait() is not used. Remove it.


# 137297 06-Nov-2004 alc

Move a call to wakeup() from vm_object_terminate() to vnode_pager_dealloc()
because this call is only needed to wake threads that slept when they
discovered a dead object connected to a vnode. To eliminate unnecessary
calls to wakeup() by vnode_pager_dealloc(), introduce a new flag,
OBJ_DISCONNECTWNT.

Reviewed by: tegge@


# 132627 25-Jul-2004 alc

Make the code and comments for vm_object_coalesce() consistent.


# 132550 22-Jul-2004 alc

- Change uma_zone_set_obj() to call kmem_alloc_nofault() instead of
kmem_alloc_pageable(). The difference between these is that an errant
memory access to the zone will be detected sooner with
kmem_alloc_nofault().

The following changes serve to eliminate the following lock-order
reversal reported by witness:

1st 0xc1a3c084 vm object (vm object) @ vm/swap_pager.c:1311
2nd 0xc07acb00 swap_pager swhash (swap_pager swhash) @ vm/swap_pager.c:1797
3rd 0xc1804bdc vm object (vm object) @ vm/uma_core.c:931

There is no potential deadlock in this case. However, witness is unable
to recognize this because vm objects used by UMA have the same type as
ordinary vm objects. To remedy this, we make the following changes:

- Add a mutex type argument to VM_OBJECT_LOCK_INIT().
- Use the mutex type argument to assign distinct types to special
vm objects such as the kernel object, kmem object, and UMA objects.
- Define a static swap zone object for use by UMA. (Only static
objects are assigned a special mutex type.)


# 127961 06-Apr-2004 imp

Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core


# 122349 09-Nov-2003 alc

- Rename vm_map_clean() to vm_map_sync(). This better reflects the fact
that msync(2) is its only caller.
- Migrate the parts of the old vm_map_clean() that examined the internals
of a vm object to a new function vm_object_sync() that is implemented in
vm_object.c. At the same, introduce the necessary vm object locking so
that vm_map_sync() and vm_object_sync() can be called without Giant.

Reviewed by: tegge


# 121907 02-Nov-2003 alc

- Introduce and use vm_object_reference_locked(). Unlike
vm_object_reference(), this function must not be used to reanimate dead
vm objects. This restriction simplifies locking.

Reviewed by: tegge


# 121821 31-Oct-2003 alc

- Revert a part of revision 1.73: Make vm_object_set_flag() an inline
function. This function is so trivial that inlining reduces the size
of the kernel.


# 118838 12-Aug-2003 alc

Reduce the size of the vm object on 64-bit architectures by moving
a field within the structure.


# 115853 04-Jun-2003 alc

- Add VM_OBJECT_TRYLOCK().


# 115818 04-Jun-2003 alc

- Add vm object locking to vm_object_deallocate(). (Still more
changes are required.)
- Remove special-case macros for kmem object locking. They are
no longer used.


# 115655 01-Jun-2003 alc

Change kernel_object and kmem_object to (&kernel_object_store) and
(&kmem_object_store), respectively. This allows the address of these
objects to be resolved at link-time rather than run-time.


# 115127 18-May-2003 alc

Reduce the size of a vm object by converting its shadow list from a TAILQ
to a LIST.

Approved by: re (rwatson)


# 114145 28-Apr-2003 alc

- Define VM_OBJECT_LOCK_INIT().
- Avoid repeatedly mtx_init()ing and mtx_destroy()ing the vm_object's lock
using UMA's uminit callback, in this case, vm_object_zinit().


# 114074 26-Apr-2003 alc

- Convert vm_object_pip_wait() from using tsleep() to msleep().
- Make vm_object_pip_sleep() static.
- Lock the vm_object when performing vm_object_pip_wait().


# 113838 22-Apr-2003 alc

Add VM_OBJECT_LOCKED().


# 113721 19-Apr-2003 alc

- Lock the vm_object when performing vm_object_pip_wakeupn().
- Assert that the vm_object lock is held in vm_object_pip_wakeupn().
- Add a new macro VM_OBJECT_LOCK_ASSERT().


# 113445 13-Apr-2003 alc

Add new macros for locking and unlocking a vm object.


# 111937 06-Mar-2003 alc

Remove ENABLE_VFS_IOOPT. It is a long unfinished work-in-progress.

Discussed on: arch@


# 111467 25-Feb-2003 alc

Fuse two #ifdefs with identical conditions.


# 108413 29-Dec-2002 alc

- Remove vm_object_init2(). It is unused.
- Add a mtx_destroy() to vm_object_collapse(). (This allows a bzero()
to migrate from _vm_object_allocate() to vm_object_zinit(), where it
will be performed less often.)


# 108117 20-Dec-2002 alc

Add a mutex to struct vm_object. Initialize and destroy that mutex
at appropriate times. For the moment, the mutex is only used on
the kmem_object.


# 108101 19-Dec-2002 alc

Remove the hash_rand field from struct vm_object. As of revision 1.215 of
vm/vm_page.c, it is unused.


# 106871 13-Nov-2002 alc

Remove dead code that hasn't been needed since the demise of share maps
in various revisions of vm/vm_map.c between 1.148 and 1.153.


# 105407 18-Oct-2002 dillon

Replace the vm_page hash table with a per-vmobject splay tree. There should
be no major change in performance from this change at this time but this
will allow other work to progress: Giant lock removal around VM system
in favor of per-object mutexes, ranged fsyncs, more optimal COMMIT rpc's for
NFS, partial filesystem syncs by the syncer, more optimal object flushing,
etc. Note that the buffer cache is already using a similar splay tree
mechanism.

Note that a good chunk of the old hash table code is still in the tree.
Alan or I will remove it prior to the release if the new code does not
introduce unsolvable bugs, else we can revert more easily.

Submitted by: alc (this is Alan's code)
Approved by: re


# 103732 21-Sep-2002 alc

Reduce namespace pollution.

Submitted by: bde


# 102349 24-Aug-2002 alc

o Resurrect vm_object_lock() and vm_object_unlock() from revision 1.19.
(For now, they simply acquire and release Giant.)


# 98849 26-Jun-2002 ken

At long last, commit the zero copy sockets code.

MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes.

ti.4: Update the ti(4) man page to include information on the
TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options,
and also include information about the new character
device interface and the associated ioctls.

man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated
links.

jumbo.9: New man page describing the jumbo buffer allocator
interface and operation.

zero_copy.9: New man page describing the general characteristics of
the zero copy send and receive code, and what an
application author should do to take advantage of the
zero copy functionality.

NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS,
TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.

conf/files: Add uipc_jumbo.c and uipc_cow.c.

conf/options: Add the 5 options mentioned above.

kern_subr.c: Receive side zero copy implementation. This takes
"disposable" pages attached to an mbuf, gives them to
a user process, and then recycles the user's page.
This is only active when ZERO_COPY_SOCKETS is turned on
and the kern.ipc.zero_copy.receive sysctl variable is
set to 1.

uipc_cow.c: Send side zero copy functions. Takes a page written
by the user and maps it copy on write and assigns it
kernel virtual address space. Removes copy on write
mapping once the buffer has been freed by the network
stack.

uipc_jumbo.c: Jumbo disposable page allocator code. This allocates
(optionally) disposable pages for network drivers that
want to give the user the option of doing zero copy
receive.

uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are
enabled if ZERO_COPY_SOCKETS is turned on.

Add zero copy send support to sosend() -- pages get
mapped into the kernel instead of getting copied if
they meet size and alignment restrictions.

uipc_syscalls.c: Un-staticize some of the sf* functions so that they
can be used elsewhere. (uipc_cow.c)

if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid
calling malloc() with M_WAITOK. Return an error if
the M_NOWAIT malloc fails.

The ti(4) driver and the wi(4) driver, at least, call
this with a mutex held. This causes witness warnings
for 'ifconfig -a' with a wi(4) or ti(4) board in the
system. (I've only verified for ti(4)).

ip_output.c: Fragment large datagrams so that each segment contains
a multiple of PAGE_SIZE amount of data plus headers.
This allows the receiver to potentially do page
flipping on receives.

if_ti.c: Add zero copy receive support to the ti(4) driver. If
TI_PRIVATE_JUMBOS is not defined, it now uses the
jumbo(9) buffer allocator for jumbo receive buffers.

Add a new character device interface for the ti(4)
driver for the new debugging interface. This allows
(a patched version of) gdb to talk to the Tigon board
and debug the firmware. There are also a few additional
debugging ioctls available through this interface.

Add header splitting support to the ti(4) driver.

Tweak some of the default interrupt coalescing
parameters to more useful defaults.

Add hooks for supporting transmit flow control, but
leave it turned off with a comment describing why it
is turned off.

if_tireg.h: Change the firmware rev to 12.4.11, since we're really
at 12.4.11 plus fixes from 12.4.13.

Add defines needed for debugging.

Remove the ti_stats structure, it is now defined in
sys/tiio.h.

ti_fw.h: 12.4.11 firmware.

ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13,
and my header splitting patches. Revision 12.4.13
doesn't handle 10/100 negotiation properly. (This
firmware is the same as what was in the tree previously,
with the addition of header splitting support.)

sys/jumbo.h: Jumbo buffer allocator interface.

sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to
indicate that the payload buffer can be thrown away /
flipped to a userland process.

socketvar.h: Add prototype for socow_setup.

tiio.h: ioctl interface to the character portion of the ti(4)
driver, plus associated structure/type definitions.

uio.h: Change prototype for uiomoveco() so that we'll know
whether the source page is disposable.

ufs_readwrite.c: Update for new prototype of uiomoveco().

vm_fault.c: In vm_fault(), check to see whether we need to do a page
based copy on write fault.

vm_object.c: Add a new function, vm_object_allocate_wait(). This
does the same thing that vm_object_allocate() does, except
that it gives the caller the opportunity to specify whether
it should wait on the uma_zalloc() of the object structure.

This allows vm objects to be allocated while holding a
mutex. (Without generating WITNESS warnings.)

vm_object_allocate() is implemented as a call to
vm_object_allocate_wait() with the malloc flag set to
M_WAITOK.

vm_object.h: Add prototype for vm_object_allocate_wait().

vm_page.c: Add page-based copy on write setup, clear and fault
routines.

vm_page.h: Add page based COW function prototypes and variable in
the vm_page structure.

Many thanks to Drew Gallatin, who wrote the zero copy send and receive
code, and to all the other folks who have tested and reviewed this code
over the years.


# 98824 25-Jun-2002 iedowse

Complete the initial set of VM changes required to support full
64-bit file sizes. This step simply addresses the remaining overflows,
and does not attempt to optimise performance. The details are:

o Use a 64-bit type for the vm_object `size' and the size argument
to vm_object_allocate().
o Use the correct type for index variables in dev_pager_getpages(),
vm_object_page_clean() and vm_object_page_remove().
o Avoid an overflow in the i386 pmap_object_init_pt().


# 97753 02-Jun-2002 alc

o Migrate vm_map_split() from vm_map.c to vm_object.c, renaming it
to vm_object_split(). Its interface should still be changed
to resemble vm_object_shadow().


# 96087 05-May-2002 alc

o Move vm_freeze_copyopts() from vm_map.{c,h} to vm_object.{c,h}. It's plainly
an operation on a vm_object and belongs in the latter place.


# 96042 04-May-2002 alc

o Make _vm_object_allocate() and vm_object_allocate() callable
without holding Giant.
o Begin documenting the trivial cases of the locking protocol
on vm_object.


# 95112 20-Apr-2002 alc

Reintroduce locking on accesses to vm_object_list.


# 85517 25-Oct-2001 dillon

Implement kern.maxvnodes. Adjusting kern.maxvnodes now actually has a
real effect.

Optimize vfs_msync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. Improves looping case by 500%.

Optimize ffs_sync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. This makes a couple of assumptions,
which I believe are ok, in regards to vnode stability when the mount list
mutex is held. Improves looping case by 500%.

(more optimization work is needed on top of these fixes)

MFC after: 1 week


# 80705 31-Jul-2001 jake

Oops. Last commit to vm_object.c should have got these files too.

Remove the use of atomic ops to manipulate vm_object and vm_page flags.
Giant is required here, so they are superfluous.

Discussed with: dillon


# 79248 04-Jul-2001 dillon

Change inlines back into mainline code in preparation for mutexing. Also,
most of these inlines had been bloated in -current far beyond their
original intent. Normalize prototypes and function declarations to be ANSI
only (half already were). And do some general cleanup.

(kernel size also reduced by 50-100K, but that isn't the prime intent)


# 79224 04-Jul-2001 dillon

With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.


# 76827 18-May-2001 alfred

Introduce a global lock for the vm subsystem (vm_mtx).

vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb


# 75523 15-Apr-2001 alfred

use TAILQ_FOREACH, fix a comment's location


# 65103 26-Aug-2000 obrien

Make the arguments match the functionality of the functions.


# 63973 28-Jul-2000 peter

Minor cleanups:
- remove unused variables (fix warnings)
- use a more consistent ansi style rather than a mixture
- remove dead #if 0 code and declarations


# 61081 29-May-2000 dillon

This is a cleanup patch to Peter's new OBJT_PHYS VM object type
and sysv shared memory support for it. It implements a new
PG_UNMANAGED flag that has slightly different characteristics
from PG_FICTITIOUS.

A new sysctl, kern.ipc.shm_use_phys has been added to enable the
use of physically-backed sysv shared memory rather than swap-backed.
Physically backed shm segments are not tracked with PV entries,
allowing programs which use a large shm segment as a rendezvous
point to operate without eating an insane amount of KVM in the
PV entry management. Read: Oracle.

Peter's OBJT_PHYS object will also allow us to eventually implement
page-table sharing and/or 4MB physical page support for such segments.
We're half way there.


# 60938 26-May-2000 jake

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


# 60833 23-May-2000 jake

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


# 60757 21-May-2000 peter

Checkpoint of a new physical memory backed object type, that does not
have pv_entries. This is intended for very special circumstances,
eg: a certain database that has a 1GB shm segment mapped into 300
processes. That would consume 2GB of kvm just to hold the pv_entries
alone. This would not be used on systems unless the physical ram was
available, as it's not pageable.

This is a work-in-progress, but is a useful and functional checkpoint.
Matt has got some more fixes for it that will be committed soon.

Reviewed by: dillon


# 55206 29-Dec-1999 peter

Change #ifdef KERNEL to #ifdef _KERNEL in the public headers. "KERNEL"
is an application space macro and the applications are supposed to be free
to use it as they please (but cannot). This is consistent with the other
BSDs, which made this change quite some time ago. More commits to come.


# 54467 12-Dec-1999 dillon

Add MAP_NOSYNC feature to mmap(), and MADV_NOSYNC and MADV_AUTOSYNC to
madvise().

This feature prevents the update daemon from gratuitously flushing
dirty pages associated with a mapped file-backed region of memory. The
system pager will still page the memory as necessary and the VM system
will still be fully coherent with the filesystem. Modifications made
by other means to the same area of memory, for example by write(), are
unaffected. The feature works on a page-granularity basis.

MAP_NOSYNC allows one to use mmap() to share memory between processes
without incurring any significant filesystem overhead, putting it in
the same performance category as SysV Shared memory and anonymous memory.

Reviewed by: julian, alc, dg
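
As a rough illustration of the MAP_NOSYNC usage described above, the sketch below maps a file shared and asks the system not to flush its dirty pages gratuitously. The helper name map_shared_nosync and the fallback definition of MAP_NOSYNC are assumptions made for this example; MAP_NOSYNC itself is a FreeBSD extension, so on other systems the fallback makes the call a plain MAP_SHARED mapping.

```c
/* Ask glibc for the BSD/POSIX declarations; this has no effect on FreeBSD. */
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* MAP_NOSYNC is FreeBSD-specific; define it away elsewhere so the sketch builds. */
#ifndef MAP_NOSYNC
#define MAP_NOSYNC 0
#endif

/*
 * Hypothetical helper: create (or open) the file at `path`, size it to
 * `len` bytes, and map it shared with MAP_NOSYNC.  Returns NULL on failure.
 */
void *map_shared_nosync(const char *path, size_t len)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_NOSYNC, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : p;
}
```

A caller would write through the returned pointer and release it with munmap(); whether and when the data reaches disk is then left to the pager rather than the update daemon.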


# 52568 27-Oct-1999 alc

Shrink "struct vm_object" by not spending a full 32 bits
on "objtype_t".


# 50477 27-Aug-1999 peter

$Id$ -> $FreeBSD$


# 49338 01-Aug-1999 alc

Move the memory access behavior information provided by madvise
from the vm_object to the vm_map.

Submitted by: dillon


# 48833 16-Jul-1999 alc

Remove vm_object::last_read. It is used by the old swap pager, but
not by the new one, i.e., vm/swap_pager.c rev 1.108.

Reviewed by: dillon@backplane.com


# 48738 10-Jul-1999 alc

Change the data type used to represent page color in the vm_object
to be the same as that used in the vm_page. (This change also
shrinks the vm_object.)


# 48059 20-Jun-1999 alc

Remove vm_object::cache_count and vm_object::wired_count. They are
not used. (Nor is there any planned use by John who introduced them.)

Reviewed by: "John S. Dyson" <toor@dyson.iquest.net>


# 48022 19-Jun-1999 alc

Remove some unused function and variable declarations.


# 43777 08-Feb-1999 dillon

Revamp vm_object_[q]collapse(). Despite the complexity of this patch,
no major operational changes were made. The three core object->memq loops
were moved into a single inline procedure and various operational
characteristics of the collapse function were documented.


# 42973 21-Jan-1999 dillon

The vm_object structure is now somewhat smaller due to the removal
of most of the swap-pager-specific fields, the removal of the id,
and the removal of paging_offset.

A new inline, vm_object_pip_wakeupn() has been added to subtract an
arbitrary number n from the paging_in_progress count and then wakeup
waiters as necessary. n may be 0, resulting in a 'flash'.
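
The behaviour described for vm_object_pip_wakeupn() can be modelled in userland roughly as follows. This is a sketch under stated assumptions, not the kernel inline: struct obj_model, its fields, and the wakeups_issued counter (standing in for a call to wakeup()) are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userland stand-in for the object state the message mentions. */
struct obj_model {
    int  paging_in_progress; /* outstanding paging operations */
    bool waiter_present;     /* someone is sleeping on the count */
    int  wakeups_issued;     /* stands in for calls to wakeup() */
};

/*
 * Subtract n from the in-progress count, then wake any waiter if the
 * count is zero.  With n == 0 nothing is subtracted, but a waiter is
 * still woken when the count is already zero (the 'flash' case).
 */
void pip_wakeupn(struct obj_model *o, int n)
{
    assert(n >= 0 && n <= o->paging_in_progress);
    o->paging_in_progress -= n;
    if (o->waiter_present && o->paging_in_progress == 0) {
        o->waiter_present = false;
        o->wakeups_issued++;
    }
}
```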


# 42957 21-Jan-1999 dillon

This is a rather large commit that encompasses the new swapper,
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough in regards to code
cleanup issues.

Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>


# 38517 24-Aug-1998 dfr

Change various syscalls to use size_t arguments instead of u_int.

Add some overflow checks to read/write (from bde).

Change all modifications to vm_page::flags, vm_page::busy, vm_object::flags
and vm_object::paging_in_progress to use operations which are not
interruptible.

Reviewed by: Bruce Evans <bde@zeta.org.au>


# 38135 06-Aug-1998 dfr

Protect all modifications to paging_in_progress with splvm(). The i386
managed to avoid corruption of this variable by luck (the compiler used a
memory read-modify-write instruction which wasn't interruptible) but other
architectures cannot.

With this change, I am now able to 'make buildworld' on the alpha (sfx: the
crowd goes wild...)
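
The hazard described above is a read-modify-write on a shared counter being split by an interrupt. The commit addressed it with splvm(); the sketch below is not that method but a modern userland analogue using C11 atomics, shown only to make the idea concrete. The function names pip_add and pip_sub are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Add n to the counter as one indivisible read-modify-write and
 * return the new value.
 */
int pip_add(atomic_int *count, int n)
{
    return atomic_fetch_add(count, n) + n;
}

/*
 * Subtract n from the counter as one indivisible read-modify-write and
 * return the new value.
 */
int pip_sub(atomic_int *count, int n)
{
    return atomic_fetch_sub(count, n) - n;
}
```

A plain `count += n` may compile to separate load, add, and store instructions on some architectures, which is the window the commit message refers to.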


# 35694 04-May-1998 dyson

Fix the shm panic. I mistakenly used the shadow_count to keep the object
from being split, and instead added an OBJ_NOSPLIT.


# 35497 29-Apr-1998 dyson

Tighten up management of memory and swap space during map allocation,
deallocation cycles. This should provide a measurable improvement
on swap and memory allocation on loaded systems. It is unlikely a
complete solution. Also, provide more map info with procfs.
Chuck Cranor spurred on this improvement.


# 34206 07-Mar-1998 dyson

This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifested themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistency, and as a side-effect broke
the vfs.ioopt code. This code might have been committed separately, but
almost everything is interrelated.

1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitous buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups associated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistency problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify its status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and its descendants to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.


# 33817 25-Feb-1998 dyson

Fix page prezeroing for SMP, and fix some potential paging-in-progress
hangs. The paging-in-progress diagnosis was a result of Tor Egge's
excellent detective work.
Submitted by: Partially from Tor Egge.


# 33109 05-Feb-1998 dyson

1) Start using a cleaner and more consistent page allocator instead
of the various ad-hoc schemes.
2) When bringing in UPAGES, the pmap code needs to do another vm_page_lookup.
3) When appropriate, set the PG_A or PG_M bits a-priori to both avoid some
processor errata, and to minimize redundant processor updating of page
tables.
4) Modify pmap_protect so that it can only remove permissions (as it
originally supported.) The additional capability is not needed.
5) Streamline read-only to read-write page mappings.
6) For pmap_copy_page, don't enable write mapping for source page.
7) Correct and clean-up pmap_incore.
8) Cluster initial kern_exec pagein.
9) Removal of some minor lint from kern_malloc.
10) Correct some ioopt code.
11) Remove some dead code from the MI swapout routine.
12) Correct vm_object_deallocate (to remove backing_object ref.)
13) Fix dead object handling, that had problems under heavy memory load.
14) Add minor vm_page_lookup improvements.
15) Some pages are not in objects, and make sure that the vm_page.c can
properly support such pages.
16) Add some more page deficit handling.
17) Some minor code readability improvements.


# 32937 31-Jan-1998 dyson

Change the busy page mgmt, so that when pages are freed, they
MUST be PG_BUSY. It is bogus to free a page that isn't busy,
because it is in a state of being "unavailable" when being
freed. The additional advantage is that the page_remove code
has a better cross-check that the page should be busy and
unavailable for other use. There were some minor problems
with the collapse code, and this plugs those subtle "holes."

Also, the vfs_bio code wasn't checking correctly for PG_BUSY
pages. I am going to develop a more consistent scheme for
grabbing pages, busy or otherwise. For now, we are stuck
with the current morass.


# 32702 22-Jan-1998 dyson

VM level code cleanups.

1) Start using TSM.
Struct procs continue to point to upages structure, after being freed.
Struct vmspace continues to point to pte object and kva space for kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where
it allows us to avoid unneeded restart overhead during traversals, where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c
code. Use generation counts, get rid of unneeded collapse operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.

This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)

This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)


# 32585 17-Jan-1998 dyson

Tie up some loose ends in vnode/object management. Remove an unneeded
config option in pmap. Fix a problem with faulting in pages. Clean-up
some loose ends in swap pager memory management.

The system should be much more stable, but not all subtle bugs are fixed yet.


# 32286 06-Jan-1998 dyson

Make our v_usecount vnode reference count work identically to the
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_object's are no longer freed gratuitously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.

When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.

Vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.

A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.

Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon as reasonably possible.


# 32071 28-Dec-1997 dyson

Lots of improvements, including restructuring the caching and management
of vnodes and objects. There are some metadata performance improvements
that come along with this. There are also a few prototypes added when
the need is noticed. Changes include:

1) Cleaning up vref, vget.
2) Removal of the object cache.
3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore.
4) Correct some missing LK_RETRY's in vn_lock.
5) Correct the page range in the code for msync.

Be gentle, and please give me feedback asap.


# 31853 19-Dec-1997 dyson

Some performance improvements, and code cleanups (including changing our
expensive OFF_TO_IDX to btoc whenever possible.)


# 29653 21-Sep-1997 dyson

Change the M_NAMEI allocations to use the zone allocator. This change
plus the previous changes to use the zone allocator decrease the usage
of malloc by half. The Zone allocator will be upgradeable to be able
to use per CPU-pools, and has more intelligent usage of SPLs. Additionally,
it has reasonable stats gathering capabilities, while making most calls
inline.


# 28991 01-Sep-1997 bde

Some staticized variables were still declared to be extern.


# 27899 04-Aug-1997 dyson

Get rid of the ad-hoc memory allocator for vm_map_entries, in lieu of
a simple, clean zone type allocator. This new allocator will also be
used for machine dependent pmap PV entries.


# 22975 22-Feb-1997 peter

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 21673 14-Jan-1997 jkh

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 18973 17-Oct-1996 dyson

Clean up the rundown of the object backing a vnode. This should fix
NFS problems associated with forcible dismounts.


# 18942 15-Oct-1996 bde

Removed nested include of <sys/proc.h> from <vm/vm_object.h> and fixed
the one place that depended on it. wakeup() is now prototyped in
<sys/systm.h> so that it is normally visible.

Added nested include of <sys/queue.h> in <vm/vm_object.h>. The queue
macros are a more fundamental prerequisite for <vm/vm_object.h> than
the wakeup prototype and previously happened to be included by
namespace pollution from <sys/proc.h> or elsewhere.


# 18526 28-Sep-1996 dyson

Reviewed by:
Submitted by:
Obtained from:


# 18169 08-Sep-1996 dyson

Addition of page coloring support. Various levels of coloring are afforded.
The default level works with minimal overhead, but one can also enable
full, efficient use of a 512K cache. (Parameters can be generated
to support arbitrary cache sizes also.)


# 17761 21-Aug-1996 dyson

Even though this looks like it, this is not a complex code change.
The interface into the "VMIO" system has changed to be more consistent
and robust. Essentially, it is now no longer necessary to call vn_open
to get merged VM/Buffer cache operation, and exceptional conditions
such as merged operation of VBLK devices is simpler and more correct.

This code corrects a potentially large set of problems including the
problems with ktrace output and loaded systems, file create/deletes,
etc.

Most of the changes to NFS are cosmetic and name changes, eliminating
a layer of subroutine calls. The direct calls to vput/vrele have
been re-instituted for better cross platform compatibility.

Reviewed by: davidg


# 15819 19-May-1996 dyson

Initial support for mincore and madvise. Both are almost fully
supported, except madvise does not page in with MADV_WILLNEED, and
MADV_DONTNEED doesn't force dirty pages out.


# 14316 02-Mar-1996 dyson

1) Eliminate unnecessary bzero of UPAGES.
2) Eliminate unnecessary copying of pages during/after forks.
3) Add user map simplification.


# 12820 14-Dec-1995 phk

Another mega commit to staticize things.


# 12767 11-Dec-1995 dyson

Changes to support 1Tb filesizes. Pages are now named by an
(object,index) pair instead of (object,offset) pair.
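
The gain from naming pages by index rather than byte offset can be sketched as below: the page index is the byte offset shifted right by the page shift, so an index of a given width spans a file PAGE_SIZE times larger than an offset of the same width would. The names demo_off_to_idx and demo_idx_to_off, the 4 KiB page size, and the integer widths are assumptions for illustration, not the definitions used in the tree.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Convert a byte offset within an object to a page index. */
uint32_t demo_off_to_idx(uint64_t off)
{
    return (uint32_t)(off >> DEMO_PAGE_SHIFT);
}

/* Convert a page index back to the byte offset of the start of that page. */
uint64_t demo_idx_to_off(uint32_t idx)
{
    return (uint64_t)idx << DEMO_PAGE_SHIFT;
}
```

Under these assumptions a 1 TiB offset (2^40) maps to page index 2^28, which fits comfortably in 32 bits even though the offset itself does not.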


# 12662 07-Dec-1995 dg

Untangled the vm.h include file spaghetti.


# 12110 05-Nov-1995 dyson

Greatly simplify the msync code. Eliminate complications in vm_pageout
for msyncing. Remove a bug that manifests itself primarily on NFS
(the dirty range on the buffers is not set on msync.)


# 10345 26-Aug-1995 bde

Change vm_object_print() to have the correct number and type of args
for a ddb command.


# 9759 29-Jul-1995 bde

Eliminate sloppy common-style declarations. There should be none left for
the LINT configuration.


# 9548 16-Jul-1995 dg

1) Merged swpager structure into vm_object.
2) Changed swap_pager internal interfaces to cope w/#1.
3) Eliminated object->copy as we no longer have copy objects.
4) Minor stylistic changes.


# 9507 13-Jul-1995 dg

NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!

Much needed overhaul of the VM system. Included in this first round of
changes:

1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".

2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
essentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.

3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.

4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.

5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.

6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.

7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.

8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.

9) Some almost useless debugging code removed.

10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure essentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.

11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.

12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).

13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.

14) (dyson) Sharing maps have been removed. Their marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and
maintainability. (As were most all of these changes)

TODO:

1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.

2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.

3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.

4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.

5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).


# 8216 02-May-1995 dg

Changed object hash list to be a list rather than a tailq. This saves
space for the hash list buckets and is a little faster. The features
of tailq aren't needed. Increased the size of the object hash table
to improve performance. In the future, this will be changed so that
the table is sized dynamically.


# 7695 09-Apr-1995 dg

Changes from John Dyson and myself:

Fixed remaining known bugs in the buffer IO and VM system.

vfs_bio.c:
Fixed some race conditions and locking bugs. Improved performance
by removing some (now) unnecessary code and fixing some broken
logic.
Fixed process accounting of # of FS outputs.
Properly handle NFS interrupts (B_EINTR).

(various)
Replaced calls to clrbuf() with calls to an optimized routine
called vfs_bio_clrbuf().

(various FS sync)
Sync out modified vnode_pager backed pages.

ffs_vnops.c:
Do two passes: Sync out file data first, then indirect blocks.

vm_fault.c:
Fixed deadly embrace caused by acquiring locks in the wrong order.

vnode_pager.c:
Changed to use buffer I/O system for writing out modified pages. This
should fix the problem with the modification date previous not getting
updated. Also dramatically simplifies the code. Note that this is
going to change in the future and be implemented via VOP_PUTPAGES().

vm_object.c:
Fixed a pile of bugs related to cleaning (vnode) objects. The performance
of vm_object_page_clean() is terrible when dealing with huge objects,
but this will change when we implement a binary tree to keep the object
pages sorted.

vm_pageout.c:
Fixed broken clustering of pageouts. Fixed race conditions and other
lockup style bugs in the scanning of pages. Improved performance.


# 7246 22-Mar-1995 dg

Removed unused fifth argument to vm_object_page_clean(). Fixed bug with
VTEXT not always getting cleared when it is supposed to. Added check to
make sure that vm_object_remove() isn't called with a NULL pager or for
a pager for an OBJ_INTERNAL object (neither of which will be on the hash
list). Clear OBJ_CANPERSIST if we decide to terminate it because of no
resident pages.


# 7204 20-Mar-1995 dg

Added a new boolean argument to vm_object_page_clean that causes it to
only toss out clean pages if TRUE.


# 7185 20-Mar-1995 dg

Fixed "objde1" hang. It was caused by a "&" where an "&&" belonged in the
expression that decides if a wakeup should occur.
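
The commit message does not show the faulty expression, so the following is only a hypothetical reconstruction of the class of bug it describes: a bitwise "&" between a masked flag and a comparison result. The flag value DEMO_WANT_WAKEUP and both function names are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_WANT_WAKEUP 0x2 /* hypothetical flag bit, deliberately not bit 0 */

/*
 * Buggy form: the masked flag evaluates to 0x2 and the comparison to 1,
 * so the bitwise AND of the two is 0 and the wakeup never happens.
 */
bool should_wake_buggy(int flags, int in_progress)
{
    return ((flags & DEMO_WANT_WAKEUP) & (in_progress == 0)) != 0;
}

/* Fixed form: logical AND tests each operand for non-zero independently. */
bool should_wake_fixed(int flags, int in_progress)
{
    return (flags & DEMO_WANT_WAKEUP) && (in_progress == 0);
}
```

With the flag set and nothing in progress, the buggy form reports that no wakeup is needed, which would leave a sleeper waiting indefinitely.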


# 7090 16-Mar-1995 bde

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# 7029 12-Mar-1995 bde

Move a kernel inline function inside `#ifdef KERNEL' so that including
<vm/vm.h> doesn't cause warnings about nonexistent functions called
by the inline function. Clean up the formatting of the function.


# 7016 12-Mar-1995 dg

Deleted vm_object_setpager().


# 6897 04-Mar-1995 jkh

Remove a gratuitous cast.


# 6816 01-Mar-1995 dg

Various changes from John and myself that do the following:

New functions created - vm_object_pip_wakeup and pagedaemon_wakeup that
are used to reduce the actual number of wakeups.
New function vm_page_protect which is used in conjunction with some new
page flags to reduce the number of calls to pmap_page_protect.
Minor changes to reduce unnecessary spl nesting.
Rewrote vm_page_alloc() to improve readability.
Various other mostly cosmetic changes.


# 6623 22-Feb-1995 dg

Removed object locking code (it was a left over from an abortion that
was done a month or so ago).

Submitted by: John Dyson


# 6618 22-Feb-1995 dg

Only do object paging_in_progress wakeups if someone is waiting on this
condition.

Submitted by: John Dyson


# 6129 02-Feb-1995 dg

swap_pager.c:
Fixed long standing bug in freeing swap space during object collapses.
Fixed 'out of space' messages from printing out too often.
Modified to use new kmem_malloc() calling convention.
Implemented an additional stat in the swap pager struct to count the
amount of space allocated to that pager. This may be removed at some
point in the future.
Minimized unnecessary wakeups.

vm_fault.c:
Don't try to collect fault stats on 'swapped' processes - there aren't
any upages to store the stats in.
Changed read-ahead policy (again!).

vm_glue.c:
Be sure to gain a reference to the process's map before swapping.
Be sure to lose it when done.

kern_malloc.c:
Added the ability to specify if allocations are at interrupt time or
are 'safe'; this affects what types of pages can be allocated.

vm_map.c:
Fixed a variety of map lock problems; there's still a lurking bug that
will eventually bite.

vm_object.c:
Explicitly initialize the object fields rather than bzeroing the struct.
Eliminated the 'rcollapse' code and folded its functionality into the
"real" collapse routine.
Moved an object_unlock() so that the backing_object is protected in
the qcollapse routine.
Make sure nobody fools with the backing_object when we're destroying it.
Added some diagnostic code which can be called from the debugger that
looks through all the internal objects and makes certain that they
all belong to someone.

vm_page.c:
Fixed a rather serious logic bug that would result in random system
crashes. Changed pagedaemon wakeup policy (again!).

vm_pageout.c:
Removed unnecessary page rotations on the inactive queue.
Changed the number of pages to explicitly free to just free_reserved
level.

Submitted by: John Dyson


# 5841 24-Jan-1995 dg

Added ability to detect sequential faults and DTRT. (swap_pager.c)
Added hook for pmap_prefault() and use symbolic constant for new third
argument to vm_page_alloc() (vm_fault.c, various)
Changed the way that upages and page tables are held. (vm_glue.c)
Fixed architectural flaw in allocating pages at interrupt time that was
introduced with the merged cache changes. (vm_page.c, various)
Adjusted some algorithms to achieve better paging performance and to
accommodate the fix for the architectural flaw mentioned above. (vm_pageout.c)
Fixed pbuf handling problem, changed policy on handling read-behind page.
(vnode_pager.c)

Submitted by: John Dyson


# 5455 09-Jan-1995 dg

These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.

The majority of the merged VM/cache work is by John Dyson.

The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.

vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.

vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.

vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.

vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.

vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.

pmap.c vm_map.c
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.

vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.

proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.

swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.

machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.

machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.

ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.

Submitted by: John Dyson and David Greenman


# 4203 06-Nov-1994 dg

Added support for starting the experimental "vmdaemon" system process.
Enabled via REL2_1.

Added support for doing object collapses "on the fly". Enabled via REL2_1a.

Improved object collapses so that they can happen in more cases. Improved
sensing of modified pages to fix an apparent race condition and improve
clustered pageout opportunities. Fixed an "oops" with not restarting page
scan after a potential block in vm_pageout_clean() (not doing this can result
in strange behavior in some cases).

Submitted by: John Dyson & David Greenman


# 1817 02-Aug-1994 dg

Added $Id$


# 1542 24-May-1994 rgrimes

This commit was generated by cvs2svn to compensate for changes in r1541,
which included commits to RCS files with non-trunk default branches.


# 1541 24-May-1994 rgrimes

BSD 4.4 Lite Kernel Sources