History log of /freebsd-current/sys/vm/vnode_pager.c
Revision Date Author Comments
# b068bb09 07-Jan-2024 Konstantin Belousov <kib@FreeBSD.org>

Add vnode_pager_clean_{a,}sync(9)

Bump __FreeBSD_version for ZFS use.

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D43356


# ed1a88a3 09-Jan-2024 Konstantin Belousov <kib@FreeBSD.org>

vnode_pager_generic_putpages(): rename maxblksz local to max_offset

Requested by: markj
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D43358


# bdb46c21 08-Jan-2024 Konstantin Belousov <kib@FreeBSD.org>

vnode_pager_generic_putpages(): correctly handle clean block at EOF

The loop 'skip clean blocks' checking for the clean blocks in the dirty
pages might end up setting the in_hole to true when exactly at EOF at
the middle of the block, without advancing the prev_offset value. Then
the next block is not dirty, and next_offset is clipped back to poffset
+ maxsize, equal to prev_offset, failing the assertion.

Instead of asserting prev_offset < next_offset, we must skip the write.

Reported by: asomers
PR: 276191
Reviewed by: alc, markj
Tested by: asomers
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D43358


# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 28f957b8 24-Mar-2023 Konstantin Belousov <kib@FreeBSD.org>

vnode_pager_input: return runningbufspace back

Both vnode_pager_input_smlfs() and vnode_pager_generic_getpages()
increment runningbufspace, but also both delegate io completion handling
on the pbuf to either plain bdone() or filesystem-specific strategy
routine. Accidentally, for e.g. UFS it is g_vfs_strategy()/g_vfs_done().
The later calls bufdone() which handles runningbufspace reclamation.

For plain bdone() io done handler, nothing would return
accounted b_runningbufspace back. Do it in the new
helper vnode_pager_input_bdone(), as well as in
vnode_pager_generic_getpages_done() explicitly.

Note that potential multiple calls to runningbufwakeup() for the same
pbuf or buf completion are safe. runningbufwakeup() clears accounting
for the buffer, so second and later calls are nop.

The problem was found due to tarfs using small vnode pager input but not
g_vfs_strategy().

Reported by: des
Reviewed by: markj, sjg
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39263


# f45feecf 22-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add vn_getsize

getattr is very expensive and in important cases only gets called to get
the size. This can be optimized with a dedicated routine which obtains
that statistic.

As a step towards that goal make size-only consumers use a dedicated
routine.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D37885


# b8ebd99a 13-Apr-2022 John Baldwin <jhb@FreeBSD.org>

vm: Use __diagused for variables only used in KASSERT().


# de2e1529 04-Aug-2021 Ka Ho Ng <khng@FreeBSD.org>

Add vnode_pager_purge_range(9) KPI

This KPI is created in addition to the existing vnode_pager_setsize(9)
KPI. The KPI is intended for file systems that are able to turn a range
of file into sparse range, also known as hole-punching.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D27194


# 00a3fe96 07-May-2021 Konstantin Belousov <kib@FreeBSD.org>

vm_object_kvme_type(): reimplement by embedding kvme_type into pagerops

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30168


# d474440a 03-May-2021 Konstantin Belousov <kib@FreeBSD.org>

Constify vm_pager-related virtual tables.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070


# 192112b7 30-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

Add pgo_getvp method

This eliminates the staircase of conditions in vm_map_entry_set_vnode_text().

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070


# c23c555b 30-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

Add pgo_mightbedirty method

Used to implement vm_object_mightbedirty()

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070


# 180bcaa4 30-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

vm_pager: add pgo_set_writeable_dirty method

specialized for swap and vnode pagers, and used to implement
vm_object_set_writeable_dirty().

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070


# a771bf74 17-Mar-2021 Bryan Drewery <bdrewery@FreeBSD.org>

Remove unused obj variable missed in r354870.

Sponsored by: Dell EMC


# cd853791 27-Nov-2020 Konstantin Belousov <kib@FreeBSD.org>

Make MAXPHYS tunable. Bump MAXPHYS to 1M.

Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.

Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allow several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*). Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.

Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initialize maxphys. Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight. Some drivers, which use MAXPHYS to size embeded structures,
get private MAXPHYS-like constant; their convertion is out of scope
for this work.

Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, where either submitted by, or based on changes by mav.

Suggested by: mav (*)
Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27225


# f9cc8410 18-Sep-2020 Eric van Gyzen <vangyzen@FreeBSD.org>

vm_ooffset_t is now unsigned

vm_ooffset_t is now unsigned. Remove some tests for negative values,
or make other adjustments accordingly.

Reported by: Coverity
Reviewed by: kib markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D26214


# c3aa3bf9 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

vm: clean up empty lines in .c and .h files


# 7ad2a82d 18-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the error parameter from vn_isdisk, introduce vn_isdisk_error

Most consumers pass NULL.


# 419e5698 16-Aug-2020 Konstantin Belousov <kib@FreeBSD.org>

Atomically update vm_object vnp_size, where atomic is available.

This will be used later, where it matters on 32bit arches.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D25968


# efec381d 04-Aug-2020 Mark Johnston <markj@FreeBSD.org>

Remove most lingering references to the page lock in comments.

Finish updating comments to reflect new locking protocols introduced
over the past year. In particular, vm_page_lock is now effectively
unused.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25868


# 1bd12a3b 17-Jul-2020 Chuck Silvers <chs@FreeBSD.org>

Fix vnode_pager handling of read ahead/behind pages when a disk read fails.
Rather than marking the read ahead/behind pages valid even though they were
not initialized, free them using the new function vm_page_free_invalid().

Reviewed by: markj, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D25430


# c3dbadc1 17-Jul-2020 Chuck Silvers <chs@FreeBSD.org>

Revert my change from r361855 in favor of a better fix.

Reviewed by: markj, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D25430


# bd7d64f5 05-Jun-2020 Chuck Silvers <chs@FreeBSD.org>

Don't mark pages as valid if reading the contents from disk fails.
Instead, just skip marking pages valid if the read fails. Future
attempts to access such pages will notice that they are not marked valid
and try to read them from disk again.

Reviewed by: kib, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D25138


# abfdf767 30-Mar-2020 Konstantin Belousov <kib@FreeBSD.org>

VOP_GETPAGES_ASYNC(): consistently call iodone() callback in case of error.

Reviewed by: glebius, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24038


# cafbf0c6 19-Feb-2020 Warner Losh <imp@FreeBSD.org>

Don't convert all lower-layer errors to EIO.

Don't convert all lower layer errors to EIO. Instead, pass the actual error up
the stack. This will allow the upper layers that look for ENXIO to react
properly to that signal from the lower layers and, for UFS, unmount the
filesystem.

Reviewed by: kib@
Differential Revision: https://reviews.freebsd.org/D23755


# 65252dc9 19-Feb-2020 Warner Losh <imp@FreeBSD.org>

Don't spam the console with an additional, and useless, error message.

There's no need to spam the console with this error message. If there's an I/O
error, the disk/cam driver will report it at the lower levels. If that's an
actual problem, the upper layers will report that.

Reviewed by: kib@
Differential Revision: https://reviews.freebsd.org/D23756


# f1fa1ba3 03-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

Fix up various vnode-related asserts which did not dump the used vnode


# d6e13f3b 19-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Don't hold the object lock while calling getpages.

The vnode pager does not want the object lock held. Moving this out allows
further object lock scope reduction in callers. While here add some missing
paging in progress calls and an assert. The object handle is now protected
explicitly with pip.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23033


# 9c83ff2d 19-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

It has not been possible to recursively terminate a vnode object for some time
now. Eliminate the dead code that supports it.

Approved by: kib, markj
Differential Revision: https://reviews.freebsd.org/D22908


# a314aba8 11-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vm: add missing CLTFLAG_MPSAFE annotations

This covers all vm/* files.


# b249ce48 03-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the mostly unused flags argument from VOP_UNLOCK

Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427


# abd80ddb 08-Dec-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: introduce v_irflag and make v_type smaller

The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.

v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.

Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D22715


# a67d5408 26-Nov-2019 Jeff Roberson <jeff@FreeBSD.org>

Use atomics in more cases for object references. We now can completely
omit the object lock if we are above a certain threshold. Hold only a
single vnode reference when the vnode object has any ref > 0. This
allows us to only lock the object and vnode on 0-1 and 1-0 transitions.

Differential Revision: https://reviews.freebsd.org/D22452


# 7f935055 19-Nov-2019 Jeff Roberson <jeff@FreeBSD.org>

Remove unnecessary object locking from the vnode pager. Recent changes to
busy/valid/dirty locking make these acquires redundant.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D22186


# 51df5321 29-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

Use atomics and a shared object lock to protect the object reference count.

Certain consumers still need to guarantee a stable reference so we can not
switch entirely to atomics yet. Exclusive lock holders can still modify
and examine the refcount without using the ref api.

Reviewed by: kib
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21598


# 2f81c92e 23-Oct-2019 Mark Johnston <markj@FreeBSD.org>

Check for bogus_page in vnode_pager_generic_getpages_done().

We now assert that a page is busy when updating its validity-tracking
state, but bogus_page is not busied during a getpages operation.

Reported by: syzkaller
Reviewed by: alc, kib
Discussed with: jeff
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22124


# 5b87ecc6 22-Oct-2019 Konstantin Belousov <kib@FreeBSD.org>

Assert that vnode_pager_setsize() is called with the vnode exclusively locked

except for filesystems that set the MNTK_VMSETSIZE_BUG, Set the flag for ZFS.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D21883


# 208b81bb 22-Oct-2019 Konstantin Belousov <kib@FreeBSD.org>

Add VV_VMSIZEVNLOCK flag.

The flag specifies that vm_fault() handler should check the vnode'
vm_object size under the vnode lock. It is converted into the object'
OBJ_SIZEVNLOCK flag in vnode_pager_alloc().

Tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D21883


# 0012f373 14-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

(4/6) Protect page valid with the busy lock.

Atomics are used for page busy and valid state when the shared busy is
held. The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21594


# fe7bcbaf 03-Sep-2019 Kyle Evans <kevans@FreeBSD.org>

vm pager: writemapping accounting for OBJT_SWAP

Currently writemapping accounting is only done for vnode_pager which does
some accounting on the underlying vnode.

Extend this to allow accounting to be possible for any of the pager types.
New pageops are added to update/release writecount that need to be
implemented for any pager wishing to do said accounting, and we implement
these methods now for both vnode_pager (unchanged) and swap_pager.

The primary motivation for this is to allow other systems with OBJT_SWAP
objects to check if their objects have any write mappings and reject
operations with EBUSY if so. posixshm will be the first to do so in order to
reject adding write seals to the shmfd if any writable mappings exist.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D21456


# 6470c8d3 29-Aug-2019 Konstantin Belousov <kib@FreeBSD.org>

Rework v_object lifecycle for vnodes.

Current implementation of vnode_create_vobject() and
vnode_destroy_vobject() is written so that it prepared to handle the
vm object destruction for live vnode. Practically, no filesystems use
this, except for some remnants that were present in UFS till today.
One of the consequences of that model is that each filesystem must
call vnode_destroy_vobject() in VOP_RECLAIM() or earlier, as result
all of them get rid of the v_object in reclaim.

Move the call to vnode_destroy_vobject() to vgonel() before
VOP_RECLAIM(). This makes v_object stable: either the object is NULL,
or it is valid vm object till the vnode reclamation. Remove code from
vnode_create_vobject() to handle races with the parallel destruction.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21412


# 783a68aa 25-Aug-2019 Konstantin Belousov <kib@FreeBSD.org>

Move OBJT_VNODE specific code from vm_object_terminate() to
vnode_destroy_vobject().

Reviewed by: alc, jeff (previous version), markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D21357


# 4153054a 19-Aug-2019 Jeff Roberson <jeff@FreeBSD.org>

Permit vm_pager_has_page() to run with a shared lock. Introduce
VM_OBJECT_DROP/VM_OBJECT_PICKUP to handle functions that are called with
uncertain lock state.

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21310


# 6cb46f39 17-Jul-2019 Alan Somers <asomers@FreeBSD.org>

Revert r346608

That change was intended to be cosmetic, but it inadvertenly caused
vnode_pager_setsize to discard cached indirect blocks and extended
attributes on UFS during truncation. The reason is because those blocks
have negative LBNs, which get sign-cast to positive VM indexes.

Reported by: kib
Sponsored by: The FreeBSD Foundation


# 0cab71bc 06-Jul-2019 Doug Moore <dougm@FreeBSD.org>

Fix style(9) violations involving division by PAGE_SIZE.

Reviewed by: alc
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20847


# daec9284 21-May-2019 Conrad Meyer <cem@FreeBSD.org>

Include ktr.h in more compilation units

Similar to r348026, exhaustive search for uses of CTRn() and cross reference
ktr.h includes. Where it was obvious that an OS compat header of some kind
included ktr.h indirectly, .c files were left alone. Some of these files
clearly got ktr.h via header pollution in some scenarios, or tinderbox would
not be passing prior to this revision, but go ahead and explicitly include it
in files using it anyway.

Like r348026, these CUs did not show up in tinderbox as missing the include.

Reported by: peterj (arm64/mp_machdep.c)
X-MFC-With: r347984
Sponsored by: Dell EMC Isilon


# 78022527 05-May-2019 Konstantin Belousov <kib@FreeBSD.org>

Switch to use shared vnode locks for text files during image activation.

kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.

The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount. To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal

The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change. vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP. vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.

nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.

On the text vnode' vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuit the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.

Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19923


# 3746a90a 23-Apr-2019 Alan Somers <asomers@FreeBSD.org>

Slightly simplify vnode_pager_setsize

No functional change intended.

Sponsored by: The FreeBSD Foundation


# 40a51684 25-Feb-2019 Jason A. Harmening <jah@FreeBSD.org>

Fix incorrect assertion in vnode_pager_generic_getpages()

Reviewed by: kib, glebius
MFC after: 1 week


# 66fb0b1a 15-Feb-2019 Gleb Smirnoff <glebius@FreeBSD.org>

For 32-bit machines rollback the default number of vnode pager pbufs
back to the lever before r343030. For 64-bit machines reduce it slightly,
too. Together with r343030 I bumped the limit up to the value we use at
Netflix to serve 100 Gbit/s of sendfile traffic, and it probably isn't a
good default.

Provide a loader tunable to change vnode pager pbufs count. Document it.


# 756a5412 14-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Allocate pager bufs from UMA instead of 80-ish mutex protected linked list.

o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many
pbufs are we going to have set.
In various subsystems that are going to utilize pbufs create private zones
via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(),
and sets a limit on created zone. After startup preallocate pbufs according
to requirements of all pbuf zones.

Subsystems that used to have a private limit with old allocator now have
private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
swap, vnode pager.

The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9),
aio(4). They should have their private limits, but changing that is out of
scope of this commit.

o Fetch tunable value of kern.nswbuf from init_param2() and while here move
NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only
this option.
Default values aren't touched by this commit, but they probably should be
reviewed wrt to modern hardware.

This change removes a tight bottleneck from sendfile(2) operation, that
uses pbufs in vnode pager. Other pagers also would benefit from faster
allocation.

Together with: gallatin
Tested by: pho


# e5818a53 28-Mar-2018 Jeff Roberson <jeff@FreeBSD.org>

Implement several enhancements to NUMA policies.

Add a new "interleave" allocation policy which stripes pages across
domains with a stride or width keeping contiguity within a multi-page
region.

Move the kernel to the dedicated numbered cpuset #2 making it possible
to assign kernel threads and memory policy separately from user. This
also eliminates the need for the complicated interrupt binding code.

Add a sysctl API for viewing and manipulating domainsets. Refactor some
of the cpuset_t manipulation code using the generic bitset type so that
it can be used for both. This probably belongs in a dedicated subr file.

Attempt to improve the include situation.

Reviewed by: kib
Discussed with: jhb (cpuset parts)
Tested by: pho (before review feedback)
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14839


# e2068d0b 06-Feb-2018 Jeff Roberson <jeff@FreeBSD.org>

Use per-domain locks for vm page queue free. Move paging control from
global to per-domain state. Protect reservations with the free lock
from the domain that they belong to. Refactor to make vm domains more
of a first class object.

Reviewed by: markj, kib, gallatin
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14000


# 938cdc42 02-Feb-2018 Konstantin Belousov <kib@FreeBSD.org>

On pageout, in vnode generic pager, for partially dirty page, only
clear dirty bits for completely invalid blocks.

Otherwise we might not write out the last chunk that is shorter than
512 bytes, if the file end is not aligned on disk block boundary.
This become important after the r324794.

PR: 225586
Reported by: tris_vern@hotmail.com
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# df57947f 18-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

spdx: initial adoption of licensing ID tags.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.

Initially, only tag files that use BSD 4-Clause "Original" license.

RelNotes: yes
Differential Revision: https://reviews.freebsd.org/D13133


# b3d4ab66 20-Oct-2017 Konstantin Belousov <kib@FreeBSD.org>

Take the vm object lock in read mode in vnode_generic_putpages().

Only upgrade it to write mode if we need to clear dirty bits of the
partially valid page after EOF.

Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks


# 05877a85 20-Oct-2017 Konstantin Belousov <kib@FreeBSD.org>

Do not overwrite clean blocks on pageout.

If filesystem block size is less than the page size, it is possible
that the page-out run contains partially clean pages. E.g., the chunk
of the page might be bdwrite()-ed, or some thread performed bwrite()
on a buffer which references a chunk of the paged out page. As
result, the assertion added in r319975, which checked that all pages
in the run are dirty, does not hold on such filesystems.

One solution is to remove the assert, but it is undesirable, because
we do overwrite the valid on-disk content. I cannot provide a scenario
where such write would corrupt the file data, but I do not like it on
principle. Another, in my opinion proper, solution is to only write
parts of the pages still marked dirty. The patch implements this, it
skips clean blocks and only writes the dirty block runs.

Note that due to clustering, write one page might clean other pages in
the run, so the next write range must be calculated only after the
current range is written out.

More, due to a possible invalidation, and the fact that the object
lock is dropped and reacquired before the checks, it is possible that
the whole page-out pages run appears to consist of only clean pages.
For this reason, it is impossible to assert that there is some work
for the pageout method to do (i.e. assert that there is at least one
dirty page in the run). But such clearing can only occur due to
invalidation, and not due to a parallel write, because we own the
vnode lock exclusive.

Reported by: fsu
In collaboration with: pho
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Differential revision: https://reviews.freebsd.org/D12668


# 555b7bb4 26-Jul-2017 Konstantin Belousov <kib@FreeBSD.org>

Mark pages after EOF as clean after pageout.

Suppose that a file on NFS has partially filled last page, and this
page is dirty. NFS VOP_PAGEOUT() method only marks the the page clean
up to the block of the last written byte, leaving other blocks dirty.
Also any page which erronously exists in the vnode vm_object past EOF
is also left marked as dirty.

With the introduction of the buf-cache coherent pager, each pass of
syncer over the object with such page results in creation of B_DELWRI
buffer due to VOP_WRITE() call. This buffer is noted on next syncer
pass, which results e.g. a visible manifestation of shutdown never
finishing vnode sync. Note that before buf-cache coherency commit, a
dirty page might left never synced to server if a partial writes
occur.

Fix this by clearing dirty bits after EOF. Only blocks of the partial
page which are completely after EOF are marked clean, to avoid
possible user data loss.

Reported by: mav
Reviewed by: alc, markj
Tested by: mav, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D11697


# e6c44f65 15-Jun-2017 Konstantin Belousov <kib@FreeBSD.org>

Some minor improvements to vnode_pager_generic_putpages().
- Add asserts that the pages to write are dirty. The last page, if
partially written, is only required to be dirty, while completely
written pages should have all dirty bit set.
- Use uintmax_t to print vm_page pindexes.
- Use NULL instead of casted zero.
- Remove if () test which duplicated the loop ending condition.
- Miscellaneous style fixes.

Reviewed by: alc, markj (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 83c9dea1 17-Apr-2017 Gleb Smirnoff <glebius@FreeBSD.org>

- Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter
in place. To do per-cpu stats, convert all fields that previously were
maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
before we have set up UMA and we can do counter_u64_alloc(), provide an
early counter mechanism:
o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
so that at early stages of boot, before counters are allocated we already
point to a counter that can be safely written to.
o For sparc64 that required a whole dummy pcpu[MAXCPU] array.

Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.

This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html

Reviewed by: kib, gallatin, marius, lidl
Differential Revision: https://reviews.freebsd.org/D10156


# 65b9599a 05-Apr-2017 Konstantin Belousov <kib@FreeBSD.org>

Extract calculation of ioflags from the vm_pager_putpages flags into a
helper.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
X-Differential revision: https://reviews.freebsd.org/D10241


# 3dbb0ca6 05-Apr-2017 Konstantin Belousov <kib@FreeBSD.org>

Some style fixes for vnode_pager_generic_putpages(), in the local
declaration block.

Reviewed by: markj (as part of the larger patch)
Tested by: pho (as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
X-Differential revision: https://reviews.freebsd.org/D10241


# 4f56243a 12-Jan-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Fix the contiguity once more.


# 1e0c121f 04-Jan-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Fix assertion that checks that pages are consecutive to properly
handle bogus_page insertion(s).


# 6ff51a36 31-Dec-2016 Mateusz Guzik <mjg@FreeBSD.org>

Use vrefact in vnode_pager_alloc.


# 99e6e193 23-Nov-2016 Mark Johnston <markj@FreeBSD.org>

Release laundered vnode pages to the head of the inactive queue.

The swap pager enqueues laundered pages near the head of the inactive queue
to avoid another trip through LRU before reclamation. This change adds
support for this behaviour to the vnode pager and makes use of it in UFS and
ext2fs. Some ioflag handling is consolidated into a common subroutine so
that this support can be easily extended to other filesystems which make use
of the buffer cache. No changes are needed for ZFS since its putpages
routine always undirties the pages before returning, and the laundry
thread requeues the pages appropriately in this case.

Reviewed by: alc, kib
Differential Revision: https://reviews.freebsd.org/D8589


# bba39b9a 22-Nov-2016 Alan Cox <alc@FreeBSD.org>

Remove PG_CACHED-related fields from struct vmmeter, because they are no
longer used. More precisely, they are always zero because the code that
decremented and incremented them no longer exists.

Bump __FreeBSD_version to mark this change.

Reviewed by: kib, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8583


# e48b82bd 17-Nov-2016 Gleb Smirnoff <glebius@FreeBSD.org>

- If caller specifies readbehind and readahead that together with count
doesn't fit into a buf, then trim readbehind and readahead evenly. If
rbehind was limited by the previous BMAP, then roundup its trim to
block size.
- Add KASSERT to check that b_blkno has proper offset from original
blkno returned by BMAP. [1]
- Add KASSERT to check that pages in buf are consecutive.

Reviewed by: kib
Submitted by: kib [1]


# 7667839a 15-Nov-2016 Alan Cox <alc@FreeBSD.org>

Remove most of the code for implementing PG_CACHED pages. (This change does
not remove user-space visible fields from vm_cnt or all of the references to
cached pages from comments. Those changes will come later.)

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8497


# dcc0ff5a 19-Oct-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Fix incorrect assertion that could miss overflows.

Reviewed by: kib


# 90880a1b 05-Jul-2016 Konstantin Belousov <kib@FreeBSD.org>

Clarify the vnode_destroy_vobject() logic handling for already terminated
objects.

Assert that there is no new waiters for the already terminated objects.
Old waiters should have been notified by the termination calling
vnode_pager_dealloc() (old/new are with regard of the lock acquisition
interval).

Only clear the vp->v_object for the case of already terminated object,
since other branches call vnode_pager_dealloc(), which should clear
the pointer. Assert this.

Tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (gjb)


# 2a339d9e 17-May-2016 Konstantin Belousov <kib@FreeBSD.org>

Add implementation of robust mutexes, hopefully close enough to the
intention of the POSIX IEEE Std 1003.1TM-2008/Cor 1-2013.

A robust mutex is guaranteed to be cleared by the system upon either
thread or process owner termination while the mutex is held. The next
mutex locker is then notified about inconsistent mutex state and can
execute (or abandon) corrective actions.

The patch mostly consists of small changes here and there, adding
neccessary checks for the inconsistent and abandoned conditions into
existing paths. Additionally, the thread exit handler was extended to
iterate over the userspace-maintained list of owned robust mutexes,
unlocking and marking as terminated each of them.

The list of owned robust mutexes cannot be maintained atomically
synchronous with the mutex lock state (it is possible in kernel, but
is too expensive). Instead, for the duration of lock or unlock
operation, the current mutex is remembered in a special slot that is
also checked by the kernel at thread termination.

Kernel must be aware about the per-thread location of the heads of
robust mutex lists and the current active mutex slot. When a thread
touches a robust mutex for the first time, a new umtx op syscall is
issued which informs about location of lists heads.

The umtx sleep queues for PP and PI mutexes are split between
non-robust and robust.

Somewhat unrelated changes in the patch:
1. Style.
2. The fix for proper tdfind() call use in umtxq_sleep_pi() for shared
pi mutexes.
3. Removal of the userspace struct pthread_mutex m_owner field.
4. The sysctl kern.ipc.umtx_vnode_persistent is added, which controls
the lifetime of the shared mutex associated with a vnode' page.

Reviewed by: jilles (previous version, supposedly the objection was fixed)
Discussed with: brooks, Martin Simmons <martin@lispworks.com> (some aspects)
Tested by: pho
Sponsored by: The FreeBSD Foundation


# 763df3ec 02-May-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/vm: minor spelling fixes in comments.

No functional change.


# ff64a90e 27-Dec-2015 Konstantin Belousov <kib@FreeBSD.org>

Add missed relpbuf() for a smallfs page-in.

Reported by: Shawn Webb
Tested by: pho
Sponsored by: The FreeBSD Foundation


# b0cd2017 16-Dec-2015 Gleb Smirnoff <glebius@FreeBSD.org>

A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().

o With new KPI consumers can request contiguous ranges of pages, and
unlike before, all pages will be kept busied on return, like it was
done before with the 'reqpage' only. Now the reqpage goes away. With
new interface it is easier to implement code protected from race
conditions.

Such arrayed requests for now should be preceeded by a call to
vm_pager_haspage() to make sure that request is possible. This
could be improved later, making vm_pager_haspage() obsolete.

Strenghtening the promises on the business of the array of pages
allows us to remove such hacks as swp_pager_free_nrpage() and
vm_pager_free_nonreq().

o New KPI accepts two integer pointers that may optionally point at
values for read ahead and read behind, that a pager may do, if it
can. These pages are completely owned by pager, and not controlled
by the caller.

This shifts the UFS-specific readahead logic from vm_fault.c, which
should be file system agnostic, into vnode_pager.c. It also removes
one VOP_BMAP() request per hard fault.

Discussed with: kib, alc, jeff, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix


# 9af50b01 22-Nov-2015 Konstantin Belousov <kib@FreeBSD.org>

Record proper commit message for r291157.

The r289895 revision did not accounted for the block containing the
requested page, when calculating the run of pages. Include the pages
before/after the requested page, that fit into the reqblock, into the
calculation.

Noted by: glebius
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 4586820a 22-Nov-2015 Konstantin Belousov <kib@FreeBSD.org>

Noted by: glebius
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 09c837b8 20-Nov-2015 Gleb Smirnoff <glebius@FreeBSD.org>

Remove remnants of the old NFS from vnode pager.

Reviewed by: kib
Sponsored by: Netflix


# eac91e32 24-Oct-2015 Konstantin Belousov <kib@FreeBSD.org>

Reduce the amount of calls to VOP_BMAP() made from the local vnode
pager. It is enough to execute VOP_BMAP() once to obtain both the
disk block address for the requested page, and the before/after limits
for the contiguous run. The clipping of the vm_page_t array passed to
the vnode_pager_generic_getpages() and the disk address for the first
page in the clipped array can be deduced from the call results.

While there, remove some noise (like if (1) {...}) and adjust nearby
code.

Reviewed by: alc
Discussed with: glebius
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks


# fade8dd7 23-Jul-2015 Jeff Roberson <jeff@FreeBSD.org>

Refactor unmapped buffer address handling.
- Use pointer assignment rather than a combination of pointers and
flags to switch buffers between unmapped and mapped. This eliminates
multiple flags and generally simplifies the logic.
- Eliminate b_saveaddr since it is only used with pager bufs which have
their b_data re-initialized on each allocation.
- Gather up some convenience routines in the buffer cache for
manipulating buf space and buf malloc space.
- Add an inline, buf_mapped(), to standardize checks around unmapped
buffers.

In collaboration with: mlaier
Reviewed by: kib
Tested by: pho (many small revisions ago)
Sponsored by: EMC / Isilon Storage Division


# 9cddade7 10-May-2015 Konstantin Belousov <kib@FreeBSD.org>

Satisfy vm_object uma zone destructor requirements after r282660 when
vnode object creation raced.

Reported by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation


# d2596d17 06-May-2015 Gleb Smirnoff <glebius@FreeBSD.org>

Fix the KASSERT and improve wording in r282426.

Submitted by: alc


# 84d31376 04-May-2015 Gleb Smirnoff <glebius@FreeBSD.org>

Fix arithmetical bug in vnode_pager_haspage(). The check against object size
should be done not with the number of pages in the first block, but with the
overall number of pages. While here, add KASSERT that makes sure that BMAP
doesn't return completely irrelevant blocks.

Reviewed by: kib
Tested by: pho
Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# f6d6b5e2 30-Mar-2015 Gleb Smirnoff <glebius@FreeBSD.org>

Catch up on r271387 and remove unused parameter from
VOP_GETPAGES_ASYNC().


# 3d653db0 21-Mar-2015 Alan Cox <alc@FreeBSD.org>

Introduce vm_object_color() and use it in mmap(2) to set the color of
named objects to zero before the virtual address is selected. Previously,
the color setting was delayed until after the virtual address was
selected. In rtld, this delay effectively prevented the mapping of a
shared library's code section using superpages. Now, for example, we see
the first 1 MB of libc's code on armv6 mapped by a superpage after we've
gotten through the initial cold misses that bring the first 1 MB of code
into memory. (With the page clustering that we perform on read faults,
this happens quickly.)

Differential Revision: https://reviews.freebsd.org/D2013
Reviewed by: jhb, kib
Tested by: Svatopluk Kraus (armv6)
MFC after: 6 weeks


# 4d6481a4 17-Mar-2015 Gleb Smirnoff <glebius@FreeBSD.org>

o Enhance vm_pager_free_nonreq() function:
- Allow to call the function with vm object lock held.
- Allow to specify reqpage that doesn't match any page in the region,
meaning freeing all pages.
o Utilize the new function in couple more places in vnode pager.

Reviewed by: alc, kib
Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 41c895a8 16-Mar-2015 Gleb Smirnoff <glebius@FreeBSD.org>

Provide a comment explaining r279688.

Suggested by: alc


# 2c0cb026 10-Mar-2015 Gleb Smirnoff <glebius@FreeBSD.org>

Fix function name in comment.


# 73e9030e 06-Mar-2015 Gleb Smirnoff <glebius@FreeBSD.org>

- In vnode_pager_generic_getpages() use different free counters for
synchronous and asynchronous requests. The latter can saturate the
I/O and we do not want them to affect regular paging.
- Allocate the pbuf at the very beginning of the function, so that
if we are low on certain kind of pbufs don't even proceed to BMAP,
but sleep.

Reviewed by: kib
Sponsored by: Nginx, Inc.
Sponsored by: Netflix


# 1bb5ad63 24-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

We already have "int i" in this scope.

Submitted by: alc


# 90effb23 22-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Merge from projects/sendfile:

o Provide a new VOP_GETPAGES_ASYNC(), which works like VOP_GETPAGES(), but
doesn't sleep. It returns immediately, and will execute the I/O done handler
function that must be supplied as argument.
o Provide VOP_GETPAGES_ASYNC() for the FFS, which uses vnode_pager.
o Extend pagertab to support pgo_getpages_async method, and implement this
method for vnode_pager.

Reviewed by: kib
Tested by: pho
Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 79f0deb9 19-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Use __func__ in KASSERTs, since the code is about to be moved to other place.

Sponsored by: Nginx, Inc.


# 2a5eef69 19-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

In vnode_pager_generic_getpages() vp->v_mount is dereferenced in the
beginning, thus can't be NULL.

Sponsored by: Nginx, Inc.


# e122dfc1 18-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Collapse three contiguous comment blocks into one. Remove historical
note about wrong assumptions 20 years ago. Use proper casing.

Sponsored by: Nginx, Inc.


# a7fecb4d 15-Sep-2014 Alan Cox <alc@FreeBSD.org>

Three improvements to vnode_pager_generic_getpages():

Eliminate an exclusive object lock acquisition and release on the expected
execution path.

Do page zeroing before the object lock is acquired rather than during the
time that the object lock is held.

Use vm_pager_free_nonreq() to eliminate duplicated code.

Reviewed by: kib
MFC after: 6 weeks
Sponsored by: EMC / Isilon Storage Division


# d15b55c5 14-Sep-2014 Konstantin Belousov <kib@FreeBSD.org>

Provide the unique implementation for the VOP_GETPAGES() method used
by ffs and ext2fs. Remove duplicated call to vm_page_zero_invalid(),
done by VOP and by vm_pager_getpages(). Use vm_pager_free_nonreq().

Reviewed by: alc (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 6 weeks (after r271596)


# 396b3e34 14-Sep-2014 Alan Cox <alc@FreeBSD.org>

Avoid an exclusive acquisition of the object lock on the expected execution
path through the NFS clients' getpages functions.

Introduce vm_pager_free_nonreq(). This function can be used to eliminate
code that is duplicated in many getpages functions. Also, in contrast to
the code that currently appears in those getpages functions,
vm_pager_free_nonreq() avoids acquiring an exclusive object lock in one
case.

Reviewed by: kib
MFC after: 6 weeks
Sponsored by: EMC / Isilon Storage Division


# 33cad9e9 14-Sep-2014 Konstantin Belousov <kib@FreeBSD.org>

Fix mis-spelling of bits and types names in the vnode_pager_putpages().
The changes should not modify the generated code.

The pager->pgo_putpages() method takes int flags as its fourth
argument, while vnode_pager_putpages() used boolean_t (which is
typedef'ed to int). The flags are from VM_PAGER_* namespace, while
vnode_pager_putpages() passed TRUE and OBJPC_SYNC to VOP_PUTPAGES(),
which both are numerically equal to VM_PAGER_PUT_SYNC.

Noted and reviewed by: alc (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 27ad26d8 09-Sep-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Remove unused arguments for VOP_GETPAGES(), VOP_PUTPAGES().


# 44f1c916 22-Mar-2014 Bryan Drewery <bdrewery@FreeBSD.org>

Rename global cnt to vm_cnt to avoid shadowing.

To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.

Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.

Exp-run revealed no ports using it directly.

No objection from: arch@
Sponsored by: EMC / Isilon Storage Division


# 7ebba1f8 20-Jan-2014 Gleb Smirnoff <glebius@FreeBSD.org>

ANSIfy declarations.

Ok'ed by: alc


# c7aebda8 09-Aug-2013 Attilio Rao <attilio@FreeBSD.org>

The soft and hard busy mechanism rely on the vm object lock to work.
Unify the 2 concept into a real, minimal, sxlock where the shared
acquisition represent the soft busy and the exclusive acquisition
represent the hard busy.
The old VPO_WANTED mechanism becames the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becames an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.

Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
and vm_page_grab are being executed. This will be very helpful
once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag

The KPI is heavilly changed this is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.

Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff, kib
Tested by: gavin, bapt (older version)
Tested by: pho, scottl


# c93dcf22 23-Jul-2013 Jeff Roberson <jeff@FreeBSD.org>

- Correct a stale comment. We don't have vclean() anymore. The work is
done by vgonel() and destroy_vobject() should only be called once from
VOP_INACTIVE().

Sponsored by: EMC / Isilon Storage Division


# 9b8851fa 28-Apr-2013 Konstantin Belousov <kib@FreeBSD.org>

Assert that the object type for the vnode' non-NULL v_object, passed
to vnode_pager_setsize(), is either OBJT_VNODE, or, if vnode was
already reclaimed, OBJT_DEAD. Note that the later is only possible
due to some filesystems, in particular, nfsiods from nfs clients, call
vnode_pager_setsize() with unlocked vnode.

More, if the object is terminated, do not perform the resizing
operation.

Reviewed by: alc
Tested by: pho, bf
MFC after: 1 week


# 6ded8427 28-Apr-2013 Konstantin Belousov <kib@FreeBSD.org>

Convert panic() into KASSERT().

Reviewed by: alc
MFC after: 1 week


# 6991ee13 20-Mar-2013 Konstantin Belousov <kib@FreeBSD.org>

Fix the logic inversion in the r248512.

Noted by: mckay


# 6ce697dc 19-Mar-2013 Konstantin Belousov <kib@FreeBSD.org>

Pass unmapped buffers for page in requests if the filesystem indicated support
for the unmapped i/o.

Sponsored by: The FreeBSD Foundation
Tested by: pho


# 70e198dd 14-Mar-2013 Konstantin Belousov <kib@FreeBSD.org>

Some style fixes.

Sponsored by: The FreeBSD Foundation


# 89f6b863 08-Mar-2013 Attilio Rao <attilio@FreeBSD.org>

Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
At this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit. Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho


# 64a3476f 26-Feb-2013 Attilio Rao <attilio@FreeBSD.org>

Remove white spaces.

Sponsored by: EMC / Isilon storage division


# 0dde287b 26-Feb-2013 Attilio Rao <attilio@FreeBSD.org>

Wrap the sleeps synchronized by the vm_object lock into the specific
macro VM_OBJECT_SLEEP().
This hides some implementation details like the usage of the msleep()
primitive and the necessity to access to the lock address directly.
For this reason VM_OBJECT_MTX() macro is now retired.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho


# 140dedb8 02-Nov-2012 Konstantin Belousov <kib@FreeBSD.org>

The r241025 fixed the case when a binary, executed from nullfs mount,
was still possible to open for write from the lower filesystem. There
is a symmetric situation where the binary could already has file
descriptors opened for write, but it can be executed from the nullfs
overlay.

Handle the issue by passing one v_writecount reference to the lower
vnode if nullfs vnode has non-zero v_writecount. Note that only one
write reference can be donated, since nullfs only keeps one use
reference on the lower vnode. Always use the lower vnode v_writecount
for the checks.

Introduce the VOP_GET_WRITECOUNT to read v_writecount, which is
currently always bypassed to the lower vnode, and VOP_ADD_WRITECOUNT
to manipulate the v_writecount value, which manages a single bypass
reference to the lower vnode. Caling the VOPs instead of directly
accessing v_writecount provide the fix described in the previous
paragraph.

Tested by: pho
MFC after: 3 weeks


# 5050aa86 22-Oct-2012 Konstantin Belousov <kib@FreeBSD.org>

Remove the support for using non-mpsafe filesystem modules.

In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by: attilio
Tested by: pho


# 877d24ac 28-Sep-2012 Konstantin Belousov <kib@FreeBSD.org>

Fix the mis-handling of the VV_TEXT on the nullfs vnodes.

If you have a binary on a filesystem which is also mounted over by
nullfs, you could execute the binary from the lower filesystem, or
from the nullfs mount. When executed from lower filesystem, the lower
vnode gets VV_TEXT flag set, and the file cannot be modified while the
binary is active. But, if executed as the nullfs alias, only the
nullfs vnode gets VV_TEXT set, and you still can open the lower vnode
for write.

Add a set of VOPs for the VV_TEXT query, set and clear operations,
which are correctly bypassed to lower vnode.

Tested by: pho (previous version)
MFC after: 2 weeks


# b6c00483 14-Aug-2012 Konstantin Belousov <kib@FreeBSD.org>

Do not leave invalid pages in the object after the short read for a
network file systems (not only NFS proper). Short reads cause pages
other then the requested one, which were not filled by read response,
to stay invalid.

Change the vm_page_readahead_finish() interface to not take the error
code, but instead to make a decision to free or to (de)activate the
page only by its validity. As result, not requested invalid pages are
freed even if the read RPC indicated success.

Noted and reviewed by: alc
MFC after: 1 week


# 1c771f92 05-Aug-2012 Konstantin Belousov <kib@FreeBSD.org>

After the PHYS_TO_VM_PAGE() function was de-inlined, the main reason
to pull vm_param.h was removed. Other big dependency of vm_page.h on
vm_param.h are PA_LOCK* definitions, which are only needed for
in-kernel code, because modules use KBI-safe functions to lock the
pages.

Stop including vm_param.h into vm_page.h. Include vm_param.h
explicitely for the kernel code which needs it.

Suggested and reviewed by: alc
MFC after: 2 weeks


# 0055cbd3 04-Aug-2012 Konstantin Belousov <kib@FreeBSD.org>

Reduce code duplication and exposure of direct access to struct
vm_page oflags by providing helper function
vm_page_readahead_finish(), which handles completed reads for pages
with indexes other then the requested one, for VOP_GETPAGES().

Reviewed by: alc
MFC after: 1 week


# db9ba578 16-Jun-2012 Attilio Rao <attilio@FreeBSD.org>

Do a more targeted check on the page cache and avoid to check the cache
pointer directly in vnode_pager_setsize() by using newly introduced
vm_page_is_cached() function.

Reviewed by: alc
MFC after: 2 weeks
X-MFC: r234039,234064


# 6031c68d 16-Jun-2012 Alan Cox <alc@FreeBSD.org>

The page flag PGA_WRITEABLE is set and cleared exclusively by the pmap
layer, but it is read directly by the MI VM layer. This change introduces
pmap_page_is_write_mapped() in order to completely encapsulate all direct
access to PGA_WRITEABLE in the pmap layer.

Aesthetics aside, I am making this change because amd64 will likely begin
using an alternative method to track write mappings, and having
pmap_page_is_write_mapped() in place allows me to make such a change
without further modification to the MI VM layer.

As an added bonus, tidy up some nearby comments concerning page flags.

Reviewed by: kib
MFC after: 6 weeks


# 1faacf5d 28-Mar-2012 Kirk McKusick <mckusick@FreeBSD.org>

Keep track of the mount point associated with a special device
to enable the collection of counts of synchronous and asynchronous
reads and writes for its associated filesystem. The counts are
displayed using `mount -v'.

Ensure that buffers used for paging indicate the vnode from
which they are operating so that counts of paging I/O operations
from the filesystem are collected.

This checkin only adds the setting of the mount point for the
UFS/FFS filesystem, but it would be trivial to add the setting
and clearing of the mount point at filesystem mount/unmount
time for other filesystems too.

Reviewed by: kib


# b47f6241 08-Mar-2012 John Baldwin <jhb@FreeBSD.org>

Add KTR_VFS traces to track modifications to a vnode's writecount.


# 84110e7e 23-Feb-2012 Konstantin Belousov <kib@FreeBSD.org>

Account the writeable shared mappings backed by file in the vnode
v_writecount. Keep the amount of the virtual address space used by
the mappings in the new vm_object un_pager.vnp.writemappings
counter. The vnode v_writecount is incremented when writemappings gets
non-zero value, and decremented when writemappings is returned to
zero.

Writeable shared vnode-backed mappings are accounted for in vm_mmap(),
and vm_map_insert() is instructed to set MAP_ENTRY_VN_WRITECNT flag on
the created map entry. During deferred map entry deallocation,
vm_map_process_deferred() checks for MAP_ENTRY_VN_WRITECOUNT and
decrements writemappings for the vm object.

Now, the writeable mount cannot be demoted to read-only while
writeable shared mappings of the vnodes from the mount point
exist. Also, execve(2) fails for such files with ETXTBUSY, as it
should be.

Noted by: tegge
Reviewed by: tegge (long time ago, early version), alc
Tested by: pho
MFC after: 3 weeks


# dc874f98 30-Nov-2011 Konstantin Belousov <kib@FreeBSD.org>

Rename vm_page_set_valid() to vm_page_set_valid_range().
The vm_page_set_valid() is the most reasonable name for the m->valid
accessor.

Reviewed by: attilio, alc


# 561cc9fc 05-Nov-2011 Konstantin Belousov <kib@FreeBSD.org>

Provide typedefs for the type of bit mask for the page bits.
Use the defined types instead of int when manipulating masks.
Supposedly, it could fix support for 32KB page size in the
machine-independend VM layer.

Reviewed by: alc
MFC after: 2 weeks


# 98601346 14-Oct-2011 John Baldwin <jhb@FreeBSD.org>

Fix a typo in a comment.


# 3407fefe 06-Sep-2011 Konstantin Belousov <kib@FreeBSD.org>

Split the vm_page flags PG_WRITEABLE and PG_REFERENCED into atomic
flags field. Updates to the atomic flags are performed using the atomic
ops on the containing word, do not require any vm lock to be held, and
are non-blocking. The vm_page_aflag_set(9) and vm_page_aflag_clear(9)
functions are provided to modify afalgs.

Document the changes to flags field to only require the page lock.

Introduce vm_page_reference(9) function to provide a stable KPI and
KBI for filesystems like tmpfs and zfs which need to mark a page as
referenced.

Reviewed by: alc, attilio
Tested by: marius, flo (sparc64); andreast (powerpc, powerpc64)
Approved by: re (bz)


# 6bbee8e2 29-Jun-2011 Alan Cox <alc@FreeBSD.org>

Add a new option, OBJPR_NOTMAPPED, to vm_object_page_remove(). Passing this
option to vm_object_page_remove() asserts that the specified range of pages
is not mapped, or more precisely that none of these pages have any managed
mappings. Thus, vm_object_page_remove() need not call pmap_remove_all() on
the pages.

This change not only saves time by eliminating pointless calls to
pmap_remove_all(), but it also eliminates an inconsistency in the use of
pmap_remove_all() versus related functions, like pmap_remove_write(). It
eliminates harmless but pointless calls to pmap_remove_all() that were being
performed on PG_UNMANAGED pages.

Update all of the existing assertions on pmap_remove_all() to reflect this
change.

Reviewed by: kib


# 9d17da3b 11-Jun-2011 Konstantin Belousov <kib@FreeBSD.org>

Fix a bug in r222586. Lock the page owner object around the modification
of the m->dirty.

Reported and tested by: nwhitehorn
Reviewed by: alc


# 031ec8c1 01-Jun-2011 Konstantin Belousov <kib@FreeBSD.org>

In the VOP_PUTPAGES() implementations, change the default error from
VM_PAGER_AGAIN to VM_PAGER_ERROR for the uwritten pages. Return
VM_PAGER_AGAIN for the partially written page. Always forward at least
one page in the loop of vm_object_page_clean().

VM_PAGER_ERROR causes the page reactivation and does not clear the
page dirty state, so the write is not lost.

The change fixes an infinite loop in vm_object_page_clean() when the
filesystem returns permanent errors for some page writes.

Reported and tested by: gavin
Reviewed by: alc, rmacklem
MFC after: 1 week


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# c8fa8709 02-Jun-2010 Alan Cox <alc@FreeBSD.org>

Minimize the use of the page queues lock for synchronizing access to the
page's dirty field. With the exception of one case, access to this field
is now synchronized by the object lock.


# c46b90e9 26-May-2010 Alan Cox <alc@FreeBSD.org>

Push down page queues lock acquisition in pmap_enter_object() and
pmap_is_referenced(). Eliminate the corresponding page queues lock
acquisitions from vm_map_pmap_enter() and mincore(), respectively. In
mincore(), this allows some additional cases to complete without ever
acquiring the page queues lock.

Assert that the page is managed in pmap_is_referenced().

On powerpc/aim, push down the page queues lock acquisition from
moea*_is_modified() and moea*_is_referenced() into moea*_query_bit().
Again, this will allow some additional cases to complete without ever
acquiring the page queues lock.

Reorder a few statements in vm_page_dontneed() so that a race can't lead
to an old reference persisting. This scenario is described in detail by a
comment.

Correct a spelling error in vm_page_dontneed().

Assert that the object is locked in vm_page_clear_dirty(), and restrict the
page queues lock assertion to just those cases in which the page is
currently writeable.

Add object locking to vnode_pager_generic_putpages(). This was the one
and only place where vm_page_clear_dirty() was being called without the
object being locked.

Eliminate an unnecessary vm_page_lock() around vnode_pager_setsize()'s call
to vm_page_clear_dirty().

Change vnode_pager_generic_putpages() to the modern-style of function
definition. Also, change the name of one of the parameters to follow
virtual memory system naming conventions.

Reviewed by: kib


# 03679e23 07-May-2010 Alan Cox <alc@FreeBSD.org>

Push down the page queues lock into vm_page_activate().


# eb00b276 06-May-2010 Alan Cox <alc@FreeBSD.org>

Eliminate page queues locking around most calls to vm_page_free().


# 2965a453 29-Apr-2010 Kip Macy <kmacy@FreeBSD.org>

On Alan's advice, rather than do a wholesale conversion on a single
architecture from page queue lock to a hashed array of page locks
(based on a patch by Jeff Roberson), I've implemented page lock
support in the MI code and have only moved vm_page's hold_count
out from under page queue mutex to page lock. This changes
pmap_extract_and_hold on all pmaps.

Supported by: Bitgravity Inc.

Discussed with: alc, jeffr, and kib


# f407eeb3 25-Feb-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r204205:
Remove write-only variable.


# d7de6e2c 22-Feb-2010 Konstantin Belousov <kib@FreeBSD.org>

Remove write-only variable.

MFC after: 3 days


# 02d65815 07-Feb-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r202529:
vunref() the vnode in vm object deallocation code for OBJT_VNODE
appropriate number of times to prevent possible vnode reference leak.


# b9f180d1 17-Jan-2010 Konstantin Belousov <kib@FreeBSD.org>

When a vnode-backed vm object is referenced, it increments the vnode
reference count, and decrements it on dereference. If referenced object
is deallocated, object type is reset to OBJT_DEAD. Consequently, all
vnode references that are owned by object references are never released.
vunref() the vnode in vm object deallocation code for OBJT_VNODE
appropriate number of times to prevent leak.

Add an assertion to the vm_pageout() to make sure that we never get
reference on the vnode but then do not execute code to release it.

In collaboration with: pho
Reviewed by: alc
MFC after: 3 weeks


# 9f80ce04 25-Jun-2009 Konstantin Belousov <kib@FreeBSD.org>

Change the type of uio_resid member of struct uio from int to ssize_t.
Note that this does not actually enable full-range i/o requests for
64 architectures, and is done now to update KBI only.

Tested by: pho
Reviewed by: jhb, bde (as part of the review of the bigger patch)


# 3364c323 23-Jun-2009 Konstantin Belousov <kib@FreeBSD.org>

Implement global and per-uid accounting of the anonymous memory. Add
rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved
for the uid.

The accounting information (charge) is associated with either map entry,
or vm object backing the entry, assuming the object is the first one
in the shadow chain and entry does not require COW. Charge is moved
from entry to object on allocation of the object, e.g. during the mmap,
assuming the object is allocated, or on the first page fault on the
entry. It moves back to the entry on forks due to COW setup.

The per-entry granularity of accounting makes the charge process fair
for processes that change uid during lifetime, and decrements charge
for proper uid when region is unmapped.

The interface of vm_pager_allocate(9) is extended by adding struct ucred *,
that is used to charge appropriate uid when allocation if performed by
kernel, e.g. md(4).

Several syscalls, among them is fork(2), may now return ENOMEM when
global or per-uid limits are enforced.

In collaboration with: pho
Reviewed by: alc
Approved by: re (kensmith)


# 3c33df62 02-Jun-2009 Alan Cox <alc@FreeBSD.org>

Correct a boundary case error in the management of a page's dirty bits by
shm_dotruncate() and vnode_pager_setsize(). Specifically, if the length of
a shared memory object or a file is truncated such that the length modulo
the page size is between 1 and 511, then all of the page's dirty bits were
cleared. Now, a dirty bit is cleared only if the corresponding block is
truncated in its entirety.


# 42eb4108 14-May-2009 Alan Cox <alc@FreeBSD.org>

Eliminate unnecessary clearing of the page's dirty mask from various
getpages functions.

Eliminate a stale comment.


# 12aa4fdc 11-May-2009 Alan Cox <alc@FreeBSD.org>

Eliminate gratuitous clearing of the page's dirty mask.


# 0d53a17b 09-May-2009 Alan Cox <alc@FreeBSD.org>

Fix a race involving vnode_pager_input_smlfs(). Specifically, in the case
that vnode_pager_input_smlfs() zeroes the page, it should not mark the page
as valid until after the page is zeroed. Otherwise, the page could be
mapped for read access (e.g., by vm_map_pmap_enter()) before the page is
zeroed. Reviewed by: tegge

Eliminate gratuitous clearing of the page's dirty mask by
vnode_pager_input_smlfs(). Instead, assert that the page is clean.
Reviewed by: tegge

Eliminate some blank lines.

Eliminate pointless calls to pmap_clear_modify() and vm_page_undirty() from
vnode_pager_input_old(). The page is not mapped. Therefore, it cannot have
any page table entries that are modified.

Eliminate an incorrect comment from vnode_pager_generic_getpages().


# 3a2cdcb0 04-May-2009 Alan Cox <alc@FreeBSD.org>

Eliminate vnode_pager_input_smlfs()'s pointless call to pmap_clear_modify().
The page can't possibly have any modified page table entries because it
isn't even mapped.


# 016a3c93 24-Apr-2009 Alan Cox <alc@FreeBSD.org>

Eliminate unnecessary calls to pmap_clear_modify(). Specifically, calling
pmap_clear_modify() on a page is pointless if that page is not mapped or
it is only mapped for read access. Instead, assert that the page is not
mapped or not mapped for write access as appropriate.

Eliminate unnecessary clearing of a page's dirty mask. Instead, assert
that the page's dirty mask is clear.


# 5bd65606 09-Mar-2009 John Baldwin <jhb@FreeBSD.org>

Adjust some variables (mostly related to the buffer cache) that hold
address space sizes to be longs instead of ints. Specifically, the follow
values are now longs: runningbufspace, bufspace, maxbufspace,
bufmallocspace, maxbufmallocspace, lobufspace, hibufspace, lorunningspace,
hirunningspace, maxswzone, maxbcache, and maxpipekva. Previously, a
relatively small number (~ 44000) of buffers set in kern.nbuf would result
in integer overflows resulting either in hangs or bogus values of
hidirtybuffers and lodirtybuffers. Now one has to overflow a long to see
such problems. There was a check for a nbuf setting that would cause
overflows in the auto-tuning of nbuf. I've changed it to always check and
cap nbuf but warn if a user-supplied tunable would cause overflow.

Note that this changes the ABI of several sysctls that are used by things
like top(1), etc., so any MFC would probably require a some gross shims
to allow for that.

MFC after: 1 month


# cb61d698 09-Feb-2009 Konstantin Belousov <kib@FreeBSD.org>

Comment out the assertion from r188321. It is not valid for nfs.

Reported by: alc


# 7b54b1a9 08-Feb-2009 Alan Cox <alc@FreeBSD.org>

Eliminate OBJ_NEEDGIANT. After r188331, OBJ_NEEDGIANT's only use is by a
redundant assertion in vm_fault().

Reviewed by: kib


# d2bf64c3 08-Feb-2009 Konstantin Belousov <kib@FreeBSD.org>

Do not sleep for vnode lock while holding map lock in vm_fault. Try to
acquire vnode lock for OBJT_VNODE object after map lock is dropped.
Because we have the busy page(s) in the object, sleeping there would
result in deadlock with vnode resize. Try to get lock without sleeping,
and, if the attempt failed, drop the state, lock the vnode, and restart
the fault handler from the start with already locked vnode.

Because the vnode_pager_lock() function is inlined in vm_fault(),
axe it.

Based on suggestion by: alc
Reviewed by: tegge, alc
Tested by: pho


# 705f0a82 08-Feb-2009 Konstantin Belousov <kib@FreeBSD.org>

Assert that vnode is exclusively locked when its vm object is resized.

Reviewed by: tegge


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 0359a12e 28-Aug-2008 Attilio Rao <attilio@FreeBSD.org>

Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread
was always curthread and totally unuseful.

Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


# 3cca4b6f 30-Jul-2008 John Baldwin <jhb@FreeBSD.org>

A few more whitespace fixes.


# 24bbc85b 30-Jul-2008 Konstantin Belousov <kib@FreeBSD.org>

The behaviour of the lockmgr going back at least to the 4.4BSD-Lite2 was
to downgrade the exclusive lock to shared one when exclusive lock owner
requested shared lock. New lockmgr panics instead.

The vnode_pager_lock function requests shared lock on the vnode backing
the OBJT_VNODE, and can be called when the current thread already holds
an exlcusive lock on the vnode. For instance, it happens when handling
page fault from the VOP_WRITE() uiomove that writes to the file, with
the faulted in page fetched from the vm object backed by the same file.
We then get the situation described above.

Verify whether the vnode is already exclusively locked by the curthread
and request recursed exclusive vnode lock instead of shared, if true.

Reported by: gallatin
Discussed with: attilio


# 11be8415 12-Jun-2008 Stephan Uphoff <ups@FreeBSD.org>

Fix vm object creation locking to allow SHARED vnode locking for vnode_create_vobject.
(Not currently used)

Noticed by: kib@


# 2ac78f0e 20-May-2008 Stephan Uphoff <ups@FreeBSD.org>

Allow VM object creation in ufs_lookup. (If vfs.vmiodirenable is set)
Directory IO without a VM object will store data in 'malloced' buffers
severely limiting caching of the data. Without this change VM objects for
directories are only created on an open() of the directory.
TODO: Inline test if VM object already exists to avoid locking/function call
overhead.

Tested by: kris@
Reviewed by: jeff@
Reported by: David Filo


# 22db15c0 13-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


# cb05b60a 09-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


# 82cfdd5a 22-Nov-2007 Alan Cox <alc@FreeBSD.org>

Remove an unnecessary call to pmap_remove_all() and the associated "XXX"
comments from vnode_pager_setsize(). This call was introduced in
revision 1.140 to address a problem that no longer exists.
Specifically, pmap_zero_page_area() has replaced a (possibly)
problematic implementation of page zeroing that was based on
vm_pager_map(), bzero(), and vm_pager_unmap().


# 0ab3c7a5 22-Oct-2007 Alan Cox <alc@FreeBSD.org>

Correct an error of omission in the reimplementation of the page
cache: vnode_pager_setsize() must handle the case where a file is
truncated to a non-page-size-aligned boundary and there is a cached
page underlying the new end of file.

Reported by: kris, tegge
Tested by: kris
MFC after: 3 days


# 57fd3d55 26-Jul-2007 Pawel Jakub Dawidek <pjd@FreeBSD.org>

When we do open, we should lock the vnode exclusively. This fixes few races:
- fifo race, where two threads assign v_fifoinfo,
- v_writecount modifications,
- v_object modifications,
- and probably more...

Discussed with: kib, ups
Approved by: re (rwatson)


# b4b70819 04-Jun-2007 Attilio Rao <attilio@FreeBSD.org>

Do proper "locking" for missing vmmeters part.
Now, we assume no more sched_lock protection for some of them and use the
distribuited loads method for vmmeter (distribuited through CPUs).

Reviewed by: alc, bde
Approved by: jeff (mentor)


# 2feb50bf 31-May-2007 Attilio Rao <attilio@FreeBSD.org>

Revert VMCNT_* operations introduction.
Probabilly, a general approach is not the better solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)


# 222d0195 18-May-2007 Jeff Roberson <jeff@FreeBSD.org>

- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts. This can be used to abstract away pcpu details but also changes
to use atomics for all counters now. This means sched lock is no longer
responsible for protecting counts in the switch routines.

Contributed by: Attilio Rao <attilio@FreeBSD.org>


# 7e2393ff 14-Oct-2006 Alan Cox <alc@FreeBSD.org>

Long ago, revision 1.22 of vm/vm_pager.h introduced a bug. Specifically,
it introduced a check after the call to file system's get pages method
that assumes that the get pages method does not change the array of pages
that is passed to it. In the case of vnode_pager_generic_getpages(),
this assumption has been incorrect. The contents of the array of pages
may be shifted by vnode_pager_generic_getpages(). Likely, the problem
has been hidden by vnode_pager_haspage() limiting the set of pages that
are passed to vnode_pager_generic_getpages() such that a shift never
occurs.

The fix implemented herein is to adjust the pointer to the array of pages
rather than shifting the pages within the array.

MFC after: 3 weeks
Fix suggested by: tegge


# bff76343 14-Oct-2006 Alan Cox <alc@FreeBSD.org>

Change vnode_pager_addr() such that on returning it distinguishes between
an error returned by VOP_BMAP() and a hole in the file.

Change the callers to vnode_pager_addr() such that they return
VM_PAGER_ERROR when VOP_BMAP fails instead of a zero-filled page.

Reviewed by: tegge
MFC after: 3 weeks


# 1de11f1a 10-Oct-2006 Alan Cox <alc@FreeBSD.org>

Distinguish between two distinct kinds of errors from VOP_BMAP() in
vnode_pager_generic_getpages(): (1) that VOP_BMAP() is unsupported by the
underlying file system and (2) an error in performing the VOP_BMAP().
Previously, vnode_pager_generic_getpages() assumed that all errors were
of the first type. If, in fact, the error was of the second type, the
likely outcome was for the process to become permanently blocked on a busy
page.

MFC after: 3 weeks
Reviewed by: tegge


# f4f83da0 08-Oct-2006 Alan Cox <alc@FreeBSD.org>

Change vnode_pager_generic_getpages() so that it does not panic if the
given file is sparse. Instead, it zeroes the requested page.

Reviewed by: tegge
PR: kern/98116
MFC after: 3 days


# 5786be7c 09-Aug-2006 Alan Cox <alc@FreeBSD.org>

Introduce a field to struct vm_page for storing flags that are
synchronized by the lock on the object containing the page.

Transition PG_WANTED and PG_SWAPINPROG to use the new field,
eliminating the need for holding the page queues lock when setting
or clearing these flags. Rename PG_WANTED and PG_SWAPINPROG to
VPO_WANTED and VPO_SWAPINPROG, respectively.

Eliminate the assertion that the page queues lock is held in
vm_page_io_finish().

Eliminate the acquisition and release of the page queues lock
around calls to vm_page_io_finish() in kern_sendfile() and
vfs_unbusy_pages().


# 3b582b4e 02-Mar-2006 Tor Egge <tegge@FreeBSD.org>

Eliminate a deadlock when creating snapshots. Blocking vn_start_write() must
be called without any vnode locks held. Remove calls to vn_start_write() and
vn_finished_write() in vnode_pager_putpages() and add these calls before the
vnode lock is obtained to most of the callers that don't already have them.


# b73f64c4 06-Feb-2006 Jeff Roberson <jeff@FreeBSD.org>

- Fix silly VI locking that is used to check a single flag. The vnode
lock also protects this flag so it is not necessary.
- Don't rely on v_mount to detect whether or not we've been recycled, use
the more appropriate VI_DOOMED instead.

Sponsored by: Isilon Systems, Inc.
MFC After: 1 week


# 731959b1 31-Jan-2006 Yaroslav Tykhiy <ytykhiy@gmail.com>

Use off_t for file size passed to vnode_create_vobject().
The former type, size_t, was causing truncation to 32 bits on i386,
which immediately led to undersizing of VM objects backed by
files >4GB. In particular, sendfile(2) was broken for such files.

PR: kern/92243
MFC after: 5 days


# dd498bef 01-Nov-2005 Paul Saab <ps@FreeBSD.org>

Rate limit vnode_pager_putpages printfs to once a second.


# 857b66d5 13-Aug-2005 Alexander Kabaev <kan@FreeBSD.org>

Do not use vm_pager_init() to initialize vnode_pbuf_freecnt variable.
vm_pager_init() is run before required nswbuf variable has been set
to correct value. This caused system to run with single pbuf available
for vnode_pager. Handle both cluster_pbuf_freecnt and vnode_pbuf_freecnt
variable in the same way.

Reported by: ade
Obtained from: alc
MFC after: 2 days


# 4f12e0ac 08-Aug-2005 Suleiman Souhlal <ssouhlal@FreeBSD.org>

Use atomic operations on runningbufspace.

PR: kern/84318
Submitted by: ade
MFC after: 3 days


# d2d9c9ac 18-May-2005 Alan Cox <alc@FreeBSD.org>

Remove a stale comment concerning spl* usage.


# f3aad9a6 18-May-2005 Bjoern A. Zeeb <bz@FreeBSD.org>

Correct 32 vs 64 bit signedness issues.

Approved by: pjd (mentor)
MFC after: 2 weeks


# ed4fe4f4 03-May-2005 Jeff Roberson <jeff@FreeBSD.org>

- Add a new object flag "OBJ_NEEDSGIANT". We set this flag if the
underlying vnode requires Giant.
- In vm_fault only acquire Giant if the underlying object has NEEDSGIANT
set.
- In vm_object_shadow inherit the NEEDSGIANT flag from the backing object.


# 6e4b2820 03-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Don't NULL the vnode's v_object pointer until after the object is torn
down. If we have dirty pages, the putpages routine will need to know
what the vnode's object is so that it may write out dirty pages.

Pointy hat: phk
Found by: obrien


# f247a524 30-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- LK_NOPAUSE is a nop now.

Sponsored by: Isilon Systems, Inc.


# 7747c038 14-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Don't directly adjust v_usecount, use vref() instead.

Sponsored by: Isilon Systems, Inc.


# 1d39df3f 14-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Retire OLOCK and OWANT. All callers hold the vnode lock when creating
a vnode object. There has been an assert to prove this for some time.

Sponsored by: Isilon Systems, Inc.


# 493d78b3 12-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Don't acquire the vnode lock in destroy_vobject, assert that it has
already been acquired by the caller.

Sponsored by: Isilon Systems, Inc.


# dfd4be14 19-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Try to unbreak the vnode locking around vop_reclaim() (based mostly on
patch from kan@).

Pull bufobj_invalbuf() out of vinvalbuf() and make g_vfs call it on
close. This is not yet a generally safe function, but for this very
specific use it is safe. This solves the problem with buffers not
being flushed by unmount or after failed mount attempts.


# 7146d6cb 28-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Move the contents of vop_stddestroyvobject() to the new vnode_pager
function vnode_destroy_vobject().

Make the new function zero the vp->v_object pointer so we can tell
if a call is missing.


# d07a6d3f 24-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Move the body of vop_stdcreatevobject() over to the vnode_pager under
the name Sande^H^H^H^H^Hvnode_create_vobject().

Make the new function take a size argument which removes the need for
a VOP_STAT() or a very pessimistic guess for disks.

Call that new function from vop_stdcreatevobject().

Make vnode_pager_alloc() private now that its only user came home.


# 35764be3 24-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Kill the VV_OBJBUF and test the v_object for NULL instead.


# ae51ff11 24-Jan-2005 Jeff Roberson <jeff@FreeBSD.org>

- Remove GIANT_REQUIRED where giant is no longer required.
- Use VFS_LOCK_GIANT() rather than directly acquiring giant in places
where giant is only held because vfs requires it.

Sponsored By: Isilon Systems, Inc.


# 60727d8b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# 475e8cc3 25-Dec-2004 Poul-Henning Kamp <phk@FreeBSD.org>

fix comment


# 2ad036b6 07-Dec-2004 Alan Cox <alc@FreeBSD.org>

Almost nine years ago, when support for 1TB files was introduced in
revision 1.55, the address parameter to vnode_pager_addr() was changed
from an unsigned 32-bit quantity to a signed 64-bit quantity. However,
an out-of-range check on the address was not updated. Consequently,
memory-mapped I/O on files greater than 2GB could cause a kernel panic.
Since the address is now a signed 64-bit quantity, the problem resolution
is simply to remove a cast.

Reviewed by: bde@ and tegge@
PR: 73010
MFC after: 1 week


# d8fed1d0 05-Dec-2004 Alan Cox <alc@FreeBSD.org>

Correct a sanity check in vnode_pager_generic_putpages(). The cast used
to implement the sanity check should have been changed when we converted
the implementation of vm_pindex_t from 32 to 64 bits. (Thus, RELENG_4 is
not affected.) The consequence of this error would be a legimate write to
an extremely large file being treated as an errant attempt to write meta-
data.

Discussed with: tegge@


# 9c83534d 15-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Make VOP_BMAP return a struct bufobj for the underlying storage device
instead of a vnode for it.

The vnode_pager does not and should not have any interest in what
the filesystem uses for backend.

(vfs_cluster doesn't use the backing store argument.)


# 676f3ee2 15-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Explicitly call pbrelvp()


# 19187819 05-Nov-2004 Alan Cox <alc@FreeBSD.org>

Move a call to wakeup() from vm_object_terminate() to vnode_pager_dealloc()
because this call is only needed to wake threads that slept when they
discovered a dead object connected to a vnode. To eliminate unnecessary
calls to wakeup() by vnode_pager_dealloc(), introduce a new flag,
OBJ_DISCONNECTWNT.

Reviewed by: tegge@


# 6229cc50 26-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Also check that the sectormask is bigger than zero.

Wrap this overly long KASSERT and remove newline.


# 5d9d81e7 26-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Put the I/O block size in bufobj->bo_bsize.

We keep si_bsize_phys around for now as that is the simplest way to pull
the number out of disk device drivers in devfs_open(). The correct solution
would be to do an ioctl(DIOCGSECTORSIZE), but the point is probably mooth
when filesystems sit on GEOM, so don't bother for now.


# b792bebe 24-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Move the buffer method vector (buf->b_op) to the bufobj.

Extend it with a strategy method.

Add bufstrategy() which do the usual VOP_SPECSTRATEGY/VOP_STRATEGY
song and dance.

Rename ibwrite to bufwrite().

Move the two NFS buf_ops to more sensible places, add bufstrategy
to them.

Add inlines for bwrite() and bstrategy() which calls through
buf->b_bufobj->b_ops->b_{write,strategy}().

Replace almost all VOP_STRATEGY()/VOP_SPECSTRATEGY() calls with bstrategy().


# 1a31a6c3 07-Sep-2004 Poul-Henning Kamp <phk@FreeBSD.org>

add KASSERTS


# 0cb507cb 18-Aug-2004 Alan Cox <alc@FreeBSD.org>

Acquire and release Giant around a call to VOP_BMAP(). (This is a
prerequisite to any further reduction in Giant's use by vm_fault().)


# 5a324893 05-May-2004 Alan Cox <alc@FreeBSD.org>

Make vm_page's PG_ZERO flag immutable between the time of the page's
allocation and deallocation. This flag's principal use is shortly after
allocation. For such cases, clearing the flag is pointless. The only
unusual use of PG_ZERO is in vfs_bio_clrbuf(). However, allocbuf() never
requests a prezeroed page. So, vfs_bio_clrbuf() never sees a prezeroed
page.

Reviewed by: tegge@


# 87aefa49 23-Apr-2004 Alan Cox <alc@FreeBSD.org>

Push down Giant into vm_pager_get_pages(). The only get pages methods that
require Giant are in the device and vnode pagers.


# 9e0ddbd0 06-Apr-2004 Alan Cox <alc@FreeBSD.org>

Eliminate vm_pager_map_page() and vm_pager_unmap_page() and their uses.
Use sf_buf_alloc() and sf_buf_free() instead.


# a6704857 03-Jan-2004 Alan Cox <alc@FreeBSD.org>

Eliminate the acquisition and release of Giant from vnode_pager_alloc().
The vm object and vnode locking should suffice.

Discussed with: jeff


# 167a9eff 15-Nov-2003 Tim J. Robbins <tjr@FreeBSD.org>

In vnode_pager_input_smlfs(), call VOP_STRATEGY instead of VOP_SPECSTRATEGY
on non-VCHR vnodes. This fixes a panic when reading data from files on a
filesystem with a small (less than a page) block size.

PR: 59271
Reviewed by: alc


# 52051abc 24-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Call vnode_pager_input_old() with the vm object locked.


# 2e3b314d 24-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Push down Giant from vm_pageout() to vm_pageout_scan(), freeing
vm_pageout_page_stats() from Giant.
- Modify vm_pager_put_pages() and vm_pager_page_unswapped() to expect the
vm object to be locked on entry. (All of the pager routines now expect
this.)


# 2bf43e43 19-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Hold the vm object's lock around calls to vm_page_set_validclean().


# 1b26eb10 18-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Synchronize access to a vm page's valid field using the containing
vm object's lock.
- Reduce the scope of the vm page queues lock in two places.


# 8b575f6c 18-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Synchronize access to the page's valid field in
vnode_pager_generic_getpages() using the containing object's lock.


# 2c18019f 18-Oct-2003 Poul-Henning Kamp <phk@FreeBSD.org>

DuH!

bp->b_iooffset (the spot on the disk), not bp->b_offset (the offset in
the file)


# 9fbf91c0 18-Oct-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Initialize bp->b_offset before calling VOP_[SPEC]STRATEGY().
Remove stale comment about B_PHYS.


# 417a26a1 17-Sep-2003 Alan Cox <alc@FreeBSD.org>

Add vm object locking to vnode_pager_lock(). (This triggers the movement
of a VM_OBJECT_LOCK() in vm_fault().)


# 23562e4b 28-Aug-2003 Marcel Moolenaar <marcel@FreeBSD.org>

In vnode_pager_generic_putpages(), change the printf format specifier
to long and explicitly cast field dirty of struct vm_page to unsigned
long. When PAGE_SIZE is 32K, this field is actually unsigned long.


# b7ad744d 23-Aug-2003 Alan Cox <alc@FreeBSD.org>

Hold the page queues lock when performing vm_page_clear_dirty() and
vm_page_set_invalid().


# 6a4b5823 18-Aug-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Replace a homegrown bdone()/bwait() implementation by the real thing


# ec794849 17-Aug-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Use NULL for 3rd argument of VOP_BMAP() rather than custom cast.
Eliminate unused variable.


# 4e658600 05-Aug-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Use sparse struct initializations for struct pagerops.

This makes grepping for which pagers implement which methods easier.


# f29ba63e 22-Jun-2003 Alan Cox <alc@FreeBSD.org>

Maintain a lock on the vm object of interest throughout vm_fault(),
releasing the lock only if we are about to sleep (e.g., vm_pager_get_pages()
or vm_pager_has_pages()). If we sleep, we have marked the vm object with
the paging-in-progress flag.


# 31953be9 17-Jun-2003 Alan Cox <alc@FreeBSD.org>

Lock the vm object when freeing a vm page.


# 8630c117 12-Jun-2003 Alan Cox <alc@FreeBSD.org>

Add vm object locking to various pagers' "get pages" methods, i386 stack
management functions, and a u area management function.


# 874651b1 11-Jun-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().


# 2a8f9ab5 10-Jun-2003 Alan Cox <alc@FreeBSD.org>

- Finish vm object and page locking in vnode_pager_setsize().
- Make some small style changes to vnode_pager_setsize(); most notably,
move two comments to a more logical place.


# 658ad5ff 05-May-2003 Alan Cox <alc@FreeBSD.org>

Lock the vm_object when performing vm_pager_deallocate().


# 1ca58953 26-Apr-2003 Alan Cox <alc@FreeBSD.org>

- Convert vm_object_pip_wait() from using tsleep() to msleep().
- Make vm_object_pip_sleep() static.
- Lock the vm_object when performing vm_object_pip_wait().


# 49281fbf 18-Apr-2003 Alan Cox <alc@FreeBSD.org>

Update locking around vm_object_page_remove() to use the new macros.


# b4b138c2 18-Mar-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Including <sys/stdint.h> is (almost?) universally only to be able to use
%j in printfs, so put a newsted include in <sys/systm.h> where the printf
prototype lives and save everybody else the trouble.


# 09c80124 05-Mar-2003 Alan Cox <alc@FreeBSD.org>

Remove ENABLE_VFS_IOOPT. It is a long unfinished work-in-progress.

Discussed on: arch@


# ca94e7c4 13-Jan-2003 Poul-Henning Kamp <phk@FreeBSD.org>

We can get past here on a normal vnode as well, so use VOP_STRATEGY if so.


# 5266a767 05-Jan-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Convert VOP_STRATEGY to VOP_SPECSTRATEGY in the generic getpages and
the pager input for small filesystems.


# 86270230 02-Jan-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Convert calls to BUF_STRATEGY to VOP_STRATEGY calls. This is a no-op since
all BUF_STRATEGY did in the first place was call VOP_STRATEGY.


# 43b7990e 28-Dec-2002 Matthew Dillon <dillon@FreeBSD.org>

Allow the VM object flushing code to cluster. When the filesystem syncer
comes along and flushes a file which has been mmap()'d SHARED/RW, with
dirty pages, it was flushing the underlying VM object asynchronously,
resulting in thousands of 8K writes. With this change the VM Object flushing
code will cluster dirty pages in 64K blocks.

Note that until the low memory deadlock issue is reviewed, it is not safe
to allow the pageout daemon to use this feature. Forced pageouts still
use fs block size'd ops for the moment.

MFC after: 3 days


# 475e8011 14-Dec-2002 Alan Cox <alc@FreeBSD.org>

Perform vm_object_lock() and vm_object_unlock() around
vm_object_page_remove().


# 85e01243 27-Nov-2002 Alan Cox <alc@FreeBSD.org>

Hold the page queues lock when performing pmap_clear_modify().

Approved by: re (blanket)


# 178949e0 23-Nov-2002 Alan Cox <alc@FreeBSD.org>

Hold the page queues/flags lock when calling vm_page_set_validclean().

Approved by: re


# e8a27959 22-Nov-2002 Alan Cox <alc@FreeBSD.org>

Add page queue and flag locking in vnode_pager_setsize().

Approved by: re


# 4fec79be 16-Nov-2002 Alan Cox <alc@FreeBSD.org>

Now that pmap_remove_all() is exported by our pmap implementations
use it directly.


# d154fb4f 10-Nov-2002 Alan Cox <alc@FreeBSD.org>

When prot is VM_PROT_NONE, call pmap_page_protect() directly rather than
indirectly through vm_page_protect(). The one remaining page flag that
is updated by vm_page_protect() is already being updated by our various
pmap implementations.

Note: A later commit will similarly change the VM_PROT_READ case and
eliminate vm_page_protect().


# bf1001fa 07-Nov-2002 Maxime Henrion <mux@FreeBSD.org>

Better printf() formats.


# 37c84183 28-Sep-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Be consistent about "static" functions: if the function is marked
static in its prototype, mark it static at the definition too.

Inspired by: FlexeLint warning #512


# 63e7e60d 24-Sep-2002 Jeff Roberson <jeff@FreeBSD.org>

- Add a ASSERT_VOP_LOCKED in vnode_pager_alloc.
- Lock access to v_iflags.


# fff6062a 24-Aug-2002 Alan Cox <alc@FreeBSD.org>

o Retire vm_page_zero_fill() and vm_page_zero_fill_area(). Ever since
pmap_zero_page() and pmap_zero_page_area() were modified to accept
a struct vm_page * instead of a physical address, vm_page_zero_fill()
and vm_page_zero_fill_area() have served no purpose.


# e6e370a7 04-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Replace v_flag with v_iflag and v_vflag
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
clear.
- Many functions in vfs_subr.c were restructured to provide for stronger
locking.

Idea stolen from: BSD/OS


# e43c2eab 28-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Lock page queue accesses by vm_page_free().
o Apply some style fixes.


# 9d522888 26-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Lock page queue accesses by vm_page_activate() and vm_page_deactivate().


# 47e151dd 01-Jul-2002 Robert Drehmel <robert@FreeBSD.org>

- Use (OFF_TO_IDX(off) - pi) instead of (OFF_TO_IDX(off - IDX_TO_OFF(pi))).
- Reformat a comment.


# 990ab7ad 22-Jun-2002 Alan Cox <alc@FreeBSD.org>

o Replace GIANT_REQUIRED in vnode_pager_alloc() by the acquisition and
release of Giant. (Annotate as MPSAFE.)
o Also, in vnode_pager_alloc(), remove an unnecessary re-initialization
of struct vm_object::flags and move a statement that is duplicated
in both branches of an if-else.


# d394511d 16-May-2002 Tom Rhodes <trhodes@FreeBSD.org>

More s/file system/filesystem/g


# 98b0c789 14-May-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Make daddr_t and u_daddr_t 64bits wide.
Retire daddr64_t and use daddr_t instead.

Sponsored by: DARPA & NAI Labs.


# c0b6bbb8 05-May-2002 Alan Cox <alc@FreeBSD.org>

o Condition the compilation and use of vm_freeze_copyopts()
on ENABLE_VFS_IOOPT.


# 44e74ba6 27-Apr-2002 Peter Wemm <peter@FreeBSD.org>

We do not necessarily need to map/unmap pages to zero parts of them.
On systems where physical memory is also direct mapped (alpha, sparc,
ia64 etc) this is slightly harmful.


# 11caded3 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove __P.


# 0d2af521 15-Mar-2002 Kirk McKusick <mckusick@FreeBSD.org>

Introduce the new 64-bit size disk block, daddr64_t. Change
the bio and buffer structures to have daddr64_t bio_pblkno,
b_blkno, and b_lblkno fields which allows access to disks
larger than a Terabyte in size. This change also requires
that the VOP_BMAP vnode operation accept and return daddr64_t
blocks. This delta should not affect system operation in
any way. It merely sets up the necessary interfaces to allow
the development of disk drivers that work with these larger
disk block addresses. It also allows for the development of
UFS2 which will use 64-bit block addresses.


# a1287949 10-Mar-2002 Eivind Eklund <eivind@FreeBSD.org>

- Remove a number of extra newlines that do not belong here according to
style(9)
- Minor space adjustment in cases where we have "( ", " )", if(), return(),
while(), for(), etc.
- Add /* SYMBOL */ after a few #endifs.

Reviewed by: alc


# a854ed98 27-Feb-2002 John Baldwin <jhb@FreeBSD.org>

Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.


# 3ebeaf59 13-Dec-2001 Matthew Dillon <dillon@FreeBSD.org>

This fixes a large number of bugs in our NFS client side code. A recent
commit by Kirk also fixed a softupdates bug that could easily be triggered
by server side NFS.

* An edge case with shared R+W mmap()'s and truncate whereby
the system would inappropriately clear the dirty bits on
still-dirty data. (applicable to all filesystems)

THIS FIX TEMPORARILY DISABLED PENDING FURTHER TESTING.
see vm/vm_page.c line 1641

* The straddle case for VM pages and buffer cache buffers when
truncating. (applicable to NFS client side)

* Possible SMP database corruption due to vm_pager_unmap_page()
not clearing the TLB for the other cpu's. (applicable to NFS
client side but could effect all filesystems). Note: not
considered serious since the corruption occurs beyond the file
EOF.

* When flusing a dirty buffer due to B_CACHE getting cleared,
we were accidently setting B_CACHE again (that is, bwrite() sets
B_CACHE), when we really want it to stay clear after the write
is complete. This resulted in a corrupt buffer. (applicable
to all filesystems but probably only triggered by NFS)

* We have to call vtruncbuf() when ftruncate()ing to remove
any buffer cache buffers. This is still tentitive, I may
be able to remove it due to the second bug fix. (applicable
to NFS client side)

* vnode_pager_setsize() race against nfs_vinvalbuf()... we have
to set n_size before calling nfs_vinvalbuf or the NFS code
may recursively vnode_pager_setsize() to the original value
before the truncate. This is what was causing the user mmap
bus faults in the nfs tester program. (applicable to NFS
client side)

* Fix to softupdates (see ufs/ffs/ffs_inode.c 1.73, commit made
by Kirk).

Testing program written by: Avadis Tevanian, Jr.
Testing program supplied by: jkh / Apple (see Dec2001 posting to freebsd-hackers with Subject 'NFS: How to make FreeBS fall on its face in one easy step')
MFC after: 1 week


# 33c67741 05-Nov-2001 Matthew Dillon <dillon@FreeBSD.org>

Adjust vnode_pager_input_smlfs() to not attempt to BMAP blocks beyond the
file EOF. This works around a bug in the ISOFS (CDRom) BMAP code which
returns bogus values for requests beyond the file EOF rather then returning
an error, resulting in either corrupt data being mmap()'d beyond the file EOF
or resulting in a seg-fault on the last page of a mmap()'d file (mmap()s of
CDRom files).

Reported by: peter / Yahoo
MFC after: 3 days


# 00a6f47f 12-Oct-2001 Matthew Dillon <dillon@FreeBSD.org>

Finally fix the VM bug where a file whos EOF occurs in the middle of a page
would sometimes prevent a dirty page from being cleaned, even when synced,
resulting in the dirty page being re-flushed to disk every 30-60 seconds or
so, forever. The problem is that when the filesystem flushes a page to
its backing file it typically does not clear dirty bits representing areas
of the page that are beyond the file EOF. If the file is also mmap()'d and
a fault is taken, vm_fault (properly, is required to) set the vm_page_t->dirty
bits to VM_PAGE_BITS_ALL. This combination could leave us with an uncleanable,
unfreeable page.

The solution is to have the vnode_pager detect the edge case and manually
clear the dirty bits representing areas beyond the file EOF. The filesystem
does the rest and the page comes up clean after the write completes.

MFC after: 3 days


# bd78cece 11-Oct-2001 John Baldwin <jhb@FreeBSD.org>

Change the kernel's ucred API as follows:
- crhold() returns a reference to the ucred whose refcount it bumps.
- crcopy() now simply copies the credentials from one credential to
another and has no return value.
- a new crshared() primitive is added which returns true if a ucred's
refcount is > 1 and false (0) otherwise.


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# bd8e0d58 04-Aug-2001 John Baldwin <jhb@FreeBSD.org>

Whitespace fixes.


# 54d92145 04-Jul-2001 Matthew Dillon <dillon@FreeBSD.org>

whitespace / register cleanup


# 0cddd8f0 04-Jul-2001 Matthew Dillon <dillon@FreeBSD.org>

With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.


# b62b9b64 03-Jul-2001 John Baldwin <jhb@FreeBSD.org>

Fix a XXX comment by moving the initialization of the number of pbuf's
for the vnode pager to a new vnode pager init method instead of making it
a hack in getpages().


# 342a1480 29-May-2001 John Baldwin <jhb@FreeBSD.org>

Don't hold the VM lock across VOP's and other things that can sleep.


# e6b961ff 23-May-2001 John Baldwin <jhb@FreeBSD.org>

- Assert Giant is held in the vnode pager methods.
- Lock the VM while walking down a vm_object's backing_object list in
vnode_pager_lock().


# 23955314 18-May-2001 Alfred Perlstein <alfred@FreeBSD.org>

Introduce a global lock for the vm subsystem (vm_mtx).

vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb


# 60fb0ce3 28-Apr-2001 Greg Lehey <grog@FreeBSD.org>

Revert consequences of changes to mount.h, part 2.

Requested by: bde


# d98dc34f 23-Apr-2001 Greg Lehey <grog@FreeBSD.org>

Correct #includes to work with fixed sys/mount.h.


# d8d5fa88 19-Apr-2001 Alfred Perlstein <alfred@FreeBSD.org>

vnode_pager_freepage() is really vm_page_free() in disguise,
nuke vnode_pager_freepage() and replace all calls to it with vm_page_free()


# 2b6b0df7 26-Dec-2000 Matthew Dillon <dillon@FreeBSD.org>

This implements a better launder limiting solution. There was a solution
in 4.2-REL which I ripped out in -stable and -current when implementing the
low-memory handling solution. However, maxlaunder turns out to be the saving
grace in certain very heavily loaded systems (e.g. newsreader box). The new
algorithm limits the number of pages laundered in the first pageout daemon
pass. If that is not sufficient then suceessive will be run without any
limit.

Write I/O is now pipelined using two sysctls, vfs.lorunningspace and
vfs.hirunningspace. This prevents excessive buffered writes in the
disk queues which cause long (multi-second) delays for reads. It leads
to more stable (less jerky) and generally faster I/O streaming to disk
by allowing required read ops (e.g. for indirect blocks and such) to occur
without interrupting the write stream, amoung other things.

NOTE: eventually, filesystem write I/O pipelining needs to be done on a
per-device basis. At the moment it is globalized.


# f2a2857b 11-Jul-2000 Kirk McKusick <mckusick@FreeBSD.org>

Add snapshots to the fast filesystem. Most of the changes support
the gating of system calls that cause modifications to the underlying
filesystem. The gating can be enabled by any filesystem that needs
to consistently suspend operations by adding the vop_stdgetwritemount
to their set of vnops. Once gating is enabled, the function
vfs_write_suspend stops all new write operations to a filesystem,
allows any filesystem modifying system calls already in progress
to complete, then sync's the filesystem to disk and returns. The
function vfs_write_resume allows the suspended write operations to
begin again. Gating is not added by default for all filesystems as
for SMP systems it adds two extra locks to such critical kernel
paths as the write system call. Thus, gating should only be added
as needed.

Details on the use and current status of snapshots in FFS can be
found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness
is not included here. Unless and until you create a snapshot file,
these changes should have no effect on your system (famous last words).


# 0385347c 20-May-2000 Peter Wemm <peter@FreeBSD.org>

Implement an optimization of the VM<->pmap API. Pass vm_page_t's directly
to various pmap_*() functions instead of looking up the physical address
and passing that. In many cases, the first thing the pmap code was doing
was going to a lot of trouble to get back the original vm_page_t, or
it's shadow pv_table entry.

Inspired by: John Dyson's 1998 patches.

Also:
Eliminate pv_table as a seperate thing and build it into a machine
dependent part of vm_page_t. This eliminates having a seperate set of
structions that shadow each other in a 1:1 fashion that we often went to
a lot of trouble to translate from one to the other. (see above)
This happens to save 4 bytes of physical memory for each page in the
system. (8 bytes on the Alpha).

Eliminate the use of the phys_avail[] array to determine if a page is
managed (ie: it has pv_entries etc). Store this information in a flag.
Things like device_pager set it because they create vm_page_t's on the
fly that do not have pv_entries. This makes it easier to "unmanage" a
page of physical memory (this will be taken advantage of in subsequent
commits).

Add a function to add a new page to the freelist. This could be used
for reclaiming the previously wasted pages left over from preloaded
loader(8) files.

Reviewed by: dillon


# 9626b608 05-May-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Separate the struct bio related stuff out of <sys/buf.h> into
<sys/bio.h>.

<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bdes teachings on the
subject of nested includes.

Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.> unless they need caching of data.

Still a few bogus uses of struct buf to track down.

Repocopy by: peter


# c244d2de 02-Apr-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Move B_ERROR flag to b_ioflags and call it BIO_ERROR.

(Much of this done by script)

Move B_ORDERED flag to b_ioflags and call it BIO_ORDERED.

Move b_pblkno and b_iodone_chain to struct bio while we transition, they
will be obsoleted once bio structs chain/stack.

Add bio_queue field for struct bio aware disksort.

Address a lot of stylistic issues brought up by bde.


# 5929bcfa 27-Mar-2000 Philippe Charnier <charnier@FreeBSD.org>

Revert spelling mistake I made in the previous commit
Requested by: Alan and Bruce


# 956f3135 26-Mar-2000 Philippe Charnier <charnier@FreeBSD.org>

Spelling


# b99c307a 20-Mar-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Rename the existing BUF_STRATEGY() to DEV_STRATEGY()

substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo)

substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo)

This patch is machine generated except for the ccd.c and buf.h parts.


# 21144e3b 20-Mar-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Remove B_READ, B_WRITE and B_FREEBUF and replace them with a new
field in struct buf: b_iocmd. The b_iocmd is enforced to have
exactly one bit set.

B_WRITE was bogusly defined as zero giving rise to obvious coding
mistakes.

Also eliminate the redundant struct buf flag B_CALL, it can just
as efficiently be done by comparing b_iodone to NULL.

Should you get a panic or drop into the debugger, complaining about
"b_iocmd", don't continue. It is likely to write on your disk
where it should have been reading.

This change is a step in the direction towards a stackable BIO capability.

A lot of this patch were machine generated (Thanks to style(9) compliance!)

Vinum users: Greg has not had time to test this yet, be careful.


# 923502ff 29-Oct-1999 Poul-Henning Kamp <phk@FreeBSD.org>

useracc() the prequel:

Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.


# 24579ca1 16-Sep-1999 Matthew Dillon <dillon@FreeBSD.org>

The vnode pager (used when you do file-backed mmaps) must use the
underlying physical sector size when aligning I/O transfer sizes.
It cannot assume 512 bytes.

We assume the underlying sector size is a power of 2. If it isn't,
mmap() will break badly anyway (in the same way mmap broke with NFS
when NFS tried to cache piecemeal write ranges in buffers, before
we enforced read-buffer-before-write-piecemeal for NFS).

Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# 2c28a105 16-Aug-1999 Alan Cox <alc@FreeBSD.org>

Add the (inline) function vm_page_undirty for clearing the dirty bitmask
of a vm_page.

Use it.

Submitted by: dillon


# 3efc015b 01-Jul-1999 Peter Wemm <peter@FreeBSD.org>

Fix some int/long printf problems for the Alpha


# 67812eac 25-Jun-1999 Kirk McKusick <mckusick@FreeBSD.org>

Convert buffer locking from using the B_BUSY and B_WANTED flags to using
lockmgr locks. This commit should be functionally equivalent to the old
semantics. That is, all buffer locking is done with LK_EXCLUSIVE
requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will
be done in future commits.


# 54746b67 15-May-1999 Dmitrij Tejblum <dt@FreeBSD.org>

Fix confusion of size of transfer with size of the pager.

PR: 11658
Broken in: 1.89 (1998/03/07)


# b0eeea20 06-May-1999 Poul-Henning Kamp <phk@FreeBSD.org>

remove b_proc from struct buf, it's (now) unused.

Reviewed by: dillon, bde


# 4221e284 02-May-1999 Alan Cox <alc@FreeBSD.org>

The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.

The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most noteably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistancy issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.

getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vise-versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.

There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.

Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.

Submitted by: Matthew Dillon <dillon@apollo.backplane.com>


# 897a45ef 10-Apr-1999 Dmitrij Tejblum <dt@FreeBSD.org>

Convert usage of vm_page_bits() to the new convention ("Inputs are required
to range within a page").


# 8d17e694 05-Apr-1999 Julian Elischer <julian@FreeBSD.org>

Catch a case spotted by Tor where files mmapped could leave garbage in the
unallocated parts of the last page when the file ended on a frag
but not a page boundary.
Delimitted by tags PRE_MATT_MMAP_EOF and POST_MATT_MMAP_EOF,
in files alpha/alpha/pmap.c i386/i386/pmap.c nfs/nfs_bio.c vm/pmap.h
vm/vm_page.c vm/vm_page.h vm/vnode_pager.c miscfs/specfs/spec_vnops.c
ufs/ufs/ufs_readwrite.c kern/vfs_bio.c

Submitted by: Matt Dillon <dillon@freebsd.org>
Reviewed by: Alan Cox <alc@freebsd.org>


# 4491ea91 26-Mar-1999 Eivind Eklund <eivind@FreeBSD.org>

Correct a comment.


# 0e3cdf2c 27-Feb-1999 Alan Cox <alc@FreeBSD.org>

Reviewed by: "John S. Dyson" <dyson@iquest.net>
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
To prevent a deadlock, if we are extremely low on memory, force synchronous
operation by the VOP_PUTPAGES in vnode_pager_putpages.


# e4542174 23-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

vm_pager_put_pages() is passed an rcval array to hold per-page return
values. The 'int' return value for the procedure was never used and
not well defined in any case when there are mixed errors on pages, so
it has been removed. vm_pager_put_pages() and associated vm_pager
functions now return void.


# 1c7c3c6a 21-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

This is a rather large commit that encompasses the new swapper,
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough in regards to code
cleanup issues.

Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>


# af1f63c7 04-Dec-1998 Robert V. Baron <rvb@FreeBSD.org>

In vnode_pager_input_old, set auio.uio_procp = curproc
vs auio.uio_procp = (struct proc *) 0


# 6cde7a16 13-Oct-1998 David Greenman <dg@FreeBSD.org>

Fixed two potentially serious classes of bugs:

1) The vnode pager wasn't properly tracking the file size due to
"size" being page rounded in some cases and not in others.
This sometimes resulted in corrupted files. First noticed by
Terry Lambert.
Fixed by changing the "size" pager_alloc parameter to be a 64bit
byte value (as opposed to a 32bit page index) and changing the
pagers and their callers to deal with this properly.
2) Fixed a bogus type cast in round_page() and trunc_page() that
caused some 64bit offsets and sizes to be scrambled. Removing
the cast required adding casts at a few dozen callers.
There may be problems with other bogus casts in close-by
macros. A quick check seemed to indicate that those were okay,
however.


# 6e3a3f38 28-Sep-1998 Robert V. Baron <rvb@FreeBSD.org>

John Dyson approved of this solution; make vnode_pager_input_old set m->valid


# 500b04a2 05-Sep-1998 Bruce Evans <bde@FreeBSD.org>

Instantiate `nfs_mount_type' in a standard file so that it is present
when nfs is an LKM. Declare it in a header file. Don't forget to use
it in non-Lite2 code. Initialize it to -1 instead of to 0, since 0
will soon be the mount type number for the first vfs loaded.

NetBSD uses strcmp() to avoid this ugly global.


# e69763a3 04-Sep-1998 Doug Rabson <dfr@FreeBSD.org>

Cosmetic changes to the PAGE_XXX macros to make them consistent with
the other objects in vm.


# c576d121 25-Aug-1998 Luoqi Chen <luoqi@FreeBSD.org>

Fix a rounding problem that causes vnode pager to fail to remove the last
partially filled page during a truncation.

PR: kern/7422


# 069e9bc1 24-Aug-1998 Doug Rabson <dfr@FreeBSD.org>

Change various syscalls to use size_t arguments instead of u_int.

Add some overflow checks to read/write (from bde).

Change all modifications to vm_page::flags, vm_page::busy, vm_object::flags
and vm_object::paging_in_progress to use operations which are not
interruptable.

Reviewed by: Bruce Evans <bde@zeta.org.au>


# fc62ef1f 11-Jul-1998 Bruce Evans <bde@FreeBSD.org>

Fixed printf format errors.


# ac1e407b 11-Jul-1998 Bruce Evans <bde@FreeBSD.org>

Fixed printf format errors.


# fd5d1124 04-Jul-1998 Julian Elischer <julian@FreeBSD.org>

VOP_STRATEGY grows an (struct vnode *) argument
as the value in b_vp is often not really what you want.
(and needs to be frobbed). more cleanups will follow this.
Reviewed by: Bruce Evans <bde@freebsd.org>


# bef608bd 15-Mar-1998 John Dyson <dyson@FreeBSD.org>

Some VM improvements, including elimination of alot of Sig-11
problems. Tor Egge and others have helped with various VM bugs
lately, but don't blame him -- blame me!!!

pmap.c:
1) Create an object for kernel page table allocations. This
fixes a bogus allocation method previously used for such, by
grabbing pages from the kernel object, using bogus pindexes.
(This was a code cleanup, and perhaps a minor system stability
issue.)

pmap.c:
2) Pre-set the modify and accessed bits when prudent. This will
decrease bus traffic under certain circumstances.

vfs_bio.c, vfs_cluster.c:
3) Rather than calculating the beginning virtual byte offset
multiple times, stick the offset into the buffer header, so
that the calculated offset can be reused. (Long long multiplies
are often expensive, and this is a probably unmeasurable performance
improvement, and code cleanup.)

vfs_bio.c:
4) Handle write recursion more intelligently (but not perfectly) so
that it is less likely to cause a system panic, and is also
much more robust.

vfs_bio.c:
5) getblk incorrectly wrote out blocks that are incorrectly sized.
The problem is fixed, and writes blocks out ONLY when B_DELWRI
is true.

vfs_bio.c:
6) Check that already constituted buffers have fully valid pages. If
not, then make sure that the B_CACHE bit is not set. (This was
a major source of Sig-11 type problems.)

vfs_bio.c:
7) Fix a potential system deadlock due to an incorrectly specified
sleep priority while waiting for a buffer write operation. The
change that I made opens the system up to serious problems, and
we need to examine the issue of process sleep priorities.

vfs_cluster.c, vfs_bio.c:
8) Make clustered reads work more correctly (and more completely)
when buffers are already constituted, but not fully valid.
(This was another system reliability issue.)

vfs_subr.c, ffs_inode.c:
9) Create a vtruncbuf function, which is used by filesystems that
can truncate files. The vinvalbuf forced a file sync type operation,
while vtruncbuf only invalidates the buffers past the new end of file,
and also invalidates the appropriate pages. (This was a system reliabiliy
and performance issue.)

10) Modify FFS to use vtruncbuf.

vm_object.c:
11) Make the object rundown mechanism for OBJT_VNODE type objects work
more correctly. Included in that fix, create pager entries for
the OBJT_DEAD pager type, so that paging requests that might slip
in during race conditions are properly handled. (This was a system
reliability issue.)

vm_page.c:
12) Make some of the page validation routines be a little less picky
about arguments passed to them. Also, support page invalidation
change the object generation count so that we handle generation
counts a little more robustly.

vm_pageout.c:
13) Further reduce pageout daemon activity when the system doesn't
need help from it. There should be no additional performance
decrease even when the pageout daemon is running. (This was
a significant performance issue.)

vnode_pager.c:
14) Teach the vnode pager to handle race conditions during vnode
deallocations.


# 86ffbd76 09-Mar-1998 Mike Smith <msmith@FreeBSD.org>

Complement diagnostic messages about missing per-FS VOP page operations,
but don't make their absence fatal.
Submitted by: terry


# 8f9110f6 07-Mar-1998 John Dyson <dyson@FreeBSD.org>

This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifest themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistancy, and as a side-effect broke
the vfs.ioopt code. This code might have been committed seperately, but
almost everything is interrelated.

1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitious buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups assocated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistancy problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify it's status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and it's descendents to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.


# ffc82b0a 28-Feb-1998 John Dyson <dyson@FreeBSD.org>

1) Use a more consistent page wait methodology.
2) Do not unnecessarily force page blocking when paging
pages out.
3) Further improve swap pager performance and correctness,
including fixing the paging in progress deadlock (except
in severe I/O error conditions.)
4) Enable vfs_ioopt=1 as a default.
5) Fix and enable the page prezeroing in SMP mode.

All in all, SMP systems especially should show a significant
improvement in "snappyness."


# ce75f2c3 25-Feb-1998 Mike Smith <msmith@FreeBSD.org>

In the author's words:

These diffs implement the first stage of a VOP_{GET|PUT}PAGES pushdown
for local media FS's.

See ffs_putpages in /sys/ufs/ufs/ufs_readwrite.c for implementation
details for generic *_{get|put}pages for local media FS's. Support
is trivial to add for any FS that formerly relied on the default
behaviour of the vnode_pager in in EOPNOTSUPP cases (just copy the
ffs_getpages() code for the FS in question's *_{get|put}pages).

Obviously, it would be better if each local media FS implemented a
more optimal method, instead of calling an exported interface from
the /sys/vm/vnode_pager.c, but this is a necessary first step in
getting the FS's to a point where they can be supplied with better
implementations on a case-by-case basis.

Obviously, the cd9660_putpages() can be rather trivial (since it
is a read-only FS type 8-)).

A slight (temporary) modification is made to print a diagnostic message
in the case where the underlying filesystem attempts to engage in the
previous behaviour. Failure is likely to be ungraceful.

Submitted by: terry@freebsd.org (Terry Lambert)


# 66095752 24-Feb-1998 John Dyson <dyson@FreeBSD.org>

Fix page prezeroing for SMP, and fix some potential paging-in-progress
hangs. The paging-in-progress diagnosis was a result of Tor Egge's
excellent detective work.
Submitted by: Partially from Tor Egge.


# e47ed70b 23-Feb-1998 John Dyson <dyson@FreeBSD.org>

Significantly improve the efficiency of the swap pager, which appears to
have declined due to code-rot over time. The swap pager rundown code
has been clean-up, and unneeded wakeups removed. Lots of splbio's
are changed to splvm's. Also, set the dynamic tunables for the
pageout daemon to be more sane for larger systems (thereby decreasing
the daemon overheadla.)


# 0b08f5f7 05-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Back out DIAGNOSTIC changes.


# 95461b45 04-Feb-1998 John Dyson <dyson@FreeBSD.org>

1) Start using a cleaner and more consistant page allocator instead
of the various ad-hoc schemes.
2) When bringing in UPAGES, the pmap code needs to do another vm_page_lookup.
3) When appropriate, set the PG_A or PG_M bits a-priori to both avoid some
processor errata, and to minimize redundant processor updating of page
tables.
4) Modify pmap_protect so that it can only remove permissions (as it
originally supported.) The additional capability is not needed.
5) Streamline read-only to read-write page mappings.
6) For pmap_copy_page, don't enable write mapping for source page.
7) Correct and clean-up pmap_incore.
8) Cluster initial kern_exec pagin.
9) Removal of some minor lint from kern_malloc.
10) Correct some ioopt code.
11) Remove some dead code from the MI swapout routine.
12) Correct vm_object_deallocate (to remove backing_object ref.)
13) Fix dead object handling, that had problems under heavy memory load.
14) Add minor vm_page_lookup improvements.
15) Some pages are not in objects, and make sure that the vm_page.c can
properly support such pages.
16) Add some more page deficit handling.
17) Some minor code readability improvements.


# 47cfdb16 04-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Turn DIAGNOSTIC into a new-style option.


# eaf13dd7 31-Jan-1998 John Dyson <dyson@FreeBSD.org>

Change the busy page mgmt, so that when pages are freed, they
MUST be PG_BUSY. It is bogus to free a page that isn't busy,
because it is in a state of being "unavailable" when being
freed. The additional advantage is that the page_remove code
has a better cross-check that the page should be busy and
unavailable for other use. There were some minor problems
with the collapse code, and this plugs those subtile "holes."

Also, the vfs_bio code wasn't checking correctly for PG_BUSY
pages. I am going to develop a more consistant scheme for
grabbing pages, busy or otherwise. For now, we are stuck
with the current morass.


# 47221757 17-Jan-1998 John Dyson <dyson@FreeBSD.org>

Tie up some loose ends in vnode/object management. Remove an unneeded
config option in pmap. Fix a problem with faulting in pages. Clean-up
some loose ends in swap pager memory management.

The system should be much more stable, but all subtile bugs aren't fixed yet.


# 95e5e988 05-Jan-1998 John Dyson <dyson@FreeBSD.org>

Make our v_usecount vnode reference count work identically to the
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_object's are no longer freed gratuitiously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.

When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.

When vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.

A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.

Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon a reasonably possible.


# 2be70f79 28-Dec-1997 John Dyson <dyson@FreeBSD.org>

Lots of improvements, including restructring the caching and management
of vnodes and objects. There are some metadata performance improvements
that come along with this. There are also a few prototypes added when
the need is noticed. Changes include:

1) Cleaning up vref, vget.
2) Removal of the object cache.
3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore.
4) Correct some missing LK_RETRY's in vn_lock.
5) Correct the page range in the code for msync.

Be gentle, and please give me feedback asap.


# 1efb74fb 19-Dec-1997 John Dyson <dyson@FreeBSD.org>

Some performance improvements, and code cleanups (including changing our
expensive OFF_TO_IDX to btoc whenever possible.)


# ab3f7469 02-Dec-1997 Poul-Henning Kamp <phk@FreeBSD.org>

In all such uses of struct buf: 's/b_un.b_addr/b_data/g'


# e7b0208f 05-Oct-1997 John Dyson <dyson@FreeBSD.org>

Relax the vnode locking for read only operations.


# 79624e21 31-Aug-1997 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# b9dcd593 25-Aug-1997 Bruce Evans <bde@FreeBSD.org>

Fixed type mismatches for functions with args of type vm_prot_t and/or
vm_inherit_t. These types are smaller than ints, so the prototypes
should have used the promoted type (int) to match the old-style function
definitions. They use just vm_prot_t and/or vm_inherit_t. This depends
on gcc features to work. I fixed the definitions since this is easiest.
The correct fix may be to change the small types to u_int, to optimize
for time instead of space.


# 89721f6f 21-Aug-1997 John Dyson <dyson@FreeBSD.org>

This is a trial improvement for the vnode reference count while on the vnode
free list problem. Also, the vnode age flag is no longer used by the
vnode pager. (It is actually incorrect to use then.) Constructive
feedback welcome -- just be kind.


# 32ad9cb5 19-May-1997 Doug Rabson <dfr@FreeBSD.org>

Fix a few bugs with NFS and mmap caused by NFS' use of b_validoff
and b_validend. The changes to vfs_bio.c are a bit ugly but hopefully
can be tidied up later by a slight redesign.

PR: kern/2573, kern/2754, kern/3046 (possibly)
Reviewed by: dyson


# eb2c768e 07-Mar-1997 John Dyson <dyson@FreeBSD.org>

When removing IN_RECURSE support during the Lite/2 merge, read/write
to/from mmaped regions was broken. This commit fixes the breakage, and
uses the new Lite/2 locking mechanisms.


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 996c772f 09-Feb-1997 John Dyson <dyson@FreeBSD.org>

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# 09841510 24-Jan-1997 David Greenman <dg@FreeBSD.org>

Added a check/panic for v_usecount being 0 (no vnode reference) in
vnode_pager_alloc().


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# ad980522 16-Oct-1996 John Dyson <dyson@FreeBSD.org>

Clean up the rundown of the object backing a vnode. This should fix
NFS problems associated with forcible dismounts.


# 9fea9a6f 09-Sep-1996 John Dyson <dyson@FreeBSD.org>

The whole issue of not support VOP_LOCK for VBLK devices should be
rethought. This fixes YET another problem with unmounting filesystems.
The root cause is not fixed here, but at least the problem has gone
away.


# 6476c0d2 21-Aug-1996 John Dyson <dyson@FreeBSD.org>

Even though this looks like it, this is not a complex code change.
The interface into the "VMIO" system has changed to be more consistant
and robust. Essentially, it is now no longer necessary to call vn_open
to get merged VM/Buffer cache operation, and exceptional conditions
such as merged operation of VBLK devices is simpler and more correct.

This code corrects a potentially large set of problems including the
problems with ktrace output and loaded systems, file create/deletes,
etc.

Most of the changes to NFS are cosmetic and name changes, eliminating
a layer of subroutine calls. The direct calls to vput/vrele have
been re-instituted for better cross platform compatibility.

Reviewed by: davidg


# 67bf6868 29-Jul-1996 John Dyson <dyson@FreeBSD.org>

Backed out the recent changes/enhancements to the VM code. The
problem with the 'shell scripts' was found, but there was a 'strange'
problem found with a 486 laptop that we could not find. This commit
backs the code back to 25-jul, and will be re-entered after the snapshot
in smaller (more easily tested) chunks.


# 4f4d35ed 26-Jul-1996 John Dyson <dyson@FreeBSD.org>

This commit is meant to solve a couple of VM system problems or
performance issues.

1) The pmap module has had too many inlines, and so the
object file is simply bigger than it needs to be.
Some common code is also merged into subroutines.
2) Removal of some *evil* PHYS_TO_VM_PAGE macro calls.
Unfortunately, a few have needed to be added also.
The removal caused the need for more vm_page_lookups.
I added lookup hints to minimize the need for the
page table lookup operations.
3) Removal of some bogus performance improvements, that
mostly made the code more complex (tracking individual
page table page updates unnecessarily). Those improvements
actually hurt 386 processors perf (not that people who
worry about perf use 386 processors anymore :-)).
4) Changed pv queue manipulations/structures to be TAILQ's.
5) The pv queue code has had some performance problems since
day one. Some significant scalability issues are resolved
by threading the pv entries from the pmap AND the physical
address instead of just the physical address. This makes
certain pmap operations run much faster. This does
not affect most micro-benchmarks, but should help loaded system
performance *significantly*. DG helped and came up with most
of the solution for this one.
6) Most if not all pmap bit operations follow the pattern:
pmap_test_bit();
pmap_clear_bit();
That made for twice the necessary pv list traversal. The
pmap interface now supports only pmap_tc_bit type operations:
pmap_[test/clear]_modified, pmap_[test/clear]_referenced.
Additionally, the modified routine now takes a vm_page_t arg
instead of a phys address. This eliminates a PHYS_TO_VM_PAGE
operation.
7) Several rewrites of routines that contain redundant code to
use common routines, so that there is a greater likelihood of
keeping the cache footprint smaller.


# aa8de40a 03-May-1996 Poul-Henning Kamp <phk@FreeBSD.org>

Another sweep over the pmap/vm macros, this time with more focus on
the usage. I'm not satisfied with the naming, but now at least there is
less bogus stuff around.


# ad5dd234 18-Mar-1996 John Dyson <dyson@FreeBSD.org>

Fix the problem that unmounting filesystems that are backed by a VMIO
device have reference count problems. We mark the underlying object
ono-persistent, and account for the reference count that the VM system
maintainsfor the special device close. This should fix the removable
device problem.


# 8169788f 11-Mar-1996 Peter Wemm <peter@FreeBSD.org>

Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all
files are off the vendor branch, so this should not change anything.

A "U" marker generally means that the file was not changed in between
the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally
means that there was a change.


# bd7e5f99 18-Jan-1996 John Dyson <dyson@FreeBSD.org>

Eliminated many redundant vm_map_lookup operations for vm_mmap.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do alot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.


# d63596ce 17-Dec-1995 John Dyson <dyson@FreeBSD.org>

Fix paging from ext2fs (and other fs w/block size < PAGE_SIZE). This
should fix kern/900.


# f708ef1b 14-Dec-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Another mega commit to staticize things.


# a316d390 10-Dec-1995 John Dyson <dyson@FreeBSD.org>

Changes to support 1Tb filesizes. Pages are now named by an
(object,index) pair instead of (object,offset) pair.


# efeaf95a 06-Dec-1995 David Greenman <dg@FreeBSD.org>

Untangled the vm.h include file spaghetti.


# 3af76890 19-Nov-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unused vars & funcs, make things static, protoize a little bit.


# 0b8253a7 30-Oct-1995 Bruce Evans <bde@FreeBSD.org>

Don't pass an extra trailing arg to some functions.

Added the prototypes that found this bug.


# 2c4488fc 22-Oct-1995 John Dyson <dyson@FreeBSD.org>

Finalize GETPAGES layering scheme. Move the device GETPAGES
interface into specfs code. No need at this point to modify the
PUTPAGES stuff except in the layered-type (NULL/UNION) filesystems.


# eed2d59b 19-Oct-1995 David Greenman <dg@FreeBSD.org>

Fix initialization of "bsize" in vnode_pager_haspage(). It must happen
after the check for the mount point still existing or else the system
will panic if someone forcibly unmounted the filesystem.


# 6eab77f2 12-Sep-1995 John Dyson <dyson@FreeBSD.org>

Fix really bogus casting of a block number to a long. Also change the
comparison from a "< 0" to "== -1" like it should be.


# b1fc01b7 10-Sep-1995 John Dyson <dyson@FreeBSD.org>

Fix an error that can cause attempted reading beyond the end of file.


# ced399ee 05-Sep-1995 John Dyson <dyson@FreeBSD.org>

Minor performance improvements, additional prototype for additional
exported symbol.


# 170db9c6 03-Sep-1995 John Dyson <dyson@FreeBSD.org>

Allow the fault code to use additional clustering info from both
bmap and the swap pager. Improved fault clustering performance.


# c83ebe77 03-Sep-1995 John Dyson <dyson@FreeBSD.org>

Added VOP_GETPAGES/VOP_PUTPAGES and also the "backwards" block count
for VOP_BMAP. Updated affected filesystems...


# 24a1cce3 13-Jul-1995 David Greenman <dg@FreeBSD.org>

NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!

Much needed overhaul of the VM system. Included in this first round of
changes:

1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".

2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.

3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.

4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.

5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.

6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.

7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.

8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.

9) Some almost useless debugging code removed.

10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.

11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.

12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).

13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.

14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)

TODO:

1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.

2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.

3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.

4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.

5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).


# 06cb7259 09-Jul-1995 David Greenman <dg@FreeBSD.org>

Moved call to VOP_GETATTR() out of vnode_pager_alloc() and into the places
that call vnode_pager_alloc() so that a failure return can be dealt with.
This fixes a panic seen on NFS clients when a file being opened is deleted
on the server before the open completes.


# 39d38f93 06-Jul-1995 David Greenman <dg@FreeBSD.org>

Fixed an object allocation race condition that was causing a "object
deallocated too many times" panic when using NFS.

Reviewed by: John Dyson


# aa2cabb9 27-Jun-1995 David Greenman <dg@FreeBSD.org>

1) Converted v_vmdata to v_object.
2) Removed unnecessary vm_object_lookup()/pager_cache(object, TRUE) pairs
after vnode_pager_alloc() calls - the object is already guaranteed to be
persistent.
3) Removed some gratuitous casts.


# 9b2e5354 30-May-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Remove trailing whitespace.


# 5f55e841 17-May-1995 David Greenman <dg@FreeBSD.org>

Accessing pages beyond the end of a mapped file results in internal
inconsistencies in the VM system that eventually lead to a panic. These
changes fix the behavior to conform to the behavior in SunOS, which is
to deny faults to pages beyond the EOF (returning SIGBUS). Internally,
this is implemented by requiring faults to be within the object size
boundaries. These changes exposed another bug, namely that passing in
an offset to mmap when trying to map an unnamed anonymous region also
results in internal inconsistencies. In this case, the offset is forced
to zero.

Reviewed by: John Dyson and others


# ee3a64c9 10-May-1995 David Greenman <dg@FreeBSD.org>

Changed "handle" from type caddr_t to void *; "handle" is several different
types of pointers, and "char *" is a bad choice for the type.


# f6b04d2b 09-Apr-1995 David Greenman <dg@FreeBSD.org>

Changes from John Dyson and myself:

Fixed remaining known bugs in the buffer IO and VM system.

vfs_bio.c:
Fixed some race conditions and locking bugs. Improved performance
by removing some (now) unnecessary code and fixing some broken
logic.
Fixed process accounting of # of FS outputs.
Properly handle NFS interrupts (B_EINTR).

(various)
Replaced calls to clrbuf() with calls to an optimized routine
called vfs_bio_clrbuf().

(various FS sync)
Sync out modified vnode_pager backed pages.

ffs_vnops.c:
Do two passes: Sync out file data first, then indirect blocks.

vm_fault.c:
Fixed deadly embrace caused by acquiring locks in the wrong order.

vnode_pager.c:
Changed to use buffer I/O system for writing out modified pages. This
should fix the problem with the modification date previous not getting
updated. Also dramatically simplifies the code. Note that this is
going to change in the future and be implemented via VOP_PUTPAGES().

vm_object.c:
Fixed a pile of bugs related to cleaning (vnode) objects. The performance
of vm_object_page_clean() is terrible when dealing with huge objects,
but this will change when we implement a binary tree to keep the object
pages sorted.

vm_pageout.c:
Fixed broken clustering of pageouts. Fixed race conditions and other
lockup style bugs in the scanning of pages. Improved performance.


# 1b369d98 21-Mar-1995 David Greenman <dg@FreeBSD.org>

Removed unused variable declaration missed in previous commit.


# 71263bf8 21-Mar-1995 David Greenman <dg@FreeBSD.org>

Removed do-nothing VOP_UPDATE() call.


# 7c1f6ced 20-Mar-1995 David Greenman <dg@FreeBSD.org>

Added a new boolean argument to vm_object_page_clean that causes it to
only toss out clean pages if TRUE.


# 0426122f 20-Mar-1995 David Greenman <dg@FreeBSD.org>

Don't gain/lose an object reference in vnode_pager_setsize(). It will
cause vnode locking problems in vm_object_terminate().
Implement proper vnode locking in vm_object_terminate().


# 0bdb7528 19-Mar-1995 David Greenman <dg@FreeBSD.org>

Do proper vnode locking when doing paging I/O. Removed the asynchronous
paging capability to facilitate this (we saw little or no measureable
improvement with this anyway).

Submitted by: John Dyson


# c01a9b8c 18-Mar-1995 David Greenman <dg@FreeBSD.org>

Incorporated 4.4-lite vnode_pager_uncache() and vnode_pager_umount()
routines (and merged local changes). The changed vnode_pager_uncache
gets rids of the bogosity that you can call the routine without
having the vnode locked. The changed vnode_pager_umount properly locks
the vnode before calling vnode_pager_uncache.


# b5e8ce9f 16-Mar-1995 Bruce Evans <bde@FreeBSD.org>

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# 4bb62461 12-Mar-1995 David Greenman <dg@FreeBSD.org>

Explicitly set object->flags = OBJ_CANPERSIST.


# 00072442 07-Mar-1995 David Greenman <dg@FreeBSD.org>

Set VAGE flag when pager is destroyed. This usually happens when an
object has fallen off the end of the cached list - this is likely the
last reference to the vnode and it should be reused before non file
vnodes that are already on the free list (VDIR mostly).


# f919ebde 01-Mar-1995 David Greenman <dg@FreeBSD.org>

Various changes from John and myself that do the following:

New functions create - vm_object_pip_wakeup and pagedaemon_wakeup that
are used to reduce the actual number of wakeups.
New function vm_page_protect which is used in conjuction with some new
page flags to reduce the number of calls to pmap_page_protect.
Minor changes to reduce unnecessary spl nesting.
Rewrote vm_page_alloc() to improve readability.
Various other mostly cosmetic changes.


# b106f3b2 23-Feb-1995 David Greenman <dg@FreeBSD.org>

Removed redundant HOLDRELE()'s.


# 187f0071 22-Feb-1995 David Greenman <dg@FreeBSD.org>

Changed return value from vnode_pager_addr to be in DEV_BSIZE units so
that 9 bits aren't lost in the conversion. Changed all callers to expect
this. This allows paging on large (>2GB) filesystems.

Submitted by: John Dyson


# c0503609 22-Feb-1995 David Greenman <dg@FreeBSD.org>

Only do object paging_in_progress wakeups if someone is waiting on this
condition.

Submitted by: John Dyson


# 7fb0c17e 20-Feb-1995 David Greenman <dg@FreeBSD.org>

Deprecated remaining use of vm_deallocate. Deprecated vm_allocate_with_
pager(). Almost completely rewrote vm_mmap(); when John gets done with
the bottom half, it will be a complete rewrite. Deprecated most use of
vm_object_setpager(). Removed side effect of setting object persist
in vm_object_enter and moved this into the pager(s). A few other
cosmetic changes.


# efc68ce1 02-Feb-1995 David Greenman <dg@FreeBSD.org>

Fixed bmap run-length brokeness.
Use bmap run-length extension when doing clustered paging.

Submitted by: John Dyson


# 6d40c3d3 24-Jan-1995 David Greenman <dg@FreeBSD.org>

Added ability to detect sequential faults and DTRT. (swap_pager.c)
Added hook for pmap_prefault() and use symbolic constant for new third
argument to vm_page_alloc() (vm_fault.c, various)
Changed the way that upages and page tables are held. (vm_glue.c)
Fixed architectural flaw in allocating pages at interrupt time that was
introduced with the merged cache changes. (vm_page.c, various)
Adjusted some algorithms to acheive better paging performance and to
accomodate the fix for the architectural flaw mentioned above. (vm_pageout.c)
Fixed pbuf handling problem, changed policy on handling read-behind page.
(vnode_pager.c)

Submitted by: John Dyson


# a7489784 11-Jan-1995 David Greenman <dg@FreeBSD.org>

Fixed a panic that Garrett reported to me...the OBJ_INTERNAL flag wasn't
being cleared in some cases for vnode backed objects; we now do this in
vnode_pager_alloc proper to guarantee it. Also be more careful in the
rcollapse code about messing with busy/bmapped pages.


# 0d94caff 09-Jan-1995 David Greenman <dg@FreeBSD.org>

These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.

The majority of the merged VM/cache work is by John Dyson.

The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.

vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.

vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.

vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.

vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.

vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.

pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.

vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.

proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.

swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.

machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.

machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.

ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.

Submitted by: John Dyson and David Greenman


# 4abc71c0 24-Nov-1994 David Greenman <dg@FreeBSD.org>

Don't try to page to a vnode that had it's filesystem unmounted.


# bf556a16 16-Nov-1994 Justin T. Gibbs <gibbs@FreeBSD.org>

Remove a peice of commented out code that was left over from the early
stages of debugging LFS:

* if we can't bmap, use old VOP code
*/
! if (/* (vp->v_mount && vp->v_mount->mnt_stat.f_type == MOUNT_LFS) || */
! VOP_BMAP(vp, foff, &dp, 0, 0)) {
for (i = 0; i < count; i++) {
if (i != reqpage) {
vnode_pager_freepage(m[i]);
--- 804,810 ----
/*
* if we can't bmap, use old VOP code
*/
! if (VOP_BMAP(vp, foff, &dp, 0, 0)) {

Reviewed by: gibbs
Submitted by: John Dyson


# 317205ca 13-Nov-1994 David Greenman <dg@FreeBSD.org>

Fixed bug where a read-behind to a negative offset would occur if the
fault was at offset 0 in the object. This resulted in more overhead but
was othewise benign. Added incore() check in vnode_pager_has_page()
to work around a problem with LFS...other than slightly higher overhead,
this change has no affect on UFS.


# a83c285c 06-Nov-1994 David Greenman <dg@FreeBSD.org>

Fixed return status from pagers. Ahem...the previous method would manufacture
data when it couldn't get it legitimately. :-(

Submitted by: John Dyson


# 976e77fc 15-Oct-1994 David Greenman <dg@FreeBSD.org>

1) Some of the counters in the vmmeter struct don't fit well into the Mach VM
scheme of things, so I've changed them to be more appropriate. page in/ous
are now associated with the pager that did them. Nuked v_fault as the
only fault of interest that wouldn't be already counted in v_trap is a VM
fault, and this is counted seperately.
2) Implemented most of the remaining counters and corrected the counting of
some that were done wrong. They are all almost correct now...just a few
minor ones left to fix.


# b73f3b1d 13-Oct-1994 David Greenman <dg@FreeBSD.org>

Got rid of redundant declaration warnings.


# 0a99546c 14-Oct-1994 Jordan K. Hubbard <jkh@FreeBSD.org>

Add missing )'s to previous midnight changes. :-)


# defb6744 13-Oct-1994 David Greenman <dg@FreeBSD.org>

Changed I/O error messages to be somewhat less cryptic. Removed a piece
of unused code.


# 05f0fdd2 08-Oct-1994 Poul-Henning Kamp <phk@FreeBSD.org>

Cosmetics: unused vars, ()'s, #include's &c &c to silence gcc.
Reviewed by: davidg


# 8e58bf68 05-Oct-1994 David Greenman <dg@FreeBSD.org>

Stuff object into v_vmdata rather than pager. Not important which at
the moment, but will be in the future. Other changes mostly cosmetic,
but are made for future VMIO considerations.

Submitted by: John Dyson


# db141545 06-Sep-1994 David Greenman <dg@FreeBSD.org>

Disabled a debugging printf.


# fff93ab6 29-Aug-1994 David Greenman <dg@FreeBSD.org>

Patches from John Dyson to improve swap code efficiency.
Religiously add back pmap_clear_modify() in vnode_pager_input until we figure
out why system performance isn't what we expect.

Submitted by: John Dyson (swap_pager) & David Greenman (vnode_pager)


# a481f200 07-Aug-1994 David Greenman <dg@FreeBSD.org>

Provide support for upcoming merged VM/buffer cache, and fixed a few bugs
that haven't appeared to manifest themselves (yet).

Submitted by: John Dyson


# c87801fe 06-Aug-1994 David Greenman <dg@FreeBSD.org>

Fixed various prototype problems with the pmap functions and the subsequent
problems that fixing them caused.


# 16f62314 06-Aug-1994 David Greenman <dg@FreeBSD.org>

Incorporated post 1.1.5 work from John Dyson. This includes performance
improvements via the new routines pmap_qenter/pmap_qremove and pmap_kenter/
pmap_kremove. These routine allow fast mapping of pages for those
architectures that have "normal" MMUs. Also included is a fix to the
pageout daemon to properly check a queue end condition.

Submitted by: John Dyson


# bbc0ec52 03-Aug-1994 David Greenman <dg@FreeBSD.org>

Integrated VM system improvements/fixes from FreeBSD-1.1.5.


# 26f9a767 25-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources