History log of /openbsd-current/sys/kern/vfs_subr.c
Revision Date Author Comments
# 1.319 03-Feb-2024 beck

Remove Softdep.

Softdep has been a no-op for some time now, this removes it to get
it out of the way.

Flensing mostly done in Tallinn, with some help from krw@

ok deraadt@


Revision tags: OPENBSD_7_3_BASE OPENBSD_7_4_BASE
# 1.318 26-Dec-2022 miod

Replace two (void)copystr(..., NULL) with equivalent (void)strlcpy() calls.

ok millert@


Revision tags: OPENBSD_7_2_BASE
# 1.317 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


# 1.316 12-Aug-2022 visa

Put more struct vnode fields under splbio().

Buffer cache related struct vnode fields can be accessed in interrupt
context. Be more consistent with the use of splbio().

OK mpi@


Revision tags: OPENBSD_7_1_BASE
# 1.315 27-Mar-2022 semarie

sys/vnode.h cleanup for vnode_hold_list, vnode_free_list, struct freelst

vnode_hold_list and vnode_free_list aren't used outside kern/vfs_subr.c

move `struct freelst` where used in kern/vfs_subr.c

no intended behaviour changes. survived a release(8) build.

ok millert@


# 1.314 25-Jan-2022 gnezdo

Capture a repeated pattern into sysctl_securelevel_int function

A few variables in the kernel are only writeable before securelevel is
raised. It makes sense to handle them with less code.

OK sthen@ bluhm@


# 1.313 25-Oct-2021 claudio

Revert commitid: ufM9BcSbXqfLpzBH;
Move vfs_stall_barrier() from the fd layer into vn_lock() and the vfs layer.
In some cases it can result in a deadlock while suspending.
Discussed with mpi@ and deraadt@


# 1.312 24-Oct-2021 jsg

use NULL not 0 for pointer values in kern
ok semarie@


# 1.311 23-Oct-2021 mpi

Sprinkle uvm_obj_destroy() over UVM object recycling code.

For now, only assert that the tree of pages is empty in uvm_obj_destroy().
This will soon be used to free the per-UVM object lock.

While here call uvm_obj_init() when new vnodes are allocated instead of
in uvn_attach(). Because vnodes and their associated UVM object are
currently never freed, it isn't easy to know where/when to garbage
collect the associated lock. So simply check that the reference of a
given object is 0 when uvn_attach().

Tested by many as part of a bigger diff.

ok kettenis@


# 1.310 23-Oct-2021 mpi

Assert that the KERNEL_LOCK() is held in vref(9).

This is a guard against pushing the lock too far in UVM's vnode land.

ok beck@


# 1.309 21-Oct-2021 claudio

Move vfs_stall_barrier() from the fd layer into vn_lock() and the vfs layer.
vfs stalling is used by suspend/resume and by vmt(4) to stall any
filesystem operation from altering the state on disk. All these
operations will call vn_lock and be stalled. Adjust vfs_stall_barrier()
to allow the lock owner to still progress so that suspend can sync
the filesystems after stalling vfs operation.
OK mpi@


# 1.308 20-Oct-2021 semarie

revert vnode: remove VLOCKSWORK and check locking when vop_islocked != nullop
(both kernel and userland bits)

GENERIC + VFSLCKDEBUG is broken with it.


# 1.307 19-Oct-2021 semarie

vnode: remove VLOCKSWORK and check locking when vop_islocked != nullop

This flag is currently used to mark or unmark a vnode to actively
check vnode locking semantic (when compiled with VFSLCKDEBUG).

Currently, the VLOCKSWORK flag isn't properly set for several FS
implementations which have full locking support. This commit enables
proper checking for them too (cd9660, udf, fuse, msdosfs, tmpfs).

Instead of using a particular flag, it directly checks whether
v_op->vop_islocked is nullop to decide whether to activate the vnode
locking checks.

ok mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.306 31-Aug-2021 claudio

Swap lock flags so that LK_EXCLUSIVE is first like in all other places.


# 1.305 28-Apr-2021 claudio

Introduce a global vnode_mtx and use it to make vn_lock() safe to be called
without the KERNEL_LOCK.
This moves VXLOCK and VXWANT to a mutex protected v_lflag field and also
v_lockcount is protected by this mutex.

The vn_lock() dance is overly complex and all of this should probably be replaced
by a proper lock on the vnode but such a diff is a lot more complex. This
is an intermediate step so that at least some calls can be modified to grab
the KERNEL_LOCK later or not at all.

OK mpi@


Revision tags: OPENBSD_6_9_BASE
# 1.304 29-Jan-2021 claudio

Use NULL instead of 0 to clear v_socket pointer (which actually clears all
of the v_un pointers).
OK jsg@ mvs@


Revision tags: OPENBSD_6_8_BASE
# 1.303 23-Aug-2020 kn

Remove unused debug_syncprt, improve debug sysctl handling

"syncprt" is unused since kern/vfs_syscalls.c r1.147 from 2008.

Adding new debug sysctls is a bit opaque and looking at kern/kern_sysctl.c
the only visible difference between used and stub ctldebug structs in the
debugvars[] array is their extern keyword, indicating that it is defined
elsewhere.

sys/sysctl.h declares all debugN members as extern upfront, but these
declarations are not needed.

Remove the unused debug sysctl, rename the only remaining one to something
meaningful and remove forward declarations from /sys/sysctl.h; this way,
adding new debug sysctls is a matter of adding extern and coming up with a
name, which is nicer to read on its own and better to grep for.

OK mpi


# 1.302 22-Aug-2020 kn

Move sysctl(2) CTL_DEBUG from DEBUG to new DEBUG_SYSCTL

Adding "debug.my-knob" sysctls is really helpful to select different
code paths and/or log on demand during runtime without recompile,
but as this code is under DEBUG, lots of other noise comes with it
which is often undesired, at least when looking at specific subsystems
only.

Adding globals to the kernel and breaking into DDB to change them helps,
but that does not work over SSH, hence the need for debug sysctls.

Introduces DEBUG_SYSCTL to make use of the "debug" MIB without the rest of
DEBUG; it's DEBUG_SYSCTL and not SYSCTL_DEBUG because it's not a general
option for all of sysctl(2).

OK gnezdo


Revision tags: OPENBSD_6_7_BASE
# 1.301 27-Mar-2020 anton

Relax the lockcount assertion in vputonfreelist(). Back when I fixed
several problems with the vnode exclusive lock implementation, I
overlooked the fact that a vnode can be in a state where the usecount is
zero while the holdcount is still positive. There could still be
threads waiting on the vnode lock in uvn_io() as long as the holdcount
is positive.

"go ahead" mpi@

Reported-by: syzbot+767d6deb1a647850a0ca@syzkaller.appspotmail.com


# 1.300 13-Feb-2020 claudio

Move the LK_DRAIN logic from VOP_LOCK() to vclean() the only caller of
VOP_LOCK with LK_DRAIN. This simplifies VOP_LOCK() a fair bit.
OK visa@


# 1.299 20-Jan-2020 claudio

struct vops is not modified during runtime so use const which moves each
into read-only data segment.
OK deraadt@ tedu@


# 1.298 10-Jan-2020 bluhm

Convert the vnode list at the mount point into a tailq. During
unmount this list is traversed and the dirty vnodes are flushed to
disk. Forced unmount expects that the list is empty after flushing,
otherwise the kernel panics with "dangling vnode". As the write
to disk can sleep, new vnodes may be inserted. If softdep is
enabled, resolving the dependencies creates new dirty vnodes and
inserts them to the list. To fix the panic, let insmntque() insert
new vnodes at the tail of the list. Then vflush() will still catch
them while traversing the list in forward direction.
OK tedu@ millert@ visa@
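
The effect of inserting at the tail can be sketched outside the kernel
with the sys/queue.h TAILQ macros. This is only an illustration, not the
vflush() or insmntque() code: struct sketch_node, sketch_walk() and
sketch_demo() are hypothetical names. A node inserted at the tail during
a forward walk is still reached, while one inserted at the head is missed.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a vnode on a mount point's list. */
struct sketch_node {
	int id;
	int visited;
	TAILQ_ENTRY(sketch_node) link;
};
TAILQ_HEAD(sketch_list, sketch_node);

/*
 * Walk the list front to back and mark each node visited. While the
 * node with id == trigger is being processed, insert `extra' once,
 * either at the tail or at the head. Returns the number of nodes seen.
 */
int
sketch_walk(struct sketch_list *list, int trigger,
    struct sketch_node *extra, int at_tail)
{
	struct sketch_node *n;
	int count = 0;

	assert(list != NULL);
	TAILQ_FOREACH(n, list, link) {
		n->visited = 1;
		count++;
		if (extra != NULL && n->id == trigger) {
			if (at_tail)
				TAILQ_INSERT_TAIL(list, extra, link);
			else
				TAILQ_INSERT_HEAD(list, extra, link);
			extra = NULL;	/* insert only once */
		}
	}
	return count;
}

/* Returns 1 if the late-inserted node was visited by the walk. */
int
sketch_demo(int at_tail)
{
	struct sketch_list list = TAILQ_HEAD_INITIALIZER(list);
	struct sketch_node a = { .id = 1 }, b = { .id = 2 };
	struct sketch_node late = { .id = 3 };

	TAILQ_INSERT_TAIL(&list, &a, link);
	TAILQ_INSERT_TAIL(&list, &b, link);
	sketch_walk(&list, 1, &late, at_tail);
	return late.visited;
}
```

With at_tail set the late node is visited; with head insertion it is
left behind, which is the "dangling vnode" situation in miniature.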


# 1.297 30-Dec-2019 bluhm

In vcount() a safe loop over vnodes was committed to 4.4BSD in 1994.
This is not necessary as the loop is restarted after vgone(). Switch
to SLIST_FOREACH without _SAFE.
OK visa@


# 1.296 27-Dec-2019 bluhm

Convert the speclisth hash buckets into SLIST macros. This makes
the vnode alias code more readable.
OK visa@


# 1.295 26-Dec-2019 bluhm

Fix white spaces.


# 1.294 08-Dec-2019 mpi

Convert infinite sleeps to tsleep_nsec(9).

ok visa@, jca@


Revision tags: OPENBSD_6_6_BASE
# 1.293 26-Aug-2019 anton

When a thread tries to exclusively lock a vnode, the same thread must
ensure that any other thread currently trying to acquire the underlying
vnode lock has observed that the same vnode is about to be exclusively
locked. Such threads must then sleep until the exclusive lock has been
released and then try to acquire the lock again. Otherwise, exclusive
access to the vnode cannot be guaranteed.

Thanks to naddy@ and visa@ for testing; ok visa@

Reported-by: syzbot+374d0e7e2400004957f7@syzkaller.appspotmail.com


# 1.292 25-Jul-2019 cheloha

vinvalbuf(9): tsleep(9) -> tsleep_nsec(9); ok millert@


# 1.291 19-Jul-2019 cheloha

vwaitforio(9): tsleep(9) -> tsleep_nsec(9); ok visa@


# 1.290 28-Jun-2019 visa

Skip VFS barrier lock during normal operation to reduce overhead.
This removes a system-wide serialization point, which might help
finding timing-related bugs.

OK deraadt@ anton@


# 1.289 09-Jun-2019 beck

Add a temporary workaround to make removal of giant files better

mlarkin@ noticed we would freeze while removing enormous files because
of the amount of work done to invalidate buffers on unlink. This adds
a temporary workaround to ensure we give up the lock and yield while
doing this.

The longer term answer will be to move these buffers to another list
and not do the work here.

ok deraadt@


# 1.288 19-Apr-2019 visa

Add a subsystem lock for vfs_lockf.c. This enables calling lf_advlock()
and lf_purgelocks() without the kernel lock.

OK anton@ mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.287 02-Apr-2019 visa

Restrict which filesystems are available for swap. This rules out
obvious misconfigurations that cannot work.

OK mpi@ tedu@


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa
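
A minimal sketch of the idea, assuming a single sticky flag per vnode.
The struct and function names here are hypothetical and do not match the
kernel's; whether and when the real flag is cleared is not modeled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a vnode; only the sticky flag is modeled. */
struct sketch_vnode {
	int write_error_seen;
};

/* Called when a buffer write completes; remember any failure. */
void
sketch_write_done(struct sketch_vnode *vp, int error)
{
	assert(vp != NULL);
	if (error != 0)
		vp->write_error_seen = 1;
}

/* Report EIO if any earlier write on this vnode failed. */
int
sketch_fsync(const struct sketch_vnode *vp)
{
	assert(vp != NULL);
	return vp->write_error_seen ? EIO : 0;
}
```

Once a write has failed, a later successful write does not hide the
error from the next sketch_fsync() call.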


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules are no longer supported, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
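
The pattern can be sketched as follows. The enum and table are
hypothetical, not the ones touched by this commit. When the compiler
picks an unsigned underlying type for the enum, a "k >= 0" test is
trivially true, which is what the warning is about; a single unsigned
comparison covers both ends of the range regardless of signedness.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical enum used as an array index. */
enum sketch_kind { SK_NONE, SK_REG, SK_DIR, SK_MAX };

static const char *const sketch_names[] = { "none", "reg", "dir" };

const char *
sketch_kind_name(enum sketch_kind k)
{
	/* The name table must have exactly one entry per valid kind. */
	assert(sizeof(sketch_names) / sizeof(sketch_names[0]) == SK_MAX);

	/* One unsigned comparison rejects negative and too-large values. */
	if ((unsigned int)k >= SK_MAX)
		return "?";
	return sketch_names[k];
}
```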


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
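
Why the order matters can be sketched with two fields of different
width and signedness. The struct below is hypothetical and much smaller
than struct vattr; SKETCH_NOVAL merely mimics a -1 "no value" marker.
The value of an assignment expression is the value stored in its left
operand, so a narrow unsigned field in the middle of a chain changes
what the fields to its left receive.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NOVAL	(-1)

/* Hypothetical attribute struct with mixed field types. */
struct sketch_attr {
	uint16_t narrow;
	int64_t wide;
};

/*
 * Unsafe order: (a->narrow = -1) has the value 65535, so a->wide
 * ends up as 65535 instead of -1.
 */
void
sketch_chained(struct sketch_attr *a)
{
	a->wide = a->narrow = SKETCH_NOVAL;
}

/* Separate statements: each field converts -1 on its own. */
void
sketch_separate(struct sketch_attr *a)
{
	a->narrow = SKETCH_NOVAL;
	a->wide = SKETCH_NOVAL;
	assert(a->wide == SKETCH_NOVAL);
}
```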


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in bytes, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and can die.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continuing performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__((__format__(__kprintf__))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@
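
A rough sketch of the resulting check, written against plain arguments
rather than struct ucred. The function name and parameter layout are
assumptions made for illustration and are not the kernel's
groupmember() signature.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Return 1 if gid matches the effective gid or any supplementary
 * group, 0 otherwise.
 */
int
sketch_groupmember(gid_t gid, gid_t egid, const gid_t *groups,
    size_t ngroups)
{
	size_t i;

	assert(groups != NULL || ngroups == 0);

	/* The effective gid counts even if it is not in the list. */
	if (gid == egid)
		return 1;
	for (i = 0; i < ngroups; i++)
		if (groups[i] == gid)
			return 1;
	return 0;
}
```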


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFSLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell, never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.
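
The "simple struct full of function pointers" approach from 1.190 can be sketched roughly as follows. All names here (vops_sketch, vnode_sketch, examplefs_*) are illustrative assumptions and are not the kernel's actual definitions:

```c
#include <assert.h>
#include <stddef.h>

struct vnode_sketch;

/* Hypothetical operations table: one function pointer per operation. */
struct vops_sketch {
	int	(*vop_open)(struct vnode_sketch *);
	int	(*vop_close)(struct vnode_sketch *);
};

/* Hypothetical vnode carrying a pointer to its filesystem's table. */
struct vnode_sketch {
	const struct vops_sketch *v_op;	/* set directly by the filesystem */
	int	v_opencount;
};

static int
examplefs_open(struct vnode_sketch *vp)
{
	vp->v_opencount++;
	return 0;
}

static int
examplefs_close(struct vnode_sketch *vp)
{
	vp->v_opencount--;
	return 0;
}

/* Each filesystem fills in its own table of operations. */
const struct vops_sketch examplefs_vops = {
	.vop_open = examplefs_open,
	.vop_close = examplefs_close,
};

/* Dispatch is a plain indirect call through the table. */
int
vop_open_sketch(struct vnode_sketch *vp)
{
	assert(vp->v_op != NULL && vp->v_op->vop_open != NULL);
	return vp->v_op->vop_open(vp);
}
```

In this shape there is no operation vector to build at boot; a filesystem that wants different behaviour supplies a different table.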


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in filesystems ported from freebsd. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro
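
One way to picture the idea in 1.174, as a userland sketch rather than the kernel's checkdirs(): a release function reports whether it might have slept, and the caller restarts its list walk only in that case. Every name below is hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference-counted object standing in for a vnode. */
struct obj_sketch {
	int	refs;			/* reference count */
	int	last_release_sleeps;	/* nonzero: dropping the last ref may sleep */
};

/*
 * Drop one reference.  Returns 0 only when the call is known not to
 * have slept, nonzero when it may have slept.
 */
int
release_sketch(struct obj_sketch *o)
{
	assert(o->refs > 0);
	o->refs--;
	if (o->refs == 0 && o->last_release_sleeps)
		return 1;	/* slow path: cleanup that may sleep */
	return 0;
}

/* Hypothetical process entry holding a current-directory reference. */
struct proc_sketch {
	struct obj_sketch *cwd;
	struct proc_sketch *next;
};

/*
 * Point every process whose cwd is olddir at newdir.  The walk is
 * restarted only when a release may have slept, since the list could
 * have changed in the meantime.  Returns the number of restarts.
 */
int
replace_cwd_sketch(struct proc_sketch *head, struct obj_sketch *olddir,
    struct obj_sketch *newdir)
{
	struct proc_sketch *p;
	int restarts = 0;

again:
	for (p = head; p != NULL; p = p->next) {
		if (p->cwd != olddir)
			continue;
		newdir->refs++;
		p->cwd = newdir;
		if (release_sketch(olddir) != 0) {
			restarts++;
			goto again;
		}
	}
	return restarts;
}
```

Without the return value the caller would have to assume the worst and restart after every release, which is the cost the commit message describes avoiding.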


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''; thanks for looking).
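
The conversion in 1.172 folds a separate zeroing call into an allocation flag. A userland analogue, assuming a hypothetical sketch_get() and SK_ZERO flag that merely stand in for pool_get() and PR_ZERO:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SK_ZERO	0x1	/* hypothetical flag, standing in for PR_ZERO */

/*
 * Allocate size bytes; when SK_ZERO is set, hand the memory back
 * already zeroed so callers do not need a separate bzero()/memset().
 */
void *
sketch_get(size_t size, int flags)
{
	void *p;

	assert(size > 0);
	p = malloc(size);
	if (p != NULL && (flags & SK_ZERO))
		memset(p, 0, size);
	return p;
}
```

Each call site then shrinks from an allocation followed by a zeroing call to a single call with one extra flag.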


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8; the code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow pulling out a usb stick with ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into an own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing lengths checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from ours (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches from McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a file's owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.318 26-Dec-2022 miod

Replace two (void)copystr(..., NULL) with equivalent (void)strlcpy() calls.

ok millert@


Revision tags: OPENBSD_7_2_BASE
# 1.317 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


# 1.316 12-Aug-2022 visa

Put more struct vnode fields under splbio().

Buffer cache related struct vnode fields can be accessed in interrupt
context. Be more consistent with the use of splbio().

OK mpi@


Revision tags: OPENBSD_7_1_BASE
# 1.315 27-Mar-2022 semarie

sys/vnode.h cleanup for vnode_hold_list, vnode_free_list, struct freelst

vnode_hold_list and vnode_free_list aren't used outside kern/vfs_subr.c

move `struct freelst` where used in kern/vfs_subr.c

no intented behaviour changes. survived a release(8) build.

ok millert@


# 1.314 25-Jan-2022 gnezdo

Capture a repeated pattern into sysctl_securelevel_int function

A few variables in the kernel are only writeable before securelevel is
raised. It makes sense to handle them with less code.

OK sthen@ bluhm@


# 1.313 25-Oct-2021 claudio

Revert commitid: ufM9BcSbXqfLpzBH;
Move vfs_stall_barrier() from the fd layer into vn_lock() and the vfs layer.
In some cases it can result in a deadlock while suspending.
Discussed with mpi@ and deraadt@


# 1.312 24-Oct-2021 jsg

use NULL not 0 for pointer values in kern
ok semarie@


# 1.311 23-Oct-2021 mpi

Sprinkle uvm_obj_destroy() over UVM object recycling code.

For now, only assert that the tree of pages is empty in uvm_obj_destroy().
This will soon be used to free the per-UVM object lock.

While here call uvm_obj_init() when new vnodes are allocated instead of
in uvn_attach(). Because vnodes and there associated UVM object are
currently never freed, it isn't easy to know where/when to garbage
collect the associated lock. So simply check that the reference of a
given object is 0 when uvn_attach().

Tested by many as part of a bigger diff.

ok kettenis@


# 1.310 23-Oct-2021 mpi

Assert that the KERNEL_LOCK() is held in vref(9).

This is a guard against pushing the lock too far in UVM's vnode land.

ok beck@


# 1.309 21-Oct-2021 claudio

Move vfs_stall_barrier() from the fd layer into vn_lock() and the vfs layer.
vfs stalling is used by suspend/resume and by vmt(4) to stall any
filesystem operation from altering the state on disk. All these
operations will call vn_lock and be stalled. Adjust vfs_stall_barrier()
to allow the lock owner to still progress so that suspend can sync
the filesystems after stalling vfs operation.
OK mpi@


# 1.308 20-Oct-2021 semarie

revert vnode: remove VLOCKSWORK and check locking when vop_islocked != nullop
(both kernel and userland bits)

GENERIC + VFSLCKDEBUG is broken with it.


# 1.307 19-Oct-2021 semarie

vnode: remove VLOCKSWORK and check locking when vop_islocked != nullop

This flag is currently used to mark or unmark a vnode to actively
check vnode locking semantic (when compiled with VFSLCKDEBUG).

Currently, VLOCKSWORK flag isn't properly set for several FS
implementation which have full locking support. This commit enable
proper checking for them too (cd9660, udf, fuse, msdosfs, tmpfs).

Instead of using a particular flag, it directly check if
v_op->vop_islocked is nullop or not to activate or not the vnode
locking checks.

ok mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.306 31-Aug-2021 claudio

Swap lock flags so that LK_EXCLUSIVE is first like in all other places.


# 1.305 28-Apr-2021 claudio

Introduce a global vnode_mtx and use it to make vn_lock() safe to be called
without the KERNEL_LOCK.
This moves VXLOCK and VXWANT to a mutex protected v_lflag field and also
v_lockcount is protected by this mutex.

The vn_lock() dance is overly complex and all of this should probably replaced
by a proper lock on the vnode but such a diff is a lot more complex. This
is an intermediate step so that at least some calls can be modified to grab
the KERNEL_LOCK later or not at all.

OK mpi@


Revision tags: OPENBSD_6_9_BASE
# 1.304 29-Jan-2021 claudio

Use NULL instead of 0 to clear v_socket pointer (which actually clears all
of the v_un pointers).
OK jsg@ mvs@


Revision tags: OPENBSD_6_8_BASE
# 1.303 23-Aug-2020 kn

Remove unused debug_syncprt, improve debug sysctl handling

"syncprt" is unused since kern/vfs_syscalls.c r1.147 from 2008.

Adding new debug sysctls is a bit opaque and looking at kern/kern_sysctl.c
the only visible difference between used and stub ctldebug structs in the
debugvars[] array is their extern keyword, indicating that it is defined
elsewhere.

sys/sysctl.h declares all debugN members as extern upfront, but these
declarations are not needed.

Remove the unused debug sysctl, rename the only remaining one to something
meaningful and remove forward declarations from /sys/sysctl.h; this way,
adding new debug sysctls is a matter of adding extern and coming up with a
name, which is nicer to read on its own and better to grep for.

OK mpi


# 1.302 22-Aug-2020 kn

Move sysctl(2) CTL_DEBUG from DEBUG to new DEBUG_SYSCTL

Adding "debug.my-knob" sysctls is really helpful to select different
code paths and/or log on demand during runtime without recompile,
but as this code is under DEBUG, lots of other noise comes with it
which is often undesired, at least when looking at specific subsystems
only.

Adding globals to the kernel and breaking into DDB to change them helps,
but that does not work over SSH, hence the need for debug sysctls.

Introduces DEBUG_SYSCTL to make use of the "debug" MIB without the rest of
DEBUG; it's DEBUG_SYSCTL and not SYSCTL_DEBUG because it's not a general
option for all of sysctl(2).

OK gnezdo


Revision tags: OPENBSD_6_7_BASE
# 1.301 27-Mar-2020 anton

Relax the lockcount assertion in vputonfreelist(). Back when I fixed
several problems with the vnode exclusive lock implementation, I
overlooked the fact that a vnode can be in a state where the usecount is
zero while the holdcount still being positive. There could still be
threads waiting on the vnode lock in uvn_io() as long as the holdcount
is positive.

"go ahead" mpi@

Reported-by: syzbot+767d6deb1a647850a0ca@syzkaller.appspotmail.com


# 1.300 13-Feb-2020 claudio

Move the LK_DRAIN logic from VOP_LOCK() to vclean() the only caller of
VOP_LOCK with LK_DRAIN. This simplifies VOP_LOCK() a fair bit.
OK visa@


# 1.299 20-Jan-2020 claudio

struct vops is not modified during runtime so use const which moves each
into read-only data segment.
OK deraadt@ tedu@


# 1.298 10-Jan-2020 bluhm

Convert the vnode list at the mount point into a tailq. During
unmount this list is traversed and the dirty vnodes are flushed to
disk. Forced unmount expects that the list is empty after flushing,
otherwise the kernel panics with "dangling vnode". As the write
to disk can sleep, new vnodes may be inserted. If softdep is
enabled, resolving the dependencies creates new dirty vnodes and
inserts them to the list. To fix the panic, let insmntque() insert
new vnodes at the tail of the list. Then vflush() will still catch
them while traversing the list in forward direction.
OK tedu@ millert@ visa@


# 1.297 30-Dec-2019 bluhm

In vcount() a safe loop over vnodes was commited to 4.4BSD in 1994.
This is not necessary as the loop is restarted after vgone(). Switch
to SLIST_FOREACH without _SAFE.
OK visa@


# 1.296 27-Dec-2019 bluhm

Convert the speclisth hash buckets into SLIST macros. This makes
the vnode alias code more readable.
OK visa@


# 1.295 26-Dec-2019 bluhm

Fix white spaces.


# 1.294 08-Dec-2019 mpi

Convert infinite sleeps to tsleep_nsec(9).

ok visa@, jca@


Revision tags: OPENBSD_6_6_BASE
# 1.293 26-Aug-2019 anton

When a thread tries to exclusively lock a vnode, the same thread must
ensure that any other thread currently trying to acquire the underlying
vnode lock has observed that the same vnode is about to be exclusively
locked. Such threads must then sleep until the exclusive lock has been
released and then try to acquire the lock again. Otherwise, exclusive
access to the vnode cannot be guaranteed.

Thanks to naddy@ and visa@ for testing; ok visa@

Reported-by: syzbot+374d0e7e2400004957f7@syzkaller.appspotmail.com


# 1.292 25-Jul-2019 cheloha

vinvalbuf(9): tlseep -> tsleep_nsec(9); ok millert@


# 1.291 19-Jul-2019 cheloha

vwaitforio(9): tsleep(9) -> tsleep_nsec(9); ok visa@


# 1.290 28-Jun-2019 visa

Skip VFS barrier lock during normal operation to reduce overhead.
This removes a system-wide serialization point, which might help
finding timing-related bugs.

OK deraadt@ anton@


# 1.289 09-Jun-2019 beck

Add a temporary workaround to make removal of giant files better

mlarkin@ noticed we would freeze while removing enormous files because
of the amount of work done to invalidate buffers on unlink. This adds
a temporary workaround to ensure we give up the lock and yield while
doing this.

The longer term answer will be to move these buffers to another list
and not do the work here.

ok deraadt@


# 1.288 19-Apr-2019 visa

Add a subsystem lock for vfs_lockf.c. This enables calling lf_advlock()
and lf_purgelocks() without the kernel lock.

OK anton@ mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.287 02-Apr-2019 visa

Restrict which filesystems are available for swap. This rules out
obvious misconfigurations that cannot work.

OK mpi@ tedu@


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentaly considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules are no longer, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverser order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH SAVE macro is not save.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAVE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting files systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Syncronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs adn add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is seperate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actullay freed. Add a comment that this
race has to be fixed later.
OK krw@


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in byte, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used an die.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and there for the expoerted lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode defintion, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be ofended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continues performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__((__format__(__kprintf__))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@
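
A minimal sketch of the check described above, assuming a simplified credential layout. `demo_cred` and `demo_groupmember` are hypothetical names used only for illustration; this is not the kernel's struct ucred or its groupmember() implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

#define DEMO_NGROUPS 16

/* Hypothetical, simplified credential; not the kernel's struct ucred. */
struct demo_cred {
	gid_t	egid;			/* effective gid */
	gid_t	groups[DEMO_NGROUPS];	/* supplementary groups */
	size_t	ngroups;		/* entries used in groups[] */
};

/* Return true if gid is the effective gid or a supplementary group. */
bool
demo_groupmember(gid_t gid, const struct demo_cred *cr)
{
	size_t i;

	assert(cr->ngroups <= DEMO_NGROUPS);

	/* Check the effective gid first, so it counts even when it is
	 * not repeated in the supplementary list. */
	if (cr->egid == gid)
		return true;
	for (i = 0; i < cr->ngroups; i++) {
		if (cr->groups[i] == gid)
			return true;
	}
	return false;
}
```

Without the first comparison, a process whose effective gid is absent from its supplementary list would fail the membership test for its own effective group.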


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by an order of magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in filesystems ported from FreeBSD. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

Back out the change that made buf_acquire() do the bremfree(), which went
in because all callers were doing bremfree() before calling buf_acquire().

This is causing us headaches pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8 code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stata
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard-coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow pulling out a USB stick with an ffs filesystem while it is mounted
and a file is being written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers forever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@
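
To illustrate the mismatch in general terms: copying two members out back to back can produce a different layout than the compiler's struct, because the struct may contain padding. The sketch below is a hedged userland model, not the KERN_VNODE code; `demo_node`, `demo_entry` and the helper functions are hypothetical names. Producer and consumer agree on the layout because both use the one shared struct definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical payload and record; both sides share these definitions. */
struct demo_node {
	uint64_t size;
};

struct demo_entry {
	uint32_t		handle;	/* padding may follow this member */
	struct demo_node	node;
};

/* Bytes used if the two members were copied out back to back. */
size_t
demo_packed_size(void)
{
	return sizeof(uint32_t) + sizeof(struct demo_node);
}

/* Producer: fill the shared struct and copy it out whole, padding included.
 * Returns the number of bytes written, or 0 if the buffer is too small. */
size_t
demo_export(void *buf, size_t len, uint32_t handle, uint64_t size)
{
	struct demo_entry e;

	assert(buf != NULL);
	if (len < sizeof(e))
		return 0;
	memset(&e, 0, sizeof(e));	/* keep padding bytes deterministic */
	e.handle = handle;
	e.node.size = size;
	memcpy(buf, &e, sizeof(e));
	return sizeof(e);
}

/* Consumer: read the record back through the same struct definition. */
uint64_t
demo_import_size(const void *buf)
{
	struct demo_entry e;

	memcpy(&e, buf, sizeof(e));
	return e.node.size;
}
```

On ABIs where uint64_t is 8-byte aligned, sizeof(struct demo_entry) exceeds demo_packed_size(), which is the kind of disagreement a hand-packed producer and a struct-based consumer would run into.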


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into an own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing length checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from ours (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches from McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a file's owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.317 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


# 1.316 12-Aug-2022 visa

Put more struct vnode fields under splbio().

Buffer cache related struct vnode fields can be accessed in interrupt
context. Be more consistent with the use of splbio().

OK mpi@


Revision tags: OPENBSD_7_1_BASE
# 1.315 27-Mar-2022 semarie

sys/vnode.h cleanup for vnode_hold_list, vnode_free_list, struct freelst

vnode_hold_list and vnode_free_list aren't used outside kern/vfs_subr.c

move `struct freelst` where used in kern/vfs_subr.c

no intended behaviour changes. survived a release(8) build.

ok millert@


# 1.314 25-Jan-2022 gnezdo

Capture a repeated pattern into sysctl_securelevel_int function

A few variables in the kernel are only writeable before securelevel is
raised. It makes sense to handle them with less code.

OK sthen@ bluhm@


# 1.313 25-Oct-2021 claudio

Revert commitid: ufM9BcSbXqfLpzBH;
Move vfs_stall_barrier() from the fd layer into vn_lock() and the vfs layer.
In some cases it can result in a deadlock while suspending.
Discussed with mpi@ and deraadt@


# 1.312 24-Oct-2021 jsg

use NULL not 0 for pointer values in kern
ok semarie@


# 1.311 23-Oct-2021 mpi

Sprinkle uvm_obj_destroy() over UVM object recycling code.

For now, only assert that the tree of pages is empty in uvm_obj_destroy().
This will soon be used to free the per-UVM object lock.

While here call uvm_obj_init() when new vnodes are allocated instead of
in uvn_attach(). Because vnodes and their associated UVM object are
currently never freed, it isn't easy to know where/when to garbage
collect the associated lock. So simply check that the reference of a
given object is 0 when uvn_attach().

Tested by many as part of a bigger diff.

ok kettenis@


# 1.310 23-Oct-2021 mpi

Assert that the KERNEL_LOCK() is held in vref(9).

This is a guard against pushing the lock too far in UVM's vnode land.

ok beck@


# 1.309 21-Oct-2021 claudio

Move vfs_stall_barrier() from the fd layer into vn_lock() and the vfs layer.
vfs stalling is used by suspend/resume and by vmt(4) to stall any
filesystem operation from altering the state on disk. All these
operations will call vn_lock and be stalled. Adjust vfs_stall_barrier()
to allow the lock owner to still progress so that suspend can sync
the filesystems after stalling vfs operation.
OK mpi@


# 1.308 20-Oct-2021 semarie

revert vnode: remove VLOCKSWORK and check locking when vop_islocked != nullop
(both kernel and userland bits)

GENERIC + VFSLCKDEBUG is broken with it.


# 1.307 19-Oct-2021 semarie

vnode: remove VLOCKSWORK and check locking when vop_islocked != nullop

This flag is currently used to mark or unmark a vnode to actively
check vnode locking semantic (when compiled with VFSLCKDEBUG).

Currently, VLOCKSWORK flag isn't properly set for several FS
implementations which have full locking support. This commit enables
proper checking for them too (cd9660, udf, fuse, msdosfs, tmpfs).

Instead of using a particular flag, it directly checks if
v_op->vop_islocked is nullop or not to activate or not the vnode
locking checks.

ok mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.306 31-Aug-2021 claudio

Swap lock flags so that LK_EXCLUSIVE is first like in all other places.


# 1.305 28-Apr-2021 claudio

Introduce a global vnode_mtx and use it to make vn_lock() safe to be called
without the KERNEL_LOCK.
This moves VXLOCK and VXWANT to a mutex protected v_lflag field and also
v_lockcount is protected by this mutex.

The vn_lock() dance is overly complex and all of this should probably be replaced
by a proper lock on the vnode but such a diff is a lot more complex. This
is an intermediate step so that at least some calls can be modified to grab
the KERNEL_LOCK later or not at all.

OK mpi@


Revision tags: OPENBSD_6_9_BASE
# 1.304 29-Jan-2021 claudio

Use NULL instead of 0 to clear v_socket pointer (which actually clears all
of the v_un pointers).
OK jsg@ mvs@


Revision tags: OPENBSD_6_8_BASE
# 1.303 23-Aug-2020 kn

Remove unused debug_syncprt, improve debug sysctl handling

"syncprt" is unused since kern/vfs_syscalls.c r1.147 from 2008.

Adding new debug sysctls is a bit opaque and looking at kern/kern_sysctl.c
the only visible difference between used and stub ctldebug structs in the
debugvars[] array is their extern keyword, indicating that it is defined
elsewhere.

sys/sysctl.h declares all debugN members as extern upfront, but these
declarations are not needed.

Remove the unused debug sysctl, rename the only remaining one to something
meaningful and remove forward declarations from /sys/sysctl.h; this way,
adding new debug sysctls is a matter of adding extern and coming up with a
name, which is nicer to read on its own and better to grep for.

OK mpi


# 1.302 22-Aug-2020 kn

Move sysctl(2) CTL_DEBUG from DEBUG to new DEBUG_SYSCTL

Adding "debug.my-knob" sysctls is really helpful to select different
code paths and/or log on demand during runtime without recompile,
but as this code is under DEBUG, lots of other noise comes with it
which is often undesired, at least when looking at specific subsystems
only.

Adding globals to the kernel and breaking into DDB to change them helps,
but that does not work over SSH, hence the need for debug sysctls.

Introduces DEBUG_SYSCTL to make use of the "debug" MIB without the rest of
DEBUG; it's DEBUG_SYSCTL and not SYSCTL_DEBUG because it's not a general
option for all of sysctl(2).

OK gnezdo


Revision tags: OPENBSD_6_7_BASE
# 1.301 27-Mar-2020 anton

Relax the lockcount assertion in vputonfreelist(). Back when I fixed
several problems with the vnode exclusive lock implementation, I
overlooked the fact that a vnode can be in a state where the usecount is
zero while the holdcount is still positive. There could still be
threads waiting on the vnode lock in uvn_io() as long as the holdcount
is positive.

"go ahead" mpi@

Reported-by: syzbot+767d6deb1a647850a0ca@syzkaller.appspotmail.com


# 1.300 13-Feb-2020 claudio

Move the LK_DRAIN logic from VOP_LOCK() to vclean() the only caller of
VOP_LOCK with LK_DRAIN. This simplifies VOP_LOCK() a fair bit.
OK visa@


# 1.299 20-Jan-2020 claudio

struct vops is not modified during runtime so use const which moves each
into read-only data segment.
OK deraadt@ tedu@


# 1.298 10-Jan-2020 bluhm

Convert the vnode list at the mount point into a tailq. During
unmount this list is traversed and the dirty vnodes are flushed to
disk. Forced unmount expects that the list is empty after flushing,
otherwise the kernel panics with "dangling vnode". As the write
to disk can sleep, new vnodes may be inserted. If softdep is
enabled, resolving the dependencies creates new dirty vnodes and
inserts them to the list. To fix the panic, let insmntque() insert
new vnodes at the tail of the list. Then vflush() will still catch
them while traversing the list in forward direction.
OK tedu@ millert@ visa@


# 1.297 30-Dec-2019 bluhm

In vcount() a safe loop over vnodes was committed to 4.4BSD in 1994.
This is not necessary as the loop is restarted after vgone(). Switch
to SLIST_FOREACH without _SAFE.
OK visa@


# 1.296 27-Dec-2019 bluhm

Convert the speclisth hash buckets into SLIST macros. This makes
the vnode alias code more readable.
OK visa@


# 1.295 26-Dec-2019 bluhm

Fix white spaces.


# 1.294 08-Dec-2019 mpi

Convert infinite sleeps to tsleep_nsec(9).

ok visa@, jca@


Revision tags: OPENBSD_6_6_BASE
# 1.293 26-Aug-2019 anton

When a thread tries to exclusively lock a vnode, the same thread must
ensure that any other thread currently trying to acquire the underlying
vnode lock has observed that the same vnode is about to be exclusively
locked. Such threads must then sleep until the exclusive lock has been
released and then try to acquire the lock again. Otherwise, exclusive
access to the vnode cannot be guaranteed.

Thanks to naddy@ and visa@ for testing; ok visa@

Reported-by: syzbot+374d0e7e2400004957f7@syzkaller.appspotmail.com


# 1.292 25-Jul-2019 cheloha

vinvalbuf(9): tsleep -> tsleep_nsec(9); ok millert@


# 1.291 19-Jul-2019 cheloha

vwaitforio(9): tsleep(9) -> tsleep_nsec(9); ok visa@


# 1.290 28-Jun-2019 visa

Skip VFS barrier lock during normal operation to reduce overhead.
This removes a system-wide serialization point, which might help
finding timing-related bugs.

OK deraadt@ anton@


# 1.289 09-Jun-2019 beck

Add a temporary workaround to make removal of giant files better

mlarkin@ noticed we would freeze while removing enormous files because
of the amount of work done to invalidate buffers on unlink. This adds
a temporary workaround to ensure we give up the lock and yield while
doing this.

The longer term answer will be to move these buffers to another list
and not do the work here.

ok deraadt@


# 1.288 19-Apr-2019 visa

Add a subsystem lock for vfs_lockf.c. This enables calling lf_advlock()
and lf_purgelocks() without the kernel lock.

OK anton@ mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.287 02-Apr-2019 visa

Restrict which filesystems are available for swap. This rules out
obvious misconfigurations that cannot work.

OK mpi@ tedu@


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa
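
A minimal sketch of the sticky-error idea described above, assuming nothing about the real implementation: `toy_vnode`, `toy_write()`, and `toy_fsync()` are hypothetical names, and the actual flag name and where it is checked in the kernel are not shown in this log.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a vnode; not the kernel's struct vnode. */
struct toy_vnode {
	bool write_error_seen;	/* sticky: set once a write has failed */
};

/*
 * Simulate a buffer write. On failure the buffer would be thrown away,
 * so the only lasting trace of the error is the flag on the vnode.
 */
static int
toy_write(struct toy_vnode *vp, bool device_ok)
{
	if (!device_ok) {
		vp->write_error_seen = true;
		return EIO;
	}
	return 0;
}

/* A later fsync reports the remembered error instead of success. */
static int
toy_fsync(const struct toy_vnode *vp)
{
	return vp->write_error_seen ? EIO : 0;
}
```

Without the flag, the second call in a sequence "failed write, then fsync" would find no dirty buffers and return success, which is the lost-error case the commit message describes.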


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules are no longer, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@
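
Why a cached next pointer is hazardous when the loop body can sleep can be sketched in userland. This is an illustration only: `toy_mount`, `toy_unmount()`, and `toy_walk()` are hypothetical, the "unmount" merely unlinks a node and marks it dead instead of freeing it, and the concurrent process is simulated inline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a mount point on a singly linked list. */
struct toy_mount {
	struct toy_mount *next;
	bool dead;		/* set once the node has been "unmounted" */
};

/* Unlink victim from the list and mark it dead (a real kernel would free it). */
static void
toy_unmount(struct toy_mount **head, struct toy_mount *victim)
{
	struct toy_mount **pp;

	for (pp = head; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == victim) {
			*pp = victim->next;
			victim->dead = true;
			return;
		}
	}
}

/*
 * Walk the list a -> b -> c. While the body handles a, node b is
 * unmounted behind our back. With cache_next set, the loop advances
 * through a pointer saved before the body ran, as a _SAFE style macro
 * does. Returns how many dead nodes the walk touched.
 */
static int
toy_walk(bool cache_next)
{
	struct toy_mount c = { NULL, false };
	struct toy_mount b = { &c, false };
	struct toy_mount a = { &b, false };
	struct toy_mount *head = &a, *mp, *nmp;
	int dead_visits = 0;

	for (mp = head; mp != NULL; mp = cache_next ? nmp : mp->next) {
		nmp = mp->next;		/* what a _SAFE macro remembers */
		if (mp->dead)
			dead_visits++;
		/* The body "sleeps"; meanwhile b gets unmounted. */
		if (!b.dead)
			toy_unmount(&head, &b);
	}
	return dead_visits;
}
```

In the sketch the cached-next variant steps onto the stale node, while re-reading the link from the still-protected current node skips it. In a kernel the stale node could already have been freed.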


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb
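
A separately allocated clone bitmap of the size mentioned above might look roughly like the following sketch. It is not the kernel's code: `TOY_NCLONES`, `toy_bitmap_alloc()`, `toy_clone_get()`, and `toy_clone_put()` are hypothetical names, and only the 1024-entry limit is taken from the commit message.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define TOY_NCLONES	1024	/* limit named in the commit message */
#define TOY_WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define TOY_NWORDS	(TOY_NCLONES / TOY_WORD_BITS)

/* Allocate the bitmap on demand so it does not bloat every device node. */
static unsigned long *
toy_bitmap_alloc(void)
{
	return calloc(TOY_NWORDS, sizeof(unsigned long));
}

/* Hand out the lowest free clone number, or -1 if all are in use. */
static int
toy_clone_get(unsigned long *bm)
{
	size_t i;

	for (i = 0; i < TOY_NCLONES; i++) {
		unsigned long bit = 1UL << (i % TOY_WORD_BITS);

		if ((bm[i / TOY_WORD_BITS] & bit) == 0) {
			bm[i / TOY_WORD_BITS] |= bit;
			return (int)i;
		}
	}
	return -1;
}

/* Release a clone number so it can be handed out again. */
static void
toy_clone_put(unsigned long *bm, int n)
{
	assert(n >= 0 && n < TOY_NCLONES);
	bm[(size_t)n / TOY_WORD_BITS] &= ~(1UL << ((size_t)n % TOY_WORD_BITS));
}
```

Keeping only a pointer in the per-device structure and allocating the 128-byte bitmap when cloning is actually used is one way to raise the limit without growing every node.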


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
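
The fragility of chained assignments across differently sized and signed fields can be shown with a small sketch. The struct and the `TOY_NOVAL` marker are hypothetical and far simpler than struct vattr; the point is only that the value of `x = y = v` passed on to `x` is the already converted value of `y`.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NOVAL	(-1)	/* hypothetical "no value" marker */

struct toy_attr {
	uint16_t small;		/* narrow unsigned field */
	int64_t  big;		/* wide signed field */
};

/*
 * Chained: big receives the value of (small = -1), which has already
 * been converted to uint16_t, so big ends up as 65535 rather than -1.
 */
static void
toy_null_chained(struct toy_attr *a)
{
	a->big = a->small = TOY_NOVAL;
}

/* Separate statements: each field is converted from -1 on its own. */
static void
toy_null_separate(struct toy_attr *a)
{
	a->small = TOY_NOVAL;
	a->big = TOY_NOVAL;
}
```

Reversing the order of the chain (`a->small = a->big = TOY_NOVAL`) happens to give the intended result here, which is the kind of order dependence the commit message calls fragile.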


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in bytes, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and can die.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continuing performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell, never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8 code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow pulling out a usb stick with an ffs filesystem while it is mounted
and a file is being written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers forever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, so as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information about the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return whether it actually succeeded in freeing some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into its own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing lengths checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from ours (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches from McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a file's owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.310 23-Oct-2021 mpi

Assert that the KERNEL_LOCK() is held in vref(9).

This is a guard against pushing the lock too far in UVM's vnode land.

ok beck@


# 1.309 21-Oct-2021 claudio

Move vfs_stall_barrier() from the fd layer into vn_lock() and the vfs layer.
vfs stalling is used by suspend/resume and by vmt(4) to stall any
filesystem operation from altering the state on disk. All these
operations will call vn_lock and be stalled. Adjust vfs_stall_barrier()
to allow the lock owner to still progress so that suspend can sync
the filesystems after stalling vfs operation.
OK mpi@


# 1.308 20-Oct-2021 semarie

revert vnode: remove VLOCKSWORK and check locking when vop_islocked != nullop
(both kernel and userland bits)

GENERIC + VFSLCKDEBUG is broken with it.


# 1.307 19-Oct-2021 semarie

vnode: remove VLOCKSWORK and check locking when vop_islocked != nullop

This flag is currently used to mark or unmark a vnode to actively
check vnode locking semantics (when compiled with VFSLCKDEBUG).

Currently, the VLOCKSWORK flag isn't properly set for several FS
implementations which have full locking support. This commit enables
proper checking for them too (cd9660, udf, fuse, msdosfs, tmpfs).

Instead of using a particular flag, it directly checks whether
v_op->vop_islocked is nullop to decide whether to activate the vnode
locking checks.

ok mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.306 31-Aug-2021 claudio

Swap lock flags so that LK_EXCLUSIVE is first like in all other places.


# 1.305 28-Apr-2021 claudio

Introduce a global vnode_mtx and use it to make vn_lock() safe to be called
without the KERNEL_LOCK.
This moves VXLOCK and VXWANT to a mutex protected v_lflag field and also
v_lockcount is protected by this mutex.

The vn_lock() dance is overly complex and all of this should probably be replaced
by a proper lock on the vnode but such a diff is a lot more complex. This
is an intermediate step so that at least some calls can be modified to grab
the KERNEL_LOCK later or not at all.

OK mpi@


Revision tags: OPENBSD_6_9_BASE
# 1.304 29-Jan-2021 claudio

Use NULL instead of 0 to clear v_socket pointer (which actually clears all
of the v_un pointers).
OK jsg@ mvs@


Revision tags: OPENBSD_6_8_BASE
# 1.303 23-Aug-2020 kn

Remove unused debug_syncprt, improve debug sysctl handling

"syncprt" is unused since kern/vfs_syscalls.c r1.147 from 2008.

Adding new debug sysctls is a bit opaque and looking at kern/kern_sysctl.c
the only visible difference between used and stub ctldebug structs in the
debugvars[] array is their extern keyword, indicating that it is defined
elsewhere.

sys/sysctl.h declares all debugN members as extern upfront, but these
declarations are not needed.

Remove the unused debug sysctl, rename the only remaining one to something
meaningful and remove forward declarations from /sys/sysctl.h; this way,
adding new debug sysctls is a matter of adding extern and coming up with a
name, which is nicer to read on its own and better to grep for.

OK mpi


# 1.302 22-Aug-2020 kn

Move sysctl(2) CTL_DEBUG from DEBUG to new DEBUG_SYSCTL

Adding "debug.my-knob" sysctls is really helpful to select different
code paths and/or log on demand during runtime without recompile,
but as this code is under DEBUG, lots of other noise comes with it
which is often undesired, at least when looking at specific subsystems
only.

Adding globals to the kernel and breaking into DDB to change them helps,
but that does not work over SSH, hence the need for debug sysctls.

Introduces DEBUG_SYSCTL to make use of the "debug" MIB without the rest of
DEBUG; it's DEBUG_SYSCTL and not SYSCTL_DEBUG because it's not a general
option for all of sysctl(2).

OK gnezdo


Revision tags: OPENBSD_6_7_BASE
# 1.301 27-Mar-2020 anton

Relax the lockcount assertion in vputonfreelist(). Back when I fixed
several problems with the vnode exclusive lock implementation, I
overlooked the fact that a vnode can be in a state where the usecount is
zero while the holdcount is still positive. There could still be
threads waiting on the vnode lock in uvn_io() as long as the holdcount
is positive.

"go ahead" mpi@

Reported-by: syzbot+767d6deb1a647850a0ca@syzkaller.appspotmail.com


# 1.300 13-Feb-2020 claudio

Move the LK_DRAIN logic from VOP_LOCK() to vclean() the only caller of
VOP_LOCK with LK_DRAIN. This simplifies VOP_LOCK() a fair bit.
OK visa@


# 1.299 20-Jan-2020 claudio

struct vops is not modified during runtime so use const which moves each
into read-only data segment.
OK deraadt@ tedu@


# 1.298 10-Jan-2020 bluhm

Convert the vnode list at the mount point into a tailq. During
unmount this list is traversed and the dirty vnodes are flushed to
disk. Forced unmount expects that the list is empty after flushing,
otherwise the kernel panics with "dangling vnode". As the write
to disk can sleep, new vnodes may be inserted. If softdep is
enabled, resolving the dependencies creates new dirty vnodes and
inserts them to the list. To fix the panic, let insmntque() insert
new vnodes at the tail of the list. Then vflush() will still catch
them while traversing the list in forward direction.
OK tedu@ millert@ visa@


# 1.297 30-Dec-2019 bluhm

In vcount() a safe loop over vnodes was committed to 4.4BSD in 1994.
This is not necessary as the loop is restarted after vgone(). Switch
to SLIST_FOREACH without _SAFE.
OK visa@


# 1.296 27-Dec-2019 bluhm

Convert the speclisth hash buckets into SLIST macros. This makes
the vnode alias code more readable.
OK visa@


# 1.295 26-Dec-2019 bluhm

Fix white spaces.


# 1.294 08-Dec-2019 mpi

Convert infinite sleeps to tsleep_nsec(9).

ok visa@, jca@


Revision tags: OPENBSD_6_6_BASE
# 1.293 26-Aug-2019 anton

When a thread tries to exclusively lock a vnode, the same thread must
ensure that any other thread currently trying to acquire the underlying
vnode lock has observed that the same vnode is about to be exclusively
locked. Such threads must then sleep until the exclusive lock has been
released and then try to acquire the lock again. Otherwise, exclusive
access to the vnode cannot be guaranteed.

Thanks to naddy@ and visa@ for testing; ok visa@

Reported-by: syzbot+374d0e7e2400004957f7@syzkaller.appspotmail.com


# 1.292 25-Jul-2019 cheloha

vinvalbuf(9): tsleep -> tsleep_nsec(9); ok millert@


# 1.291 19-Jul-2019 cheloha

vwaitforio(9): tsleep(9) -> tsleep_nsec(9); ok visa@


# 1.290 28-Jun-2019 visa

Skip VFS barrier lock during normal operation to reduce overhead.
This removes a system-wide serialization point, which might help
finding timing-related bugs.

OK deraadt@ anton@


# 1.289 09-Jun-2019 beck

Add a temporary workaround to make removal of giant files better

mlarkin@ noticed we would freeze while removing enormous files because
of the amount of work done to invalidate buffers on unlink. This adds
a temporary workaround to ensure we give up the lock and yield while
doing this.

The longer term answer will be to move these buffers to another list
and not do the work here.

ok deraadt@


# 1.288 19-Apr-2019 visa

Add a subsystem lock for vfs_lockf.c. This enables calling lf_advlock()
and lf_purgelocks() without the kernel lock.

OK anton@ mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.287 02-Apr-2019 visa

Restrict which filesystems are available for swap. This rules out
obvious misconfigurations that cannot work.

OK mpi@ tedu@


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules are no longer supported, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in bytes, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and can die.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continuing performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__)))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used in checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8 code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stata
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard-coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow pulling out a usb stick with an ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into an own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing length checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from ours (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches from McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a file's owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.315 27-Mar-2022 semarie

sys/vnode.h cleanup for vnode_hold_list, vnode_free_list, struct freelst

vnode_hold_list and vnode_free_list aren't used outside kern/vfs_subr.c

move `struct freelst` where used in kern/vfs_subr.c

no intended behaviour changes. survived a release(8) build.

ok millert@


# 1.314 25-Jan-2022 gnezdo

Capture a repeated pattern into sysctl_securelevel_int function

A few variables in the kernel are only writeable before securelevel is
raised. It makes sense to handle them with less code.

OK sthen@ bluhm@


# 1.313 25-Oct-2021 claudio

Revert commitid: ufM9BcSbXqfLpzBH;
Move vfs_stall_barrier() from the fd layer into vn_lock() and the vfs layer.
In some cases it can result in a deadlock while suspending.
Discussed with mpi@ and deraadt@


# 1.312 24-Oct-2021 jsg

use NULL not 0 for pointer values in kern
ok semarie@


# 1.311 23-Oct-2021 mpi

Sprinkle uvm_obj_destroy() over UVM object recycling code.

For now, only assert that the tree of pages is empty in uvm_obj_destroy().
This will soon be used to free the per-UVM object lock.

While here call uvm_obj_init() when new vnodes are allocated instead of
in uvn_attach(). Because vnodes and their associated UVM object are
currently never freed, it isn't easy to know where/when to garbage
collect the associated lock. So simply check that the reference of a
given object is 0 when uvn_attach().

Tested by many as part of a bigger diff.

ok kettenis@


# 1.310 23-Oct-2021 mpi

Assert that the KERNEL_LOCK() is held in vref(9).

This is a guard against pushing the lock too far in UVM's vnode land.

ok beck@


# 1.309 21-Oct-2021 claudio

Move vfs_stall_barrier() from the fd layer into vn_lock() and the vfs layer.
vfs stalling is used by suspend/resume and by vmt(4) to stall any
filesystem operation from altering the state on disk. All these
operations will call vn_lock and be stalled. Adjust vfs_stall_barrier()
to allow the lock owner to still progress so that suspend can sync
the filesystems after stalling vfs operation.
OK mpi@


# 1.308 20-Oct-2021 semarie

revert vnode: remove VLOCKSWORK and check locking when vop_islocked != nullop
(both kernel and userland bits)

GENERIC + VFSLCKDEBUG is broken with it.


# 1.307 19-Oct-2021 semarie

vnode: remove VLOCKSWORK and check locking when vop_islocked != nullop

This flag is currently used to mark or unmark a vnode to actively
check vnode locking semantics (when compiled with VFSLCKDEBUG).

Currently, the VLOCKSWORK flag isn't properly set for several FS
implementations which have full locking support. This commit enables
proper checking for them too (cd9660, udf, fuse, msdosfs, tmpfs).

Instead of using a particular flag, it directly checks whether
v_op->vop_islocked is nullop or not to decide whether to activate the
vnode locking checks.

ok mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.306 31-Aug-2021 claudio

Swap lock flags so that LK_EXCLUSIVE is first like in all other places.


# 1.305 28-Apr-2021 claudio

Introduce a global vnode_mtx and use it to make vn_lock() safe to be called
without the KERNEL_LOCK.
This moves VXLOCK and VXWANT to a mutex protected v_lflag field and also
v_lockcount is protected by this mutex.

The vn_lock() dance is overly complex and all of this should probably be replaced
by a proper lock on the vnode but such a diff is a lot more complex. This
is an intermediate step so that at least some calls can be modified to grab
the KERNEL_LOCK later or not at all.

OK mpi@


Revision tags: OPENBSD_6_9_BASE
# 1.304 29-Jan-2021 claudio

Use NULL instead of 0 to clear v_socket pointer (which actually clears all
of the v_un pointers).
OK jsg@ mvs@


Revision tags: OPENBSD_6_8_BASE
# 1.303 23-Aug-2020 kn

Remove unused debug_syncprt, improve debug sysctl handling

"syncprt" is unused since kern/vfs_syscalls.c r1.147 from 2008.

Adding new debug sysctls is a bit opaque and looking at kern/kern_sysctl.c
the only visible difference between used and stub ctldebug structs in the
debugvars[] array is their extern keyword, indicating that it is defined
elsewhere.

sys/sysctl.h declares all debugN members as extern upfront, but these
declarations are not needed.

Remove the unused debug sysctl, rename the only remaining one to something
meaningful and remove forward declarations from /sys/sysctl.h; this way,
adding new debug sysctls is a matter of adding extern and coming up with a
name, which is nicer to read on its own and better to grep for.

OK mpi


# 1.302 22-Aug-2020 kn

Move sysctl(2) CTL_DEBUG from DEBUG to new DEBUG_SYSCTL

Adding "debug.my-knob" sysctls is really helpful to select different
code paths and/or log on demand during runtime without recompile,
but as this code is under DEBUG, lots of other noise comes with it
which is often undesired, at least when looking at specific subsystems
only.

Adding globals to the kernel and breaking into DDB to change them helps,
but that does not work over SSH, hence the need for debug sysctls.

Introduces DEBUG_SYSCTL to make use of the "debug" MIB without the rest of
DEBUG; it's DEBUG_SYSCTL and not SYSCTL_DEBUG because it's not a general
option for all of sysctl(2).

OK gnezdo


Revision tags: OPENBSD_6_7_BASE
# 1.301 27-Mar-2020 anton

Relax the lockcount assertion in vputonfreelist(). Back when I fixed
several problems with the vnode exclusive lock implementation, I
overlooked the fact that a vnode can be in a state where the usecount is
zero while the holdcount is still positive. There could still be
threads waiting on the vnode lock in uvn_io() as long as the holdcount
is positive.

"go ahead" mpi@

Reported-by: syzbot+767d6deb1a647850a0ca@syzkaller.appspotmail.com


# 1.300 13-Feb-2020 claudio

Move the LK_DRAIN logic from VOP_LOCK() to vclean() the only caller of
VOP_LOCK with LK_DRAIN. This simplifies VOP_LOCK() a fair bit.
OK visa@


# 1.299 20-Jan-2020 claudio

struct vops is not modified during runtime so use const which moves each
into read-only data segment.
OK deraadt@ tedu@


# 1.298 10-Jan-2020 bluhm

Convert the vnode list at the mount point into a tailq. During
unmount this list is traversed and the dirty vnodes are flushed to
disk. Forced unmount expects that the list is empty after flushing,
otherwise the kernel panics with "dangling vnode". As the write
to disk can sleep, new vnodes may be inserted. If softdep is
enabled, resolving the dependencies creates new dirty vnodes and
inserts them to the list. To fix the panic, let insmntque() insert
new vnodes at the tail of the list. Then vflush() will still catch
them while traversing the list in forward direction.
OK tedu@ millert@ visa@
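
The effect of tail insertion on a forward traversal can be sketched in
userland C. This is only an illustration under stated assumptions: the
demo_* types and functions below are hypothetical stand-ins, not the
kernel's struct vnode, insmntque() or vflush(). A node inserted at the head
while the walk is in progress is missed, while a node inserted at the tail
is still reached.

```c
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's struct vnode or mount list. */
struct demo_vnode {
	struct demo_vnode *next;
	int dirty;
};

struct demo_list {
	struct demo_vnode *head;
	struct demo_vnode *tail;
};

static void
insert_head(struct demo_list *l, struct demo_vnode *vp)
{
	vp->next = l->head;
	l->head = vp;
	if (l->tail == NULL)
		l->tail = vp;
}

static void
insert_tail(struct demo_list *l, struct demo_vnode *vp)
{
	vp->next = NULL;
	if (l->tail != NULL)
		l->tail->next = vp;
	else
		l->head = vp;
	l->tail = vp;
}

/*
 * Clean every node in one forward pass. While the first node is being
 * cleaned, one new dirty node (extra) is inserted, standing in for a
 * vnode created while the write sleeps. Returns the number of nodes
 * that are still dirty after the pass.
 */
static int
flush_pass(struct demo_list *l, struct demo_vnode *extra, int at_tail)
{
	struct demo_vnode *vp;
	int inserted = 0, left = 0;

	for (vp = l->head; vp != NULL; vp = vp->next) {
		vp->dirty = 0;
		if (!inserted) {
			extra->dirty = 1;
			if (at_tail)
				insert_tail(l, extra);
			else
				insert_head(l, extra);
			inserted = 1;
		}
	}
	for (vp = l->head; vp != NULL; vp = vp->next)
		left += vp->dirty;
	return left;
}
```

With at_tail set, the pass ends with no dirty nodes left; with head
insertion the new node sits behind the cursor and stays dirty.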


# 1.297 30-Dec-2019 bluhm

In vcount() a safe loop over vnodes was committed to 4.4BSD in 1994.
This is not necessary as the loop is restarted after vgone(). Switch
to SLIST_FOREACH without _SAFE.
OK visa@


# 1.296 27-Dec-2019 bluhm

Convert the speclisth hash buckets into SLIST macros. This makes
the vnode alias code more readable.
OK visa@


# 1.295 26-Dec-2019 bluhm

Fix white spaces.


# 1.294 08-Dec-2019 mpi

Convert infinite sleeps to tsleep_nsec(9).

ok visa@, jca@


Revision tags: OPENBSD_6_6_BASE
# 1.293 26-Aug-2019 anton

When a thread tries to exclusively lock a vnode, the same thread must
ensure that any other thread currently trying to acquire the underlying
vnode lock has observed that the same vnode is about to be exclusively
locked. Such threads must then sleep until the exclusive lock has been
released and then try to acquire the lock again. Otherwise, exclusive
access to the vnode cannot be guaranteed.

Thanks to naddy@ and visa@ for testing; ok visa@

Reported-by: syzbot+374d0e7e2400004957f7@syzkaller.appspotmail.com


# 1.292 25-Jul-2019 cheloha

vinvalbuf(9): tsleep(9) -> tsleep_nsec(9); ok millert@


# 1.291 19-Jul-2019 cheloha

vwaitforio(9): tsleep(9) -> tsleep_nsec(9); ok visa@


# 1.290 28-Jun-2019 visa

Skip VFS barrier lock during normal operation to reduce overhead.
This removes a system-wide serialization point, which might help
finding timing-related bugs.

OK deraadt@ anton@


# 1.289 09-Jun-2019 beck

Add a temporary workaround to make removal of giant files better

mlarkin@ noticed we would freeze while removing enormous files because
of the amount of work done to invalidate buffers on unlink. This adds
a temporary workaround to ensure we give up the lock and yield while
doing this.

The longer term answer will be to move these buffers to another list
and not do the work here.

ok deraadt@


# 1.288 19-Apr-2019 visa

Add a subsystem lock for vfs_lockf.c. This enables calling lf_advlock()
and lf_purgelocks() without the kernel lock.

OK anton@ mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.287 02-Apr-2019 visa

Restrict which filesystems are available for swap. This rules out
obvious misconfigurations that cannot work.

OK mpi@ tedu@


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa
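
A minimal sketch of the idea, assuming a hypothetical flag name and
structure (DEMO_WRITE_FAILED and struct demo_vnode are illustrative, not
the kernel's identifiers). Whether and when the real flag is cleared is not
stated in the log entry; this sketch simply leaves it set.

```c
#include <errno.h>

#define DEMO_WRITE_FAILED	0x01	/* hypothetical flag name */

/* Hypothetical stand-in for the per-vnode state. */
struct demo_vnode {
	int flags;
};

/* Called when a write completes; remember any failure on the vnode. */
static void
demo_write_done(struct demo_vnode *vp, int error)
{
	if (error != 0)
		vp->flags |= DEMO_WRITE_FAILED;
}

/* A later fsync reports the remembered failure instead of success. */
static int
demo_fsync(struct demo_vnode *vp)
{
	if (vp->flags & DEMO_WRITE_FAILED)
		return EIO;
	return 0;
}
```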


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@
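
As a rough userland analogue only: the kernel has its own atomic
primitives, whereas this sketch uses C11 <stdatomic.h>. The struct and
function names are hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for a structure carrying a reference count. */
struct demo_vfsconf {
	atomic_uint refcount;
};

/* Take a reference; safe against concurrent callers. */
static void
demo_ref(struct demo_vfsconf *v)
{
	atomic_fetch_add(&v->refcount, 1);
}

/* Drop a reference and return the count that remains. */
static unsigned int
demo_unref(struct demo_vfsconf *v)
{
	/* atomic_fetch_sub() returns the value held before the subtraction. */
	return atomic_fetch_sub(&v->refcount, 1) - 1;
}
```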


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules are no longer supported, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@
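
The hazard described above can be sketched in userland C under stated
assumptions: struct node and the walk_* functions are hypothetical, and a
removed node is only marked rather than freed so that the example stays
well defined. A _SAFE-style loop caches the successor before the loop body
runs, so if the body sleeps and the successor is removed meanwhile, the
cached pointer is stale.

```c
#include <stddef.h>

/* Hypothetical stand-in for a mount point; not the kernel's struct mount. */
struct node {
	struct node *next;
	int unmounted;	/* set when "another process" removes the node */
};

/* Unlink the successor of n, as a concurrent unmount might while we sleep. */
static void
remove_successor(struct node *n)
{
	struct node *victim = n->next;

	if (victim != NULL) {
		n->next = victim->next;
		victim->unmounted = 1;	/* a real kernel would free it here */
	}
}

/* _SAFE-style walk: the successor is cached before the body runs. */
static int
walk_safe_style(struct node *head)
{
	struct node *n, *nn;
	int stale = 0;

	for (n = head; n != NULL; n = nn) {
		nn = n->next;			/* cached before the "sleep" */
		if (n == head)
			remove_successor(n);	/* list changes in the body */
		if (nn != NULL && nn->unmounted)
			stale++;		/* cached pointer is now stale */
	}
	return stale;
}

/* Plain walk: the successor is read only after the body has finished. */
static int
walk_plain(struct node *head)
{
	struct node *n;
	int stale = 0;

	for (n = head; n != NULL; n = n->next) {
		if (n == head)
			remove_successor(n);
		if (n->next != NULL && n->next->unmounted)
			stale++;
	}
	return stale;
}
```

On a three-node list the _SAFE-style walk ends up holding one stale
pointer, while the plain walk holds none because it re-reads the successor
after the list has changed.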


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
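
A small sketch of the pattern, using a hypothetical enum and table rather
than the identifiers touched by this commit. Casting to unsigned int turns
the check into an unsigned comparison, so an out-of-range value stored in
the enum (including a negative one) fails it.

```c
/* Hypothetical enum and name table, for illustration only. */
enum demo_type { DEMO_NON, DEMO_REG, DEMO_DIR, DEMO_COUNT };

static const char *const demo_names[DEMO_COUNT] = { "non", "reg", "dir" };

static const char *
demo_name(enum demo_type t)
{
	/* Unsigned comparison: negative or oversized values are rejected. */
	if ((unsigned int)t >= DEMO_COUNT)
		return "unknown";
	return demo_names[t];
}
```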


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
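
Why chained assignments are fragile can be shown with a two-field sketch.
The struct and field names here are hypothetical, not the real struct vattr
members. In C, a = b = -1 assigns to b first and then assigns the converted
value of b to a, so a narrow unsigned field in the chain changes what the
wider field receives.

```c
/* Hypothetical fields; not the real struct vattr layout. */
struct demo_attrs {
	long big;		/* wide signed field */
	unsigned short small;	/* narrow unsigned field */
};

static void
null_chained(struct demo_attrs *a)
{
	/*
	 * Evaluates as a->big = (a->small = -1): big receives the value
	 * already converted to unsigned short, not -1.
	 */
	a->big = a->small = -1;
}

static void
null_separate(struct demo_attrs *a)
{
	/* Each field is converted from -1 on its own. */
	a->big = -1;
	a->small = -1;
}
```

After null_chained() the big field holds the maximum unsigned short value
rather than -1; after null_separate() it holds -1.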


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in bytes, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and can die.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@
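
A minimal sketch of the kind of range check this commit describes, written for userland. The struct, the function name and EXAMPLE_NGROUPS_MAX are assumptions made for illustration; the kernel code checks cr_ngroups against the real NGROUPS_MAX.

```c
#include <errno.h>

#define EXAMPLE_NGROUPS_MAX 16	/* assumed limit, stands in for NGROUPS_MAX */

/* Hypothetical credential structure as it might arrive from userland. */
struct example_cred {
	int		ec_ngroups;
	unsigned int	ec_groups[EXAMPLE_NGROUPS_MAX];
};

/* Reject a group count that would index outside ec_groups. */
int
example_check_cred(const struct example_cred *cr)
{
	if (cr->ec_ngroups < 0 || cr->ec_ngroups > EXAMPLE_NGROUPS_MAX)
		return EINVAL;
	return 0;
}
```

Validating the count before it is used as a loop bound or copy length is what keeps a bad value from user space from turning into an out-of-bounds access.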


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@
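
The reason for preferring the *_SAFE traversal can be sketched as follows. This is a userland illustration with hypothetical names (struct item, remove_odd and friends), not the mountlist code. The loop is written out by hand so that it only relies on TAILQ_FIRST, TAILQ_NEXT and TAILQ_REMOVE; on systems whose sys/queue.h provides TAILQ_FOREACH_SAFE, that macro expresses the same pattern.

```c
#include <stdlib.h>
#include <sys/queue.h>

struct item {
	int value;
	TAILQ_ENTRY(item) link;
};
TAILQ_HEAD(itemlist, item);

/* Append a new item; returns 0 on success, -1 if allocation fails. */
int
push_item(struct itemlist *head, int value)
{
	struct item *it = malloc(sizeof(*it));

	if (it == NULL)
		return -1;
	it->value = value;
	TAILQ_INSERT_TAIL(head, it, link);
	return 0;
}

/* Count the items currently on the list. */
int
count_items(struct itemlist *head)
{
	struct item *it;
	int n = 0;

	TAILQ_FOREACH(it, head, link)
		n++;
	return n;
}

/* Remove and free every item with an odd value; returns how many were removed. */
int
remove_odd(struct itemlist *head)
{
	struct item *it, *next;
	int removed = 0;

	for (it = TAILQ_FIRST(head); it != NULL; it = next) {
		next = TAILQ_NEXT(it, link);	/* saved before 'it' may be freed */
		if (it->value % 2 != 0) {
			TAILQ_REMOVE(head, it, link);
			free(it);
			removed++;
		}
	}
	return removed;
}
```

Saving the successor before the body runs is what makes it safe to unlink or free the current element during the walk.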


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too
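
A small sketch of the idiom this fix applies: since the width of time_t varies between platforms, the value is cast to long long and printed with %lld. The function name is hypothetical and the sketch uses userland snprintf rather than the kernel printf.

```c
#include <stdio.h>
#include <time.h>

/* Format a time_t portably by casting it to long long for %lld. */
int
example_format_time(char *buf, size_t len, time_t t)
{
	return snprintf(buf, len, "%lld", (long long)t);
}
```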


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continuing performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__)))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@
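
The behaviour described above can be sketched like this. The structure and function names are hypothetical and simplified; the sketch only assumes that a credential carries an effective gid plus a list of supplementary groups, and that membership should hold for either.

```c
#include <stddef.h>

/* Hypothetical, simplified credential: effective gid plus supplementary groups. */
struct example_ucred {
	unsigned int	egid;
	size_t		ngroups;
	unsigned int	groups[16];
};

/* Return 1 if gid is the effective gid or a supplementary group, else 0. */
int
example_groupmember(unsigned int gid, const struct example_ucred *cr)
{
	size_t i;

	if (cr->egid == gid)
		return 1;
	for (i = 0; i < cr->ngroups; i++) {
		if (cr->groups[i] == gid)
			return 1;
	}
	return 0;
}
```

Checking the effective gid first gives the same answer whether or not that gid also appears in the supplementary list.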


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8; code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow pulling out a usb stick with an ffs filesystem while it is mounted
and a file is being written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into its own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing lengths checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from ours (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches for McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a file's owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.314 25-Jan-2022 gnezdo

Capture a repeated pattern into sysctl_securelevel_int function

A few variables in the kernel are only writeable before securelevel is
raised. It makes sense to handle them with less code.

OK sthen@ bluhm@


# 1.313 25-Oct-2021 claudio

Revert commitid: ufM9BcSbXqfLpzBH;
Move vfs_stall_barrier() from the fd layer into vn_lock() and the vfs layer.
In some cases it can result in a deadlock while suspending.
Discussed with mpi@ and deraadt@


# 1.312 24-Oct-2021 jsg

use NULL not 0 for pointer values in kern
ok semarie@


# 1.311 23-Oct-2021 mpi

Sprinkle uvm_obj_destroy() over UVM object recycling code.

For now, only assert that the tree of pages is empty in uvm_obj_destroy().
This will soon be used to free the per-UVM object lock.

While here call uvm_obj_init() when new vnodes are allocated instead of
in uvn_attach(). Because vnodes and their associated UVM object are
currently never freed, it isn't easy to know where/when to garbage
collect the associated lock. So simply check that the reference of a
given object is 0 when uvn_attach().

Tested by many as part of a bigger diff.

ok kettenis@


# 1.310 23-Oct-2021 mpi

Assert that the KERNEL_LOCK() is held in vref(9).

This is a guard against pushing the lock too far in UVM's vnode land.

ok beck@


# 1.309 21-Oct-2021 claudio

Move vfs_stall_barrier() from the fd layer into vn_lock() and the vfs layer.
vfs stalling is used by suspend/resume and by vmt(4) to stall any
filesystem operation from altering the state on disk. All these
operations will call vn_lock and be stalled. Adjust vfs_stall_barrier()
to allow the lock owner to still progress so that suspend can sync
the filesystems after stalling vfs operation.
OK mpi@


# 1.308 20-Oct-2021 semarie

revert vnode: remove VLOCKSWORK and check locking when vop_islocked != nullop
(both kernel and userland bits)

GENERIC + VFSLCKDEBUG is broken with it.


# 1.307 19-Oct-2021 semarie

vnode: remove VLOCKSWORK and check locking when vop_islocked != nullop

This flag is currently used to mark or unmark a vnode to actively
check vnode locking semantics (when compiled with VFSLCKDEBUG).

Currently, VLOCKSWORK flag isn't properly set for several FS
implementations which have full locking support. This commit enables
proper checking for them too (cd9660, udf, fuse, msdosfs, tmpfs).

Instead of using a particular flag, it directly checks if
v_op->vop_islocked is nullop or not to activate or not the vnode
locking checks.

ok mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.306 31-Aug-2021 claudio

Swap lock flags so that LK_EXCLUSIVE is first like in all other places.


# 1.305 28-Apr-2021 claudio

Introduce a global vnode_mtx and use it to make vn_lock() safe to be called
without the KERNEL_LOCK.
This moves VXLOCK and VXWANT to a mutex protected v_lflag field and also
v_lockcount is protected by this mutex.

The vn_lock() dance is overly complex and all of this should probably be replaced
by a proper lock on the vnode but such a diff is a lot more complex. This
is an intermediate step so that at least some calls can be modified to grab
the KERNEL_LOCK later or not at all.

OK mpi@


Revision tags: OPENBSD_6_9_BASE
# 1.304 29-Jan-2021 claudio

Use NULL instead of 0 to clear v_socket pointer (which actually clears all
of the v_un pointers).
OK jsg@ mvs@


Revision tags: OPENBSD_6_8_BASE
# 1.303 23-Aug-2020 kn

Remove unused debug_syncprt, improve debug sysctl handling

"syncprt" is unused since kern/vfs_syscalls.c r1.147 from 2008.

Adding new debug sysctls is a bit opaque and looking at kern/kern_sysctl.c
the only visible difference between used and stub ctldebug structs in the
debugvars[] array is their extern keyword, indicating that it is defined
elsewhere.

sys/sysctl.h declares all debugN members as extern upfront, but these
declarations are not needed.

Remove the unused debug sysctl, rename the only remaining one to something
meaningful and remove forward declarations from /sys/sysctl.h; this way,
adding new debug sysctls is a matter of adding extern and coming up with a
name, which is nicer to read on its own and better to grep for.

OK mpi


# 1.302 22-Aug-2020 kn

Move sysctl(2) CTL_DEBUG from DEBUG to new DEBUG_SYSCTL

Adding "debug.my-knob" sysctls is really helpful to select different
code paths and/or log on demand during runtime without recompile,
but as this code is under DEBUG, lots of other noise comes with it
which is often undesired, at least when looking at specific subsystems
only.

Adding globals to the kernel and breaking into DDB to change them helps,
but that does not work over SSH, hence the need for debug sysctls.

Introduces DEBUG_SYSCTL to make use of the "debug" MIB without the rest of
DEBUG; it's DEBUG_SYSCTL and not SYSCTL_DEBUG because it's not a general
option for all of sysctl(2).

OK gnezdo


Revision tags: OPENBSD_6_7_BASE
# 1.301 27-Mar-2020 anton

Relax the lockcount assertion in vputonfreelist(). Back when I fixed
several problems with the vnode exclusive lock implementation, I
overlooked the fact that a vnode can be in a state where the usecount is
zero while the holdcount is still positive. There could still be
threads waiting on the vnode lock in uvn_io() as long as the holdcount
is positive.

"go ahead" mpi@

Reported-by: syzbot+767d6deb1a647850a0ca@syzkaller.appspotmail.com


# 1.300 13-Feb-2020 claudio

Move the LK_DRAIN logic from VOP_LOCK() to vclean() the only caller of
VOP_LOCK with LK_DRAIN. This simplifies VOP_LOCK() a fair bit.
OK visa@


# 1.299 20-Jan-2020 claudio

struct vops is not modified during runtime so use const which moves each
into read-only data segment.
OK deraadt@ tedu@


# 1.298 10-Jan-2020 bluhm

Convert the vnode list at the mount point into a tailq. During
unmount this list is traversed and the dirty vnodes are flushed to
disk. Forced unmount expects that the list is empty after flushing,
otherwise the kernel panics with "dangling vnode". As the write
to disk can sleep, new vnodes may be inserted. If softdep is
enabled, resolving the dependencies creates new dirty vnodes and
inserts them to the list. To fix the panic, let insmntque() insert
new vnodes at the tail of the list. Then vflush() will still catch
them while traversing the list in forward direction.
OK tedu@ millert@ visa@


# 1.297 30-Dec-2019 bluhm

In vcount() a safe loop over vnodes was committed to 4.4BSD in 1994.
This is not necessary as the loop is restarted after vgone(). Switch
to SLIST_FOREACH without _SAFE.
OK visa@


# 1.296 27-Dec-2019 bluhm

Convert the speclisth hash buckets into SLIST macros. This makes
the vnode alias code more readable.
OK visa@


# 1.295 26-Dec-2019 bluhm

Fix white spaces.


# 1.294 08-Dec-2019 mpi

Convert infinite sleeps to tsleep_nsec(9).

ok visa@, jca@


Revision tags: OPENBSD_6_6_BASE
# 1.293 26-Aug-2019 anton

When a thread tries to exclusively lock a vnode, the same thread must
ensure that any other thread currently trying to acquire the underlying
vnode lock has observed that the same vnode is about to be exclusively
locked. Such threads must then sleep until the exclusive lock has been
released and then try to acquire the lock again. Otherwise, exclusive
access to the vnode cannot be guaranteed.

Thanks to naddy@ and visa@ for testing; ok visa@

Reported-by: syzbot+374d0e7e2400004957f7@syzkaller.appspotmail.com


# 1.292 25-Jul-2019 cheloha

vinvalbuf(9): tsleep(9) -> tsleep_nsec(9); ok millert@


# 1.291 19-Jul-2019 cheloha

vwaitforio(9): tsleep(9) -> tsleep_nsec(9); ok visa@


# 1.290 28-Jun-2019 visa

Skip VFS barrier lock during normal operation to reduce overhead.
This removes a system-wide serialization point, which might help
finding timing-related bugs.

OK deraadt@ anton@


# 1.289 09-Jun-2019 beck

Add a temporary workaround to make removal of giant files better

mlarkin@ noticed we would freeze while removing enormous files because
of the amount of work done to invalidate buffers on unlink. This adds
a temporary workaround to ensure we give up the lock and yield while
doing this.

The longer term answer will be to move these buffers to another list
and not do the work here.

ok deraadt@


# 1.288 19-Apr-2019 visa

Add a subsystem lock for vfs_lockf.c. This enables calling lf_advlock()
and lf_purgelocks() without the kernel lock.

OK anton@ mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.287 02-Apr-2019 visa

Restrict which filesystems are available for swap. This rules out
obvious misconfigurations that cannot work.

OK mpi@ tedu@


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa
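
A minimal sketch of the idea, assuming a per-vnode flag word. The names
DEMO_VBIOERROR, struct demo_vnode, demo_write_done() and demo_fsync() are
hypothetical and are not the kernel's. Whether the real code ever clears
the flag is not stated in the message, so this sketch simply leaves it set.

```c
#include <errno.h>

#define DEMO_VBIOERROR	0x01	/* hypothetical "a write failed" flag */

struct demo_vnode {
	int bioflag;
};

/* Called when a buffer write completes; remember any failure. */
void
demo_write_done(struct demo_vnode *vp, int error)
{
	if (error)
		vp->bioflag |= DEMO_VBIOERROR;
}

/* fsync reports the remembered failure even though the buffer is gone. */
int
demo_fsync(struct demo_vnode *vp)
{
	if (vp->bioflag & DEMO_VBIOERROR)
		return EIO;
	return 0;
}
```

The point of keeping the state on the vnode rather than on the buffer is
that the buffer is thrown away after the failed write, so nothing else is
left to carry the error to a later fsync.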


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules no longer exist, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@
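
The hazard can be sketched in userland as follows. This is an illustration
under stated assumptions, not the kernel code: struct demo_mount and
count_stale_visits() are hypothetical, the _SAFE style loop is written out
by hand, and "another process unmounting the next entry while we sleep" is
simulated by removing it from the list inside the loop body.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a mount point on the mount list. */
struct demo_mount {
	int gone;	/* set once the entry has been "unmounted" */
	TAILQ_ENTRY(demo_mount) link;
};
TAILQ_HEAD(demo_mntlist, demo_mount);

/*
 * Walk three entries.  While the first one is processed, the second one
 * is removed, as if another process had unmounted it during a sleep.
 * With cache_next set, the next pointer is fetched before the body runs
 * (the _SAFE pattern); otherwise it is fetched afterwards.
 * Returns how many already-removed entries the walk stepped onto.
 */
int
count_stale_visits(int cache_next)
{
	struct demo_mntlist head = TAILQ_HEAD_INITIALIZER(head);
	struct demo_mount m[3];
	struct demo_mount *mp, *nmp;
	int i, stale = 0;

	for (i = 0; i < 3; i++) {
		m[i].gone = 0;
		TAILQ_INSERT_TAIL(&head, &m[i], link);
	}

	for (mp = TAILQ_FIRST(&head); mp != NULL; mp = nmp) {
		if (mp->gone) {
			/* In the kernel this would be a use-after-free. */
			stale++;
			break;
		}
		nmp = TAILQ_NEXT(mp, link);	/* cached early, as _SAFE does */
		if (mp == &m[0]) {
			TAILQ_REMOVE(&head, &m[1], link);
			m[1].gone = 1;
		}
		if (!cache_next)
			nmp = TAILQ_NEXT(mp, link);	/* re-read after the body */
	}
	return stale;
}
```

The cached pointer protects against the loop body removing the current
entry, but it is exactly what goes stale when somebody else removes the
next entry.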


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
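
The changed code is not shown in this log, so the following is only a
sketch of the general pattern with hypothetical names (enum demo_vtype,
demo_vtype_name()): casting the enum to unsigned int lets a single
comparison reject both negative and too-large values before the value is
used as an array index.

```c
/* Hypothetical enum; DEMO_VMAX is one past the last valid value. */
enum demo_vtype { DEMO_VNON, DEMO_VREG, DEMO_VDIR, DEMO_VMAX };

static const char *const demo_names[DEMO_VMAX] = {
	"VNON", "VREG", "VDIR"
};

const char *
demo_vtype_name(enum demo_vtype t)
{
	/*
	 * After the cast, a negative value becomes a huge unsigned one,
	 * so one upper-bound test covers both directions.
	 */
	if ((unsigned int)t >= DEMO_VMAX)
		return "unknown";
	return demo_names[t];
}
```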


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
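
Why the order of a chained assignment matters can be shown with a small
userland sketch. It assumes a two-field structure with hypothetical names
(struct attrs, NOVAL, set_unsafe(), set_safe_chain(), set_separate()); it is
not the vattr_null() code itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NOVAL	(-1)	/* stand-in for a "no value" marker */

struct attrs {
	int64_t		wide;	/* signed, 64 bits */
	uint16_t	narrow;	/* unsigned, 16 bits */
};

/*
 * Unsafe order: -1 is first converted to uint16_t (65535), and that
 * converted value is what gets stored in the wide field.
 */
void
set_unsafe(struct attrs *a)
{
	assert(a != NULL);
	a->wide = a->narrow = NOVAL;
}

/*
 * Safe order: the wide signed field receives -1 first, and the narrow
 * field is then assigned from it.
 */
void
set_safe_chain(struct attrs *a)
{
	assert(a != NULL);
	a->narrow = a->wide = NOVAL;
}

/* Separate statements: each field is converted from NOVAL on its own. */
void
set_separate(struct attrs *a)
{
	assert(a != NULL);
	a->wide = NOVAL;
	a->narrow = NOVAL;
}
```

With separate statements the result no longer depends on the order in which
the fields happen to be listed.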


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in bytes, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and can die.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@
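
A minimal sketch of the range check described above, assuming a simplified
credential structure. The names struct export_cred and export_cred_check()
are hypothetical; only cr_ngroups, NGROUPS_MAX and EINVAL come from the
message.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>

#ifndef NGROUPS_MAX
#define NGROUPS_MAX	16	/* fallback if <limits.h> does not define it */
#endif

/* Hypothetical, reduced stand-in for the user-supplied credentials. */
struct export_cred {
	int	cr_ngroups;
};

/* Reject group counts outside [0, NGROUPS_MAX] before they are used. */
int
export_cred_check(const struct export_cred *cr)
{
	assert(cr != NULL);
	if (cr->cr_ngroups < 0 || cr->cr_ngroups > NGROUPS_MAX)
		return (EINVAL);
	return (0);
}
```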


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too
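
The cast mentioned above is the usual portable way to print a time_t, whose
width differs between platforms. A short userland sketch, with format_time()
as a hypothetical helper name:

```c
#include <stdio.h>
#include <time.h>

/*
 * Print a time_t with %lld by casting it to long long, so the format
 * string matches the argument regardless of how wide time_t is.
 * Returns the value of snprintf().
 */
int
format_time(char *buf, size_t len, time_t t)
{
	return (snprintf(buf, len, "%lld", (long long)t));
}
```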


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continuing performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

Add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8 code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stata
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow to pull out an usb stick with ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock noone
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding by the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@
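
The mismatch described in this entry can be reproduced in userland. The sketch
below is illustrative only: struct payload, struct record and the three helper
functions are hypothetical names, not the kernel's KERN_VNODE layout. It
compares the size of a hand-packed record (members written back to back) with
the size the compiler gives a C struct holding the same members.

```c
#include <stddef.h>

/* Hypothetical payload; stands in for the per-entry data. */
struct payload {
	double	stamp;
	int	count;
};

/* Layout a reader expects when it follows the normal C struct rules. */
struct record {
	int		id;
	struct payload	data;
};

/* Bytes per record when the producer writes the members back to back. */
size_t
packed_size(void)
{
	return sizeof(int) + sizeof(struct payload);
}

/* Bytes per record when both sides share one struct definition. */
size_t
padded_size(void)
{
	return sizeof(struct record);
}

/* Where the payload starts according to the shared definition. */
size_t
payload_offset(void)
{
	return offsetof(struct record, data);
}
```

Whenever padded_size() and packed_size() differ on a given ABI, a consumer
stepping through the buffer with the struct definition drifts out of sync
after the first record. Sharing one definition in a header, as the commit
does, removes that dependency on the ABI.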


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information about the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible for waiting for physical memory, it will
only fail (waitok) when it runs out of its backing vm_map; in that case,
carefully drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return whether it actually succeeded in freeing some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks of intense debugging and we need -current
to be stable, back out everything to the state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into its own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Oops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing lengths checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from ours (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches from McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a file's owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.313 25-Oct-2021 claudio

Revert commitid: ufM9BcSbXqfLpzBH;
Move vfs_stall_barrier() from the fd layer into vn_lock() and the vfs layer.
In some cases it can result in a deadlock while suspending.
Discussed with mpi@ and deraadt@


# 1.312 24-Oct-2021 jsg

use NULL not 0 for pointer values in kern
ok semarie@


# 1.311 23-Oct-2021 mpi

Sprinkle uvm_obj_destroy() over UVM object recycling code.

For now, only assert that the tree of pages is empty in uvm_obj_destroy().
This will soon be used to free the per-UVM object lock.

While here call uvm_obj_init() when new vnodes are allocated instead of
in uvn_attach(). Because vnodes and their associated UVM object are
currently never freed, it isn't easy to know where/when to garbage
collect the associated lock. So simply check that the reference of a
given object is 0 when uvn_attach().

Tested by many as part of a bigger diff.

ok kettenis@


# 1.310 23-Oct-2021 mpi

Assert that the KERNEL_LOCK() is held in vref(9).

This is a guard against pushing the lock too far in UVM's vnode land.

ok beck@


# 1.309 21-Oct-2021 claudio

Move vfs_stall_barrier() from the fd layer into vn_lock() and the vfs layer.
vfs stalling is used by suspend/resume and by vmt(4) to stall any
filesystem operation from altering the state on disk. All these
operations will call vn_lock and be stalled. Adjust vfs_stall_barrier()
to allow the lock owner to still progress so that suspend can sync
the filesystems after stalling vfs operation.
OK mpi@


# 1.308 20-Oct-2021 semarie

revert vnode: remove VLOCKSWORK and check locking when vop_islocked != nullop
(both kernel and userland bits)

GENERIC + VFSLCKDEBUG is broken with it.


# 1.307 19-Oct-2021 semarie

vnode: remove VLOCKSWORK and check locking when vop_islocked != nullop

This flag is currently used to mark or unmark a vnode to actively
check vnode locking semantic (when compiled with VFSLCKDEBUG).

Currently, the VLOCKSWORK flag isn't properly set for several FS
implementations which have full locking support. This commit enables
proper checking for them too (cd9660, udf, fuse, msdosfs, tmpfs).

Instead of using a particular flag, it directly checks if
v_op->vop_islocked is nullop or not to activate or not the vnode
locking checks.

ok mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.306 31-Aug-2021 claudio

Swap lock flags so that LK_EXCLUSIVE is first like in all other places.


# 1.305 28-Apr-2021 claudio

Introduce a global vnode_mtx and use it to make vn_lock() safe to be called
without the KERNEL_LOCK.
This moves VXLOCK and VXWANT to a mutex protected v_lflag field and also
v_lockcount is protected by this mutex.

The vn_lock() dance is overly complex and all of this should probably be replaced
by a proper lock on the vnode but such a diff is a lot more complex. This
is an intermediate step so that at least some calls can be modified to grab
the KERNEL_LOCK later or not at all.

OK mpi@


Revision tags: OPENBSD_6_9_BASE
# 1.304 29-Jan-2021 claudio

Use NULL instead of 0 to clear v_socket pointer (which actually clears all
of the v_un pointers).
OK jsg@ mvs@


Revision tags: OPENBSD_6_8_BASE
# 1.303 23-Aug-2020 kn

Remove unused debug_syncprt, improve debug sysctl handling

"syncprt" is unused since kern/vfs_syscalls.c r1.147 from 2008.

Adding new debug sysctls is a bit opaque and looking at kern/kern_sysctl.c
the only visible difference between used and stub ctldebug structs in the
debugvars[] array is their extern keyword, indicating that it is defined
elsewhere.

sys/sysctl.h declares all debugN members as extern upfront, but these
declarations are not needed.

Remove the unused debug sysctl, rename the only remaining one to something
meaningful and remove forward declarations from /sys/sysctl.h; this way,
adding new debug sysctls is a matter of adding extern and coming up with a
name, which is nicer to read on its own and better to grep for.

OK mpi


# 1.302 22-Aug-2020 kn

Move sysctl(2) CTL_DEBUG from DEBUG to new DEBUG_SYSCTL

Adding "debug.my-knob" sysctls is really helpful to select different
code paths and/or log on demand during runtime without recompile,
but as this code is under DEBUG, lots of other noise comes with it
which is often undesired, at least when looking at specific subsystems
only.

Adding globals to the kernel and breaking into DDB to change them helps,
but that does not work over SSH, hence the need for debug sysctls.

Introduces DEBUG_SYSCTL to make use of the "debug" MIB without the rest of
DEBUG; it's DEBUG_SYSCTL and not SYSCTL_DEBUG because it's not a general
option for all of sysctl(2).

OK gnezdo


Revision tags: OPENBSD_6_7_BASE
# 1.301 27-Mar-2020 anton

Relax the lockcount assertion in vputonfreelist(). Back when I fixed
several problems with the vnode exclusive lock implementation, I
overlooked the fact that a vnode can be in a state where the usecount is
zero while the holdcount is still positive. There could still be
threads waiting on the vnode lock in uvn_io() as long as the holdcount
is positive.

"go ahead" mpi@

Reported-by: syzbot+767d6deb1a647850a0ca@syzkaller.appspotmail.com


# 1.300 13-Feb-2020 claudio

Move the LK_DRAIN logic from VOP_LOCK() to vclean() the only caller of
VOP_LOCK with LK_DRAIN. This simplifies VOP_LOCK() a fair bit.
OK visa@


# 1.299 20-Jan-2020 claudio

struct vops is not modified during runtime so use const which moves each
into read-only data segment.
OK deraadt@ tedu@


# 1.298 10-Jan-2020 bluhm

Convert the vnode list at the mount point into a tailq. During
unmount this list is traversed and the dirty vnodes are flushed to
disk. Forced unmount expects that the list is empty after flushing,
otherwise the kernel panics with "dangling vnode". As the write
to disk can sleep, new vnodes may be inserted. If softdep is
enabled, resolving the dependencies creates new dirty vnodes and
inserts them to the list. To fix the panic, let insmntque() insert
new vnodes at the tail of the list. Then vflush() will still catch
them while traversing the list in forward direction.
OK tedu@ millert@ visa@
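
The effect of inserting at the tail can be shown with the queue(3) macros in
userland. This is a minimal sketch, not the kernel code: struct demo_vnode,
demo_flush() and demo_run() are hypothetical names, and "flushing" is reduced
to counting visits. One node creates a new node while it is being visited;
the new node is reached by the forward walk only when it is appended at the
tail.

```c
#include <stddef.h>
#include <sys/queue.h>

struct demo_vnode {
	TAILQ_ENTRY(demo_vnode)	link;
	int			spawns;	/* visiting this node creates another */
};

TAILQ_HEAD(demo_list, demo_vnode);

/*
 * Walk the list front to back and count the nodes visited.  The first
 * node with spawns set inserts extra into the list during the walk,
 * either at the tail or at the head.
 */
int
demo_flush(struct demo_list *head, struct demo_vnode *extra, int at_tail)
{
	struct demo_vnode *vp;
	int visited = 0;

	TAILQ_FOREACH(vp, head, link) {
		visited++;
		if (vp->spawns && extra != NULL) {
			if (at_tail)
				TAILQ_INSERT_TAIL(head, extra, link);
			else
				TAILQ_INSERT_HEAD(head, extra, link);
			extra = NULL;	/* insert only once */
		}
	}
	return visited;
}

/* Build a two-node list whose first node spawns a third, then walk it. */
int
demo_run(int at_tail)
{
	struct demo_list head = TAILQ_HEAD_INITIALIZER(head);
	struct demo_vnode a = { .spawns = 1 }, b = { .spawns = 0 };
	struct demo_vnode late = { .spawns = 0 };

	TAILQ_INSERT_TAIL(&head, &a, link);
	TAILQ_INSERT_TAIL(&head, &b, link);
	return demo_flush(&head, &late, at_tail);
}
```

With tail insertion the walk sees all three nodes; with head insertion the
late node lands behind the cursor and is missed, which corresponds to the
"dangling vnode" situation the commit describes.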


# 1.297 30-Dec-2019 bluhm

In vcount() a safe loop over vnodes was committed to 4.4BSD in 1994.
This is not necessary as the loop is restarted after vgone(). Switch
to SLIST_FOREACH without _SAFE.
OK visa@


# 1.296 27-Dec-2019 bluhm

Convert the speclisth hash buckets into SLIST macros. This makes
the vnode alias code more readable.
OK visa@


# 1.295 26-Dec-2019 bluhm

Fix white spaces.


# 1.294 08-Dec-2019 mpi

Convert infinite sleeps to tsleep_nsec(9).

ok visa@, jca@


Revision tags: OPENBSD_6_6_BASE
# 1.293 26-Aug-2019 anton

When a thread tries to exclusively lock a vnode, the same thread must
ensure that any other thread currently trying to acquire the underlying
vnode lock has observed that the same vnode is about to be exclusively
locked. Such threads must then sleep until the exclusive lock has been
released and then try to acquire the lock again. Otherwise, exclusive
access to the vnode cannot be guaranteed.

Thanks to naddy@ and visa@ for testing; ok visa@

Reported-by: syzbot+374d0e7e2400004957f7@syzkaller.appspotmail.com


# 1.292 25-Jul-2019 cheloha

vinvalbuf(9): tsleep(9) -> tsleep_nsec(9); ok millert@


# 1.291 19-Jul-2019 cheloha

vwaitforio(9): tsleep(9) -> tsleep_nsec(9); ok visa@


# 1.290 28-Jun-2019 visa

Skip VFS barrier lock during normal operation to reduce overhead.
This removes a system-wide serialization point, which might help
finding timing-related bugs.

OK deraadt@ anton@


# 1.289 09-Jun-2019 beck

Add a temporary workaround to make removal of giant files better

mlarkin@ noticed we would freeze while removing enormous files because
of the amount of work done to invalidate buffers on unlink. This adds
a temporary workaround to ensure we give up the lock and yield while
doing this.

The longer term answer will be to move these buffers to another list
and not do the work here.

ok deraadt@


# 1.288 19-Apr-2019 visa

Add a subsystem lock for vfs_lockf.c. This enables calling lf_advlock()
and lf_purgelocks() without the kernel lock.

OK anton@ mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.287 02-Apr-2019 visa

Restrict which filesystems are available for swap. This rules out
obvious misconfigurations that cannot work.

OK mpi@ tedu@


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa
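
A small userland model of the idea, under stated assumptions: the flag name
DEMO_VBIOERROR, struct demo_vnode and the three functions are hypothetical,
and the flag is treated as sticky (never cleared), which the commit message
does not spell out. A failed write-back discards the buffer but records the
failure on the vnode, so a later fsync reports it instead of returning
success.

```c
#include <errno.h>

#define DEMO_VBIOERROR	0x01	/* hypothetical: a past write-back failed */

struct demo_vnode {
	int	flags;
	int	dirty;		/* buffers still waiting to be written */
};

/* Write back one buffer; on failure the buffer is dropped anyway. */
void
demo_writeback(struct demo_vnode *vp, int io_failed)
{
	if (vp->dirty > 0)
		vp->dirty--;
	if (io_failed)
		vp->flags |= DEMO_VBIOERROR;	/* remember the lost write */
}

/* fsync model: no dirty buffers is not enough, a recorded error wins. */
int
demo_fsync(struct demo_vnode *vp)
{
	if (vp->flags & DEMO_VBIOERROR)
		return EIO;
	return 0;
}

/* One dirty buffer, one write-back attempt, then fsync. */
int
demo_scenario(int io_failed)
{
	struct demo_vnode vn = { 0, 1 };

	demo_writeback(&vn, io_failed);
	return demo_fsync(&vn);
}
```

Without the flag, demo_fsync() would see no dirty buffers after the failed
write and return 0, which is the lost-error case the commit fixes.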


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@
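
As a rough userland analogue, the pattern looks like the sketch below. It
uses C11 <stdatomic.h> as a stand-in for the kernel's own atomic primitives,
and struct demo_vfsconf, demo_ref(), demo_unref() and demo_balance() are
hypothetical names chosen for illustration.

```c
#include <stdatomic.h>

struct demo_vfsconf {
	atomic_uint	refcount;	/* mounts currently using this type */
};

/* Take a reference; returns the new count. */
unsigned int
demo_ref(struct demo_vfsconf *vfc)
{
	return atomic_fetch_add(&vfc->refcount, 1) + 1;
}

/* Drop a reference; returns the new count. */
unsigned int
demo_unref(struct demo_vfsconf *vfc)
{
	return atomic_fetch_sub(&vfc->refcount, 1) - 1;
}

/* Two references taken, one dropped: one should remain. */
unsigned int
demo_balance(void)
{
	struct demo_vfsconf vfc;

	atomic_init(&vfc.refcount, 0);
	demo_ref(&vfc);
	demo_ref(&vfc);
	return demo_unref(&vfc);
}
```

An unsigned counter updated with read-modify-write atomics needs no separate
lock for the increment and decrement themselves.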


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules are no longer supported, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
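
The pattern can be sketched as follows. The enum and table here are
hypothetical (enum demo_type, demo_names[], demo_name()) and are not the
kernel's own definitions. Converting the enum value to an unsigned type lets
a single comparison reject both "negative" and too-large values, regardless
of which underlying integer type the compiler picked for the enum.

```c
enum demo_type { DEMO_NON, DEMO_REG, DEMO_DIR, DEMO_COUNT };

static const char *const demo_names[DEMO_COUNT] = { "non", "reg", "dir" };

/*
 * Return the name for t, or "unknown" when t is out of range.
 * The cast folds the lower and upper bound into one unsigned comparison.
 */
const char *
demo_name(enum demo_type t)
{
	if ((unsigned int)t >= DEMO_COUNT)
		return "unknown";
	return demo_names[t];
}
```

A value that arrives from untrusted data and was stored into the enum with a
cast, such as (enum demo_type)-1, is then rejected by the same test that
rejects DEMO_COUNT and above.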


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.
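
The fragility of chained assignments can be shown with a minimal sketch.
The struct, its field names and VNOVAL_SKETCH are hypothetical stand-ins,
not the real struct vattr; the only assumption carried over is that the
"no value" sentinel is -1.

```c
#include <stdint.h>

/* Hypothetical sentinel, assumed to be -1 like the kernel's "no value". */
#define VNOVAL_SKETCH (-1)

/* Hypothetical two-field struct mixing a wide signed and a narrow unsigned type. */
struct demo_attr {
	int64_t  wide;
	uint16_t narrow;
};

/*
 * Chained form: the value stored in 'wide' is the value of the inner
 * assignment, which has already been converted to uint16_t.
 */
static void
fill_chained(struct demo_attr *a)
{
	a->wide = a->narrow = VNOVAL_SKETCH;
}

/* Separate statements: each field converts the sentinel on its own. */
static void
fill_separate(struct demo_attr *a)
{
	a->wide = VNOVAL_SKETCH;
	a->narrow = VNOVAL_SKETCH;
}
```

With the chained form the wide field ends up holding the truncated
unsigned value instead of -1; swapping the order of the two lvalues
would hide the problem, which is the ordering dependence described above.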

ok deraadt@ miod@


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in byte, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and dies.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@
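
A range check of this kind might look like the sketch below. The function
name and the NGROUPS_MAX_SKETCH value are assumptions made for
illustration; this is not the code that was committed.

```c
#include <errno.h>

/* Placeholder limit; the real NGROUPS_MAX comes from the system headers. */
#define NGROUPS_MAX_SKETCH 16

/*
 * Hypothetical helper: reject a group count supplied from userland
 * before it is used to size or index anything.
 * Returns 0 if the count is usable, EINVAL otherwise.
 */
static int
check_ngroups(int ngroups)
{
	if (ngroups < 0 || ngroups > NGROUPS_MAX_SKETCH)
		return EINVAL;
	return 0;
}
```

Both bounds matter: a negative count and an oversized count are equally
untrustworthy when the value comes from a mount(2) caller.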


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continued performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell, never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used in checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8 code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow to pull out an usb stick with ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather then vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@
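
The mismatch described in 1.86 can be illustrated with a small userland sketch. The record below is hypothetical (it is not the real KERN_VNODE structure): a producer that writes fields back to back and a consumer that reads them through the compiler's struct layout disagree about where each field lives once alignment padding is involved.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record, not the structure used by the kernel or pstat(8). */
struct rec {
	void	*handle;
	char	 tag;
	int	 count;
};

/* Size of the hand-packed encoding: fields back to back, no padding. */
#define PACKED_SIZE	(sizeof(void *) + sizeof(char) + sizeof(int))

/* Write the fields of `r` into `buf` with no padding; returns bytes written. */
size_t
pack_rec(const struct rec *r, unsigned char *buf, size_t buflen)
{
	size_t off = 0;

	assert(buflen >= PACKED_SIZE);
	memcpy(buf + off, &r->handle, sizeof(r->handle));
	off += sizeof(r->handle);
	memcpy(buf + off, &r->tag, sizeof(r->tag));
	off += sizeof(r->tag);
	memcpy(buf + off, &r->count, sizeof(r->count));
	off += sizeof(r->count);
	return off;
}

/* Where a reader using the compiler's layout would look for `count`. */
size_t
struct_count_offset(void)
{
	return offsetof(struct rec, count);
}
```

On platforms where int is aligned more strictly than char, struct_count_offset() is larger than the packed offset, so a reader casting the packed buffer to struct rec would pick up the wrong bytes. Sharing one structure definition in a header, as the commit does, removes the second layout entirely.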


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks of intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into its own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing length checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from our (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches for McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a files owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.312 24-Oct-2021 jsg

use NULL not 0 for pointer values in kern
ok semarie@


# 1.311 23-Oct-2021 mpi

Sprinkle uvm_obj_destroy() over UVM object recycling code.

For now, only assert that the tree of pages is empty in uvm_obj_destroy().
This will soon be used to free the per-UVM object lock.

While here call uvm_obj_init() when new vnodes are allocated instead of
in uvn_attach(). Because vnodes and their associated UVM object are
currently never freed, it isn't easy to know where/when to garbage
collect the associated lock. So simply check that the reference of a
given object is 0 when uvn_attach().

Tested by many as part of a bigger diff.

ok kettenis@


# 1.310 23-Oct-2021 mpi

Assert that the KERNEL_LOCK() is held in vref(9).

This is a guard against pushing the lock too far in UVM's vnode land.

ok beck@


# 1.309 21-Oct-2021 claudio

Move vfs_stall_barrier() from the fd layer into vn_lock() and the vfs layer.
vfs stalling is used by suspend/resume and by vmt(4) to stall any
filesystem operation from altering the state on disk. All these
operations will call vn_lock and be stalled. Adjust vfs_stall_barrier()
to allow the lock owner to still progress so that suspend can sync
the filesystems after stalling vfs operation.
OK mpi@


# 1.308 20-Oct-2021 semarie

revert vnode: remove VLOCKSWORK and check locking when vop_islocked != nullop
(both kernel and userland bits)

GENERIC + VFSLCKDEBUG is broken with it.


# 1.307 19-Oct-2021 semarie

vnode: remove VLOCKSWORK and check locking when vop_islocked != nullop

This flag is currently used to mark or unmark a vnode to actively
check vnode locking semantic (when compiled with VFSLCKDEBUG).

Currently, VLOCKSWORK flag isn't properly set for several FS
implementation which have full locking support. This commit enables
proper checking for them too (cd9660, udf, fuse, msdosfs, tmpfs).

Instead of using a particular flag, it directly checks if
v_op->vop_islocked is nullop or not to activate or not the vnode
locking checks.

ok mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.306 31-Aug-2021 claudio

Swap lock flags so that LK_EXCLUSIVE is first like in all other places.


# 1.305 28-Apr-2021 claudio

Introduce a global vnode_mtx and use it to make vn_lock() safe to be called
without the KERNEL_LOCK.
This moves VXLOCK and VXWANT to a mutex protected v_lflag field and also
v_lockcount is protected by this mutex.

The vn_lock() dance is overly complex and all of this should probably be replaced
by a proper lock on the vnode but such a diff is a lot more complex. This
is an intermediate step so that at least some calls can be modified to grab
the KERNEL_LOCK later or not at all.

OK mpi@


Revision tags: OPENBSD_6_9_BASE
# 1.304 29-Jan-2021 claudio

Use NULL instead of 0 to clear v_socket pointer (which actually clears all
of the v_un pointers).
OK jsg@ mvs@


Revision tags: OPENBSD_6_8_BASE
# 1.303 23-Aug-2020 kn

Remove unused debug_syncprt, improve debug sysctl handling

"syncprt" is unused since kern/vfs_syscalls.c r1.147 from 2008.

Adding new debug sysctls is a bit opaque and looking at kern/kern_sysctl.c
the only visible difference between used and stub ctldebug structs in the
debugvars[] array is their extern keyword, indicating that it is defined
elsewhere.

sys/sysctl.h declares all debugN members as extern upfront, but these
declarations are not needed.

Remove the unused debug sysctl, rename the only remaining one to something
meaningful and remove forward declarations from /sys/sysctl.h; this way,
adding new debug sysctls is a matter of adding extern and coming up with a
name, which is nicer to read on its own and better to grep for.

OK mpi


# 1.302 22-Aug-2020 kn

Move sysctl(2) CTL_DEBUG from DEBUG to new DEBUG_SYSCTL

Adding "debug.my-knob" sysctls is really helpful to select different
code paths and/or log on demand during runtime without recompile,
but as this code is under DEBUG, lots of other noise comes with it
which is often undesired, at least when looking at specific subsystems
only.

Adding globals to the kernel and breaking into DDB to change them helps,
but that does not work over SSH, hence the need for debug sysctls.

Introduces DEBUG_SYSCTL to make use of the "debug" MIB without the rest of
DEBUG; it's DEBUG_SYSCTL and not SYSCTL_DEBUG because it's not a general
option for all of sysctl(2).

OK gnezdo


Revision tags: OPENBSD_6_7_BASE
# 1.301 27-Mar-2020 anton

Relax the lockcount assertion in vputonfreelist(). Back when I fixed
several problems with the vnode exclusive lock implementation, I
overlooked the fact that a vnode can be in a state where the usecount is
zero while the holdcount is still positive. There could still be
threads waiting on the vnode lock in uvn_io() as long as the holdcount
is positive.

"go ahead" mpi@

Reported-by: syzbot+767d6deb1a647850a0ca@syzkaller.appspotmail.com


# 1.300 13-Feb-2020 claudio

Move the LK_DRAIN logic from VOP_LOCK() to vclean() the only caller of
VOP_LOCK with LK_DRAIN. This simplifies VOP_LOCK() a fair bit.
OK visa@


# 1.299 20-Jan-2020 claudio

struct vops is not modified during runtime so use const which moves each
into read-only data segment.
OK deraadt@ tedu@


# 1.298 10-Jan-2020 bluhm

Convert the vnode list at the mount point into a tailq. During
unmount this list is traversed and the dirty vnodes are flushed to
disk. Forced unmount expects that the list is empty after flushing,
otherwise the kernel panics with "dangling vnode". As the write
to disk can sleep, new vnodes may be inserted. If softdep is
enabled, resolving the dependencies creates new dirty vnodes and
inserts them to the list. To fix the panic, let insmntque() insert
new vnodes at the tail of the list. Then vflush() will still catch
them while traversing the list in forward direction.
OK tedu@ millert@ visa@
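
The reasoning in 1.298 (a forward traversal only sees elements inserted behind the cursor if they go to the tail) can be sketched in userland with the TAILQ macros from <sys/queue.h>. The node type and helper names here are hypothetical; this is not the insmntque() or vflush() code.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

struct node {				/* hypothetical stand-in for a vnode */
	int			 id;
	TAILQ_ENTRY(node)	 link;
};
TAILQ_HEAD(nodelist, node);

/*
 * Walk the list front to back and insert `extra` while the first
 * element is being visited. Returns how many elements the walk saw.
 */
int
walk_with_insert(struct nodelist *head, struct node *extra, int at_tail)
{
	struct node *n;
	int visited = 0, inserted = 0;

	assert(!TAILQ_EMPTY(head));	/* need an element to trigger the insert */
	TAILQ_FOREACH(n, head, link) {
		visited++;
		if (!inserted) {
			if (at_tail)
				TAILQ_INSERT_TAIL(head, extra, link);
			else
				TAILQ_INSERT_HEAD(head, extra, link);
			inserted = 1;
		}
	}
	return visited;
}

/* Build a two-element list and run the walk above on it. */
int
demo(int at_tail)
{
	struct nodelist head = TAILQ_HEAD_INITIALIZER(head);
	struct node a = { .id = 1 }, b = { .id = 2 }, extra = { .id = 3 };

	TAILQ_INSERT_TAIL(&head, &a, link);
	TAILQ_INSERT_TAIL(&head, &b, link);
	return walk_with_insert(&head, &extra, at_tail);
}
```

With tail insertion the walk visits all three elements; with head insertion the new element lands in front of the cursor and the walk sees only the original two, which is the situation that would leave a "dangling vnode" behind.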


# 1.297 30-Dec-2019 bluhm

In vcount() a safe loop over vnodes was committed to 4.4BSD in 1994.
This is not necessary as the loop is restarted after vgone(). Switch
to SLIST_FOREACH without _SAFE.
OK visa@


# 1.296 27-Dec-2019 bluhm

Convert the speclisth hash buckets into SLIST macros. This makes
the vnode alias code more readable.
OK visa@


# 1.295 26-Dec-2019 bluhm

Fix white spaces.


# 1.294 08-Dec-2019 mpi

Convert infinite sleeps to tsleep_nsec(9).

ok visa@, jca@


Revision tags: OPENBSD_6_6_BASE
# 1.293 26-Aug-2019 anton

When a thread tries to exclusively lock a vnode, the same thread must
ensure that any other thread currently trying to acquire the underlying
vnode lock has observed that the same vnode is about to be exclusively
locked. Such threads must then sleep until the exclusive lock has been
released and then try to acquire the lock again. Otherwise, exclusive
access to the vnode cannot be guaranteed.

Thanks to naddy@ and visa@ for testing; ok visa@

Reported-by: syzbot+374d0e7e2400004957f7@syzkaller.appspotmail.com


# 1.292 25-Jul-2019 cheloha

vinvalbuf(9): tsleep(9) -> tsleep_nsec(9); ok millert@


# 1.291 19-Jul-2019 cheloha

vwaitforio(9): tsleep(9) -> tsleep_nsec(9); ok visa@


# 1.290 28-Jun-2019 visa

Skip VFS barrier lock during normal operation to reduce overhead.
This removes a system-wide serialization point, which might help
finding timing-related bugs.

OK deraadt@ anton@


# 1.289 09-Jun-2019 beck

Add a temporary workaround to make removal of giant files better

mlarkin@ noticed we would freeze while removing enormous files because
of the amount of work done to invalidate buffers on unlink. This adds
a temporary workaround to ensure we give up the lock and yield while
doing this.

The longer term answer will be to move these buffers to another list
and not do the work here.

ok deraadt@


# 1.288 19-Apr-2019 visa

Add a subsystem lock for vfs_lockf.c. This enables calling lf_advlock()
and lf_purgelocks() without the kernel lock.

OK anton@ mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.287 02-Apr-2019 visa

Restrict which filesystems are available for swap. This rules out
obvious misconfigurations that cannot work.

OK mpi@ tedu@


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa
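
The idea in 1.286 can be sketched as a sticky error flag. Everything below is a hypothetical userland model (struct file_state, F_WRITE_ERROR, and both functions are invented for illustration), and it assumes the flag is never cleared, which the commit message does not specify.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define F_WRITE_ERROR	0x01	/* hypothetical flag: a past write failed */

struct file_state {		/* hypothetical stand-in for a vnode */
	int	flags;
	int	dirty;		/* number of buffered, unwritten blocks */
};

/*
 * Flush one buffer. On failure the buffer is still thrown away,
 * but the failure is remembered on the object.
 */
void
flush_one(struct file_state *fs, int write_ok)
{
	assert(fs != NULL);
	if (fs->dirty == 0)
		return;
	fs->dirty--;
	if (!write_ok)
		fs->flags |= F_WRITE_ERROR;
}

/* fsync-like call: report a remembered error even if nothing is dirty now. */
int
sync_state(const struct file_state *fs)
{
	assert(fs != NULL);
	if (fs->flags & F_WRITE_ERROR)
		return EIO;
	return 0;
}
```

Without the flag, sync_state() would find no dirty buffers after the failed write and report success, which is the lost-error case the commit describes.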


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@
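
A minimal userland sketch of an atomically updated unsigned reference count, in the spirit of 1.282. It uses C11 <stdatomic.h> rather than the kernel's own atomic primitives, and struct fstype and its helpers are hypothetical names, not the vfsconf code.

```c
#include <assert.h>
#include <stdatomic.h>

struct fstype {			/* hypothetical stand-in for struct vfsconf */
	atomic_uint	refcount;
};

/* Take a reference. */
void
fstype_ref(struct fstype *f)
{
	atomic_fetch_add(&f->refcount, 1);
}

/* Drop a reference; returns the count after the decrement. */
unsigned int
fstype_unref(struct fstype *f)
{
	unsigned int old = atomic_fetch_sub(&f->refcount, 1);

	assert(old > 0);	/* catch underflow of the unsigned counter */
	return old - 1;
}
```

Because each update is a single atomic read-modify-write, two threads adjusting the count concurrently cannot lose an increment or decrement, so no separate lock is needed for this field.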


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules no longer exist, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
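
A sketch of the general idiom behind 1.254, using a hypothetical enum and name table rather than the code touched by the commit. When a compiler treats an enum with no negative enumerators as unsigned, a check such as `k >= 0` is flagged as always true; casting to unsigned int and doing one upper-bound comparison avoids that and still rejects out-of-range values.

```c
#include <assert.h>
#include <stddef.h>

enum kind { KIND_A, KIND_B, KIND_C, KIND_MAX };	/* hypothetical enum */

static const char *const kind_names[KIND_MAX] = { "a", "b", "c" };

const char *
kind_name(enum kind k)
{
	/* One unsigned comparison rejects both negative and too-large values. */
	if ((unsigned int)k >= KIND_MAX)
		return "unknown";
	/* Catch a table that was not extended when the enum grew. */
	assert(kind_names[k] != NULL);
	return kind_names[k];
}
```

A negative value converted to unsigned int becomes a very large number, so it fails the same upper-bound test as any other out-of-range index.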


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb
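
A userland sketch of a separately allocated clone bitmap, assuming a simple
lowest-free-bit policy. Only the limit of 1024 comes from the message above;
the demo_* names and the allocation policy are hypothetical and do not
reproduce the kernel implementation.

```c
#include <limits.h>
#include <stdlib.h>

#define DEMO_NCLONES	1024	/* limit named in the commit message */
#define DEMO_WORDBITS	(sizeof(unsigned long) * CHAR_BIT)

/* Allocate a zeroed bitmap on its own, so a device node holds only a pointer. */
unsigned long *
demo_clonemap_alloc(void)
{
	return calloc(DEMO_NCLONES / DEMO_WORDBITS, sizeof(unsigned long));
}

/* Claim the lowest free instance number, or return -1 when all are in use. */
int
demo_clone_get(unsigned long *map)
{
	size_t i;

	for (i = 0; i < DEMO_NCLONES; i++) {
		unsigned long bit = 1UL << (i % DEMO_WORDBITS);

		if ((map[i / DEMO_WORDBITS] & bit) == 0) {
			map[i / DEMO_WORDBITS] |= bit;
			return (int)i;
		}
	}
	return -1;
}

/* Release an instance number so it can be handed out again. */
void
demo_clone_put(unsigned long *map, int n)
{
	map[n / DEMO_WORDBITS] &= ~(1UL << (n % DEMO_WORDBITS));
}
```

If aliased vnodes point at the same map, as in revision 1.244, an instance
number claimed through one alias is already marked as taken for the others.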


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
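
A small sketch of why chained assignments are fragile, assuming a
hypothetical two-member struct rather than the real struct vattr. In
a = b = c the value stored in a is the value of b after conversion to b's
type, so a narrower or unsigned b changes what a receives.

```c
#include <limits.h>

#define DEMO_VNOVAL	(-1)	/* stand-in for the kernel's "no value" marker */

/* Hypothetical struct with members of differing width and signedness. */
struct demo_attr {
	long long	da_size;	/* wide, signed */
	unsigned short	da_mode;	/* narrow, unsigned */
};

/* Chained form: da_size receives the value already converted to unsigned short. */
void
demo_null_chained(struct demo_attr *da)
{
	da->da_size = da->da_mode = DEMO_VNOVAL;
}

/* One assignment per member: each gets DEMO_VNOVAL converted to its own type. */
void
demo_null_separate(struct demo_attr *da)
{
	da->da_mode = DEMO_VNOVAL;
	da->da_size = DEMO_VNOVAL;
}
```

With the chained form da_size ends up as USHRT_MAX instead of -1; with
separate statements the order and types of the members no longer matter.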


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in byte, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and dies.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continuing performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@
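
A sketch of the check this message describes, assuming a simplified
credential struct (struct demo_cred and DEMO_NGROUPS are hypothetical, not
the kernel's struct ucred). The effective gid is tested first, then the
supplementary group list.

```c
#include <sys/types.h>

#define DEMO_NGROUPS	16	/* hypothetical cap on supplementary groups */

/* Simplified credential: effective gid plus supplementary groups. */
struct demo_cred {
	gid_t	cr_gid;
	int	cr_ngroups;
	gid_t	cr_groups[DEMO_NGROUPS];
};

/* Return 1 if gid is the effective gid or a supplementary group, else 0. */
int
demo_groupmember(gid_t gid, const struct demo_cred *cred)
{
	int i;

	if (cred->cr_gid == gid)
		return 1;
	for (i = 0; i < cred->cr_ngroups; i++)
		if (cred->cr_groups[i] == gid)
			return 1;
	return 0;
}
```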


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the "buf_acquire() does the bremfree()" change, made since all
callers were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@
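
A sketch of the rule stated above for the superuser case, assuming a
hypothetical helper that takes the permission bits and a directory flag. It
is an illustration of the described semantics, not the kernel's access check.

```c
/* Any of the owner, group, or other execute bits. */
#define DEMO_ANY_EXEC	0111

/*
 * Decide whether an X_OK query by the superuser should succeed.
 * Directories always pass; other files need at least one execute bit.
 */
int
demo_root_x_ok(unsigned int mode, int is_dir)
{
	if (is_dir)
		return 1;
	return (mode & DEMO_ANY_EXEC) != 0;
}
```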


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8 code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow pulling out a usb stick with ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@
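
The mismatch described above can be illustrated with a small userland
sketch. The record layout and the names below (struct demo_rec,
packed_size, export_rec, import_rec) are hypothetical and are not the
structure actually shared by the kernel and pstat(8). The sketch only
shows that a hand-packed byte stream and a compiler-laid-out struct can
disagree on size and offsets, while a single shared definition keeps
producer and consumer in agreement.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record: a one-byte tag followed by a pointer-sized handle. */
struct demo_rec {
	char	 tag;		/* 1 byte, usually followed by padding */
	void	*handle;	/* typically aligned to sizeof(void *) */
};

/* Size a producer would emit if it packed the fields back to back. */
static size_t
packed_size(void)
{
	return sizeof(char) + sizeof(void *);
}

/* Producer side: fill the shared struct and copy it out as one unit. */
static void
export_rec(void *buf, char tag, void *handle)
{
	struct demo_rec r;

	memset(&r, 0, sizeof(r));	/* do not leak padding bytes */
	r.tag = tag;
	r.handle = handle;
	memcpy(buf, &r, sizeof(r));
}

/* Consumer side: read the record back through the same definition. */
static struct demo_rec
import_rec(const void *buf)
{
	struct demo_rec r;

	memcpy(&r, buf, sizeof(r));
	return r;
}
```

On most ABIs sizeof(struct demo_rec) is larger than packed_size()
because of alignment padding, which is the kind of disagreement a
consumer sees when it overlays a C struct on a packed stream.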


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks of intense debugging and we need -current
to be stable, back out everything to the state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into its own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing lengths checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from our (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches from McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a file's owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.309 21-Oct-2021 claudio

Move vfs_stall_barrier() from the fd layer into vn_lock() and the vfs layer.
vfs stalling is used by suspend/resume and by vmt(4) to stall any
filesystem operation from altering the state on disk. All these
operations will call vn_lock and be stalled. Adjust vfs_stall_barrier()
to allow the lock owner to still progress so that suspend can sync
the filesystems after stalling vfs operation.
OK mpi@


# 1.308 20-Oct-2021 semarie

revert vnode: remove VLOCKSWORK and check locking when vop_islocked != nullop
(both kernel and userland bits)

GENERIC + VFSLCKDEBUG is broken with it.


# 1.307 19-Oct-2021 semarie

vnode: remove VLOCKSWORK and check locking when vop_islocked != nullop

This flag is currently used to mark or unmark a vnode to actively
check vnode locking semantics (when compiled with VFSLCKDEBUG).

Currently, the VLOCKSWORK flag isn't properly set for several FS
implementations which have full locking support. This commit enables
proper checking for them too (cd9660, udf, fuse, msdosfs, tmpfs).

Instead of using a particular flag, it directly checks whether
v_op->vop_islocked is nullop or not to activate or not the vnode
locking checks.

ok mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.306 31-Aug-2021 claudio

Swap lock flags so that LK_EXCLUSIVE is first like in all other places.


# 1.305 28-Apr-2021 claudio

Introduce a global vnode_mtx and use it to make vn_lock() safe to be called
without the KERNEL_LOCK.
This moves VXLOCK and VXWANT to a mutex protected v_lflag field and also
v_lockcount is protected by this mutex.

The vn_lock() dance is overly complex and all of this should probably be replaced
by a proper lock on the vnode but such a diff is a lot more complex. This
is an intermediate step so that at least some calls can be modified to grab
the KERNEL_LOCK later or not at all.

OK mpi@


Revision tags: OPENBSD_6_9_BASE
# 1.304 29-Jan-2021 claudio

Use NULL instead of 0 to clear v_socket pointer (which actually clears all
of the v_un pointers).
OK jsg@ mvs@


Revision tags: OPENBSD_6_8_BASE
# 1.303 23-Aug-2020 kn

Remove unused debug_syncprt, improve debug sysctl handling

"syncprt" is unused since kern/vfs_syscalls.c r1.147 from 2008.

Adding new debug sysctls is a bit opaque and looking at kern/kern_sysctl.c
the only visible difference between used and stub ctldebug structs in the
debugvars[] array is their extern keyword, indicating that it is defined
elsewhere.

sys/sysctl.h declares all debugN members as extern upfront, but these
declarations are not needed.

Remove the unused debug sysctl, rename the only remaining one to something
meaningful and remove forward declarations from /sys/sysctl.h; this way,
adding new debug sysctls is a matter of adding extern and coming up with a
name, which is nicer to read on its own and better to grep for.

OK mpi


# 1.302 22-Aug-2020 kn

Move sysctl(2) CTL_DEBUG from DEBUG to new DEBUG_SYSCTL

Adding "debug.my-knob" sysctls is really helpful to select different
code paths and/or log on demand during runtime without recompile,
but as this code is under DEBUG, lots of other noise comes with it
which is often undesired, at least when looking at specific subsystems
only.

Adding globals to the kernel and breaking into DDB to change them helps,
but that does not work over SSH, hence the need for debug sysctls.

Introduces DEBUG_SYSCTL to make use of the "debug" MIB without the rest of
DEBUG; it's DEBUG_SYSCTL and not SYSCTL_DEBUG because it's not a general
option for all of sysctl(2).

OK gnezdo


Revision tags: OPENBSD_6_7_BASE
# 1.301 27-Mar-2020 anton

Relax the lockcount assertion in vputonfreelist(). Back when I fixed
several problems with the vnode exclusive lock implementation, I
overlooked the fact that a vnode can be in a state where the usecount is
zero while the holdcount is still positive. There could still be
threads waiting on the vnode lock in uvn_io() as long as the holdcount
is positive.

"go ahead" mpi@

Reported-by: syzbot+767d6deb1a647850a0ca@syzkaller.appspotmail.com


# 1.300 13-Feb-2020 claudio

Move the LK_DRAIN logic from VOP_LOCK() to vclean() the only caller of
VOP_LOCK with LK_DRAIN. This simplifies VOP_LOCK() a fair bit.
OK visa@


# 1.299 20-Jan-2020 claudio

struct vops is not modified during runtime so use const which moves each
into read-only data segment.
OK deraadt@ tedu@


# 1.298 10-Jan-2020 bluhm

Convert the vnode list at the mount point into a tailq. During
unmount this list is traversed and the dirty vnodes are flushed to
disk. Forced unmount expects that the list is empty after flushing,
otherwise the kernel panics with "dangling vnode". As the write
to disk can sleep, new vnodes may be inserted. If softdep is
enabled, resolving the dependencies creates new dirty vnodes and
inserts them to the list. To fix the panic, let insmntque() insert
new vnodes at the tail of the list. Then vflush() will still catch
them while traversing the list in forward direction.
OK tedu@ millert@ visa@
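
The effect of tail insertion during a forward traversal can be sketched
in userland with the queue(3) TAILQ macros. Everything below
(struct demo_vnode, demo_insert, demo_flush, the spawns counter) is a
hypothetical model, not the kernel's insmntque() or vflush(); head
insertion is included only as a contrast case.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/queue.h>

struct demo_vnode {
	int	dirty;		/* still needs flushing */
	int	spawns;		/* flushing this node creates this many new nodes */
	TAILQ_ENTRY(demo_vnode) entries;
};
TAILQ_HEAD(demo_list, demo_vnode);

/* Add a new dirty node, either at the tail or at the head of the list. */
static struct demo_vnode *
demo_insert(struct demo_list *head, int spawns, int at_tail)
{
	struct demo_vnode *vp = calloc(1, sizeof(*vp));

	if (vp == NULL)
		abort();
	vp->dirty = 1;
	vp->spawns = spawns;
	if (at_tail)
		TAILQ_INSERT_TAIL(head, vp, entries);
	else
		TAILQ_INSERT_HEAD(head, vp, entries);
	return vp;
}

/*
 * Flush every node in forward direction.  Flushing may create new dirty
 * nodes.  Returns how many nodes are still dirty after a single pass.
 */
static int
demo_flush(struct demo_list *head, int at_tail)
{
	struct demo_vnode *vp;
	int left = 0;

	TAILQ_FOREACH(vp, head, entries) {
		vp->dirty = 0;
		while (vp->spawns > 0) {
			vp->spawns--;
			demo_insert(head, 0, at_tail);
		}
	}
	TAILQ_FOREACH(vp, head, entries)
		if (vp->dirty)
			left++;
	return left;
}
```

With tail insertion a single forward pass also reaches the nodes that
were created while flushing; with head insertion they end up behind the
cursor and stay dirty.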


# 1.297 30-Dec-2019 bluhm

In vcount() a safe loop over vnodes was committed to 4.4BSD in 1994.
This is not necessary as the loop is restarted after vgone(). Switch
to SLIST_FOREACH without _SAFE.
OK visa@


# 1.296 27-Dec-2019 bluhm

Convert the speclisth hash buckets into SLIST macros. This makes
the vnode alias code more readable.
OK visa@


# 1.295 26-Dec-2019 bluhm

Fix white spaces.


# 1.294 08-Dec-2019 mpi

Convert infinite sleeps to tsleep_nsec(9).

ok visa@, jca@


Revision tags: OPENBSD_6_6_BASE
# 1.293 26-Aug-2019 anton

When a thread tries to exclusively lock a vnode, the same thread must
ensure that any other thread currently trying to acquire the underlying
vnode lock has observed that the same vnode is about to be exclusively
locked. Such threads must then sleep until the exclusive lock has been
released and then try to acquire the lock again. Otherwise, exclusive
access to the vnode cannot be guaranteed.

Thanks to naddy@ and visa@ for testing; ok visa@

Reported-by: syzbot+374d0e7e2400004957f7@syzkaller.appspotmail.com


# 1.292 25-Jul-2019 cheloha

vinvalbuf(9): tsleep(9) -> tsleep_nsec(9); ok millert@


# 1.291 19-Jul-2019 cheloha

vwaitforio(9): tsleep(9) -> tsleep_nsec(9); ok visa@


# 1.290 28-Jun-2019 visa

Skip VFS barrier lock during normal operation to reduce overhead.
This removes a system-wide serialization point, which might help
finding timing-related bugs.

OK deraadt@ anton@


# 1.289 09-Jun-2019 beck

Add a temporary workaround to make removal of giant files better

mlarkin@ noticed we would freeze while removing enormous files because
of the amount of work done to invalidate buffers on unlink. This adds
a temporary workaround to ensure we give up the lock and yield while
doing this.

The longer term answer will be to move these buffers to another list
and not do the work here.

ok deraadt@


# 1.288 19-Apr-2019 visa

Add a subsystem lock for vfs_lockf.c. This enables calling lf_advlock()
and lf_purgelocks() without the kernel lock.

OK anton@ mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.287 02-Apr-2019 visa

Restrict which filesystems are available for swap. This rules out
obvious misconfigurations that cannot work.

OK mpi@ tedu@


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa
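
A minimal sketch of the sticky-error idea, assuming a hypothetical flag
name and helper functions (DEMO_BIOERROR, demo_write_done, demo_fsync)
rather than the kernel's real identifiers. Whether and when the real
flag is cleared is not stated in the commit message; this sketch never
clears it.

```c
#include <assert.h>
#include <errno.h>

#define DEMO_BIOERROR	0x01	/* hypothetical "a past write failed" flag */

struct demo_vnode {
	int	flags;
};

/* Called when a write completes; error is 0 on success. */
static void
demo_write_done(struct demo_vnode *vp, int error)
{
	if (error != 0)
		vp->flags |= DEMO_BIOERROR;	/* remember the failure */
	/* the buffer itself would be invalidated and released here */
}

/* A later fsync reports the remembered failure instead of success. */
static int
demo_fsync(struct demo_vnode *vp)
{
	if (vp->flags & DEMO_BIOERROR)
		return EIO;
	return 0;
}
```

Because the failure is recorded on the vnode rather than on the
discarded buffer, a successful write that happens afterwards does not
hide the earlier error from fsync.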


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules are no longer supported, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
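
One common shape of such a bounds check is sketched below. The enum,
the name table and demo_type_name() are hypothetical and do not
reproduce the code touched by this commit. Casting to an unsigned type
lets a single comparison reject both too-large and negative values,
and avoids a separate "greater than or equal to zero" test that a
compiler may flag as always true.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum demo_type { DEMO_NONE, DEMO_REG, DEMO_DIR };

static const char *const demo_names[] = { "none", "reg", "dir" };

#define DEMO_NITEMS	(sizeof(demo_names) / sizeof(demo_names[0]))

/* Map an enum value to its name, rejecting anything outside the table. */
static const char *
demo_type_name(enum demo_type t)
{
	/* after the cast, negative values become large and fail the check */
	if ((unsigned int)t >= DEMO_NITEMS)
		return "unknown";
	return demo_names[t];
}
```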


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb
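
A rough sketch of a 1024-entry instance bitmap, assuming a lowest-free
allocation policy. The names clone_map, clone_alloc and clone_free are
hypothetical, and the separate allocation of the bitmap mentioned above
is not shown.

```c
#include <limits.h>
#include <stddef.h>

#define CLONE_MAX	1024	/* limit named in the commit message */
#define WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)

struct clone_map {
	unsigned long bits[CLONE_MAX / WORD_BITS];
};

/* Claim the lowest free instance number, or return -1 if all are taken. */
int clone_alloc(struct clone_map *map)
{
	size_t i;

	for (i = 0; i < CLONE_MAX; i++) {
		unsigned long mask = 1UL << (i % WORD_BITS);

		if ((map->bits[i / WORD_BITS] & mask) == 0) {
			map->bits[i / WORD_BITS] |= mask;
			return (int)i;
		}
	}
	return -1;
}

/* Release an instance number so it can be handed out again. */
void clone_free(struct clone_map *map, int n)
{
	if (n < 0 || n >= CLONE_MAX)
		return;
	map->bits[n / WORD_BITS] &= ~(1UL << (n % WORD_BITS));
}
```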


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
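
A short illustration of why chained assignments are fragile, assuming a
hypothetical two-field struct (not the real struct vattr) and 8-bit
chars. In a chain the value is first converted to the type of the
right-most lvalue, and that converted value is what the next lvalue
receives.

```c
/* Hypothetical fields, for illustration only. */
struct attrs_sketch {
	long long size;		/* wide, signed */
	unsigned char flags;	/* narrow, unsigned */
};

/* Chained form: -1 becomes 255 in flags, and size then gets 255. */
void null_chained(struct attrs_sketch *a)
{
	a->size = a->flags = -1;
}

/* One statement per field: each field converts -1 to its own type. */
void null_separate(struct attrs_sketch *a)
{
	a->size = -1;
	a->flags = -1;
}
```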


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in bytes, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and dies.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@
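
A sketch of the kind of range check described, assuming a hypothetical
credential struct and a stand-in limit of 16. The real structure and
call site are not shown in this log.

```c
#include <errno.h>

#define NGROUPS_MAX_SKETCH	16	/* stand-in for NGROUPS_MAX */

/* Hypothetical user-supplied credential, for illustration only. */
struct xucred_sketch {
	unsigned int cr_uid;
	int cr_ngroups;
	unsigned int cr_groups[NGROUPS_MAX_SKETCH];
};

/* Reject group counts that would index outside cr_groups[]. */
int check_export_creds(const struct xucred_sketch *cr)
{
	if (cr->cr_ngroups < 0 || cr->cr_ngroups > NGROUPS_MAX_SKETCH)
		return EINVAL;
	return 0;
}
```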


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free
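
A userland sketch of the idiom, using calloc(3) in place of the kernel
allocator; the struct is hypothetical. Sizing the allocation from the
pointer keeps it correct if the pointer's type later changes.

```c
#include <stdlib.h>

/* Hypothetical structure, for illustration only. */
struct netcred_sketch {
	int exflags;
	int anon_uid;
};

struct netcred_sketch *alloc_cred(void)
{
	struct netcred_sketch *np;

	/* sizeof(*np) follows the declared type of np automatically. */
	np = calloc(1, sizeof(*np));
	return np;
}
```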


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too
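
A small example of the pattern, with a hypothetical helper name. Since
the width of time_t varies between platforms, the value is cast to
long long and printed with %lld.

```c
#include <stdio.h>
#include <time.h>

/* Format a time_t portably; returns what snprintf(3) returns. */
int format_time(char *buf, size_t len, time_t t)
{
	return snprintf(buf, len, "%lld", (long long)t);
}
```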


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continued performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__)))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@
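
A sketch of the check described, assuming a hypothetical credential
struct; this is not the kernel's groupmember() source. The effective gid
is tested first, then the supplementary group list.

```c
/* Hypothetical credential, for illustration only. */
struct ucred_sketch {
	unsigned int cr_gid;		/* effective gid */
	int cr_ngroups;			/* entries used in cr_groups[] */
	unsigned int cr_groups[16];	/* supplementary groups */
};

/* Return 1 if gid is the effective gid or a supplementary group. */
int groupmember_sketch(unsigned int gid, const struct ucred_sketch *cred)
{
	int i;

	if (cred->cr_gid == gid)
		return 1;
	for (i = 0; i < cred->cr_ngroups; i++) {
		if (cred->cr_groups[i] == gid)
			return 1;
	}
	return 0;
}
```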


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@
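
A sketch of the rule as stated above, for the superuser case only. The
function name and parameters are hypothetical, and the permission bits
are written out in octal rather than taken from a system header.

```c
/*
 * Would X_OK succeed for the superuser under the rule above?
 * is_dir: nonzero if the file is a directory.
 * perm:   permission bits, e.g. 0644.
 */
int root_x_ok(int is_dir, unsigned int perm)
{
	if (is_dir)
		return 1;
	/* Non-directories need at least one execute bit (u, g or o). */
	return (perm & 0111) != 0;
}
```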


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8 code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow to pull out a usb stick with ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks of intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into an own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing length checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from ours (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches from McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a file's owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.308 20-Oct-2021 semarie

revert vnode: remove VLOCKSWORK and check locking when vop_islocked != nullop
(both kernel and userland bits)

GENERIC + VFSLCKDEBUG is broken with it.


# 1.307 19-Oct-2021 semarie

vnode: remove VLOCKSWORK and check locking when vop_islocked != nullop

This flag is currently used to mark or unmark a vnode to actively
check vnode locking semantics (when compiled with VFSLCKDEBUG).

Currently, the VLOCKSWORK flag isn't properly set for several FS
implementations which have full locking support. This commit enables
proper checking for them too (cd9660, udf, fuse, msdosfs, tmpfs).

Instead of using a particular flag, it directly checks if
v_op->vop_islocked is nullop or not to activate or not the vnode
locking checks.

ok mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.306 31-Aug-2021 claudio

Swap lock flags so that LK_EXCLUSIVE is first like in all other places.


# 1.305 28-Apr-2021 claudio

Introduce a global vnode_mtx and use it to make vn_lock() safe to be called
without the KERNEL_LOCK.
This moves VXLOCK and VXWANT to a mutex protected v_lflag field and also
v_lockcount is protected by this mutex.

The vn_lock() dance is overly complex and all of this should probably be replaced
by a proper lock on the vnode but such a diff is a lot more complex. This
is an intermediate step so that at least some calls can be modified to grab
the KERNEL_LOCK later or not at all.

OK mpi@


Revision tags: OPENBSD_6_9_BASE
# 1.304 29-Jan-2021 claudio

Use NULL instead of 0 to clear v_socket pointer (which actually clears all
of the v_un pointers).
OK jsg@ mvs@


Revision tags: OPENBSD_6_8_BASE
# 1.303 23-Aug-2020 kn

Remove unused debug_syncprt, improve debug sysctl handling

"syncprt" is unused since kern/vfs_syscalls.c r1.147 from 2008.

Adding new debug sysctls is a bit opaque and looking at kern/kern_sysctl.c
the only visible difference between used and stub ctldebug structs in the
debugvars[] array is their extern keyword, indicating that it is defined
elsewhere.

sys/sysctl.h declares all debugN members as extern upfront, but these
declarations are not needed.

Remove the unused debug sysctl, rename the only remaining one to something
meaningful and remove forward declarations from /sys/sysctl.h; this way,
adding new debug sysctls is a matter of adding extern and coming up with a
name, which is nicer to read on its own and better to grep for.

OK mpi


# 1.302 22-Aug-2020 kn

Move sysctl(2) CTL_DEBUG from DEBUG to new DEBUG_SYSCTL

Adding "debug.my-knob" sysctls is really helpful to select different
code paths and/or log on demand during runtime without recompile,
but as this code is under DEBUG, lots of other noise comes with it
which is often undesired, at least when looking at specific subsystems
only.

Adding globals to the kernel and breaking into DDB to change them helps,
but that does not work over SSH, hence the need for debug sysctls.

Introduces DEBUG_SYSCTL to make use of the "debug" MIB without the rest of
DEBUG; it's DEBUG_SYSCTL and not SYSCTL_DEBUG because it's not a general
option for all of sysctl(2).

OK gnezdo


Revision tags: OPENBSD_6_7_BASE
# 1.301 27-Mar-2020 anton

Relax the lockcount assertion in vputonfreelist(). Back when I fixed
several problems with the vnode exclusive lock implementation, I
overlooked the fact that a vnode can be in a state where the usecount is
zero while the holdcount is still positive. There could still be
threads waiting on the vnode lock in uvn_io() as long as the holdcount
is positive.

"go ahead" mpi@

Reported-by: syzbot+767d6deb1a647850a0ca@syzkaller.appspotmail.com


# 1.300 13-Feb-2020 claudio

Move the LK_DRAIN logic from VOP_LOCK() to vclean() the only caller of
VOP_LOCK with LK_DRAIN. This simplifies VOP_LOCK() a fair bit.
OK visa@


# 1.299 20-Jan-2020 claudio

struct vops is not modified during runtime so use const which moves each
into read-only data segment.
OK deraadt@ tedu@


# 1.298 10-Jan-2020 bluhm

Convert the vnode list at the mount point into a tailq. During
unmount this list is traversed and the dirty vnodes are flushed to
disk. Forced unmount expects that the list is empty after flushing,
otherwise the kernel panics with "dangling vnode". As the write
to disk can sleep, new vnodes may be inserted. If softdep is
enabled, resolving the dependencies creates new dirty vnodes and
inserts them to the list. To fix the panic, let insmntque() insert
new vnodes at the tail of the list. Then vflush() will still catch
them while traversing the list in forward direction.
OK tedu@ millert@ visa@
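
The effect of inserting at the tail can be sketched in userland C with the
TAILQ macros from <sys/queue.h>. This is an illustration only, not the kernel
code: struct node, flush_all() and the rule that "flushing" node 1 creates one
new node are hypothetical stand-ins for vnodes, vflush() and softdep activity.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a vnode on a mount point's list. */
struct node {
	int id;
	TAILQ_ENTRY(node) entries;
};
TAILQ_HEAD(nodelist, node);

/*
 * Walk the list forward, "flushing" each node.  Flushing node 1 creates
 * one new node (id 99), inserted at the tail or at the head depending on
 * at_tail.  Returns how many nodes the walk visited, then frees the list.
 */
int
flush_all(int at_tail)
{
	struct nodelist head = TAILQ_HEAD_INITIALIZER(head);
	struct node *n, *extra;
	int i, visited = 0;

	for (i = 0; i < 3; i++) {
		n = malloc(sizeof(*n));
		assert(n != NULL);
		n->id = i;
		TAILQ_INSERT_TAIL(&head, n, entries);
	}

	TAILQ_FOREACH(n, &head, entries) {
		visited++;
		if (n->id == 1) {
			extra = malloc(sizeof(*extra));
			assert(extra != NULL);
			extra->id = 99;
			if (at_tail)
				TAILQ_INSERT_TAIL(&head, extra, entries);
			else
				TAILQ_INSERT_HEAD(&head, extra, entries);
		}
	}

	/* Release everything that is still on the list. */
	while ((n = TAILQ_FIRST(&head)) != NULL) {
		TAILQ_REMOVE(&head, n, entries);
		free(n);
	}
	return visited;
}
```

With tail insertion the forward walk reaches the node created mid-traversal;
with head insertion the new node lands behind the cursor and is missed, which
is the situation the commit message describes as a "dangling vnode".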


# 1.297 30-Dec-2019 bluhm

In vcount() a safe loop over vnodes was committed to 4.4BSD in 1994.
This is not necessary as the loop is restarted after vgone(). Switch
to SLIST_FOREACH without _SAFE.
OK visa@


# 1.296 27-Dec-2019 bluhm

Convert the speclisth hash buckets into SLIST macros. This makes
the vnode alias code more readable.
OK visa@


# 1.295 26-Dec-2019 bluhm

Fix white spaces.


# 1.294 08-Dec-2019 mpi

Convert infinite sleeps to tsleep_nsec(9).

ok visa@, jca@


Revision tags: OPENBSD_6_6_BASE
# 1.293 26-Aug-2019 anton

When a thread tries to exclusively lock a vnode, the same thread must
ensure that any other thread currently trying to acquire the underlying
vnode lock has observed that the same vnode is about to be exclusively
locked. Such threads must then sleep until the exclusive lock has been
released and then try to acquire the lock again. Otherwise, exclusive
access to the vnode cannot be guaranteed.

Thanks to naddy@ and visa@ for testing; ok visa@

Reported-by: syzbot+374d0e7e2400004957f7@syzkaller.appspotmail.com


# 1.292 25-Jul-2019 cheloha

vinvalbuf(9): tsleep(9) -> tsleep_nsec(9); ok millert@


# 1.291 19-Jul-2019 cheloha

vwaitforio(9): tsleep(9) -> tsleep_nsec(9); ok visa@


# 1.290 28-Jun-2019 visa

Skip VFS barrier lock during normal operation to reduce overhead.
This removes a system-wide serialization point, which might help
finding timing-related bugs.

OK deraadt@ anton@


# 1.289 09-Jun-2019 beck

Add a temporary workaround to make removal of giant files better

mlarkin@ noticed we would freeze while removing enormous files because
of the amount of work done to invalidate buffers on unlink. This adds
a temporary workaround to ensure we give up the lock and yield while
doing this.

The longer term answer will be to move these buffers to another list
and not do the work here.

ok deraadt@


# 1.288 19-Apr-2019 visa

Add a subsystem lock for vfs_lockf.c. This enables calling lf_advlock()
and lf_purgelocks() without the kernel lock.

OK anton@ mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.287 02-Apr-2019 visa

Restrict which filesystems are available for swap. This rules out
obvious misconfigurations that cannot work.

OK mpi@ tedu@


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa
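
A minimal sketch of the idea, assuming a single sticky flag bit. The names
sk_vnode, SK_VERROR, sk_write_done() and sk_fsync() are hypothetical and are
not the identifiers used in the kernel; the choice of EIO and the fact that
the flag is never cleared are also assumptions of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SK_VERROR	0x01	/* hypothetical: a past write failed */

/* Hypothetical, heavily reduced stand-in for struct vnode. */
struct sk_vnode {
	int flags;
};

/* Called when a write completes; remembers a failure on the vnode. */
void
sk_write_done(struct sk_vnode *vp, int error)
{
	assert(vp != NULL);
	if (error != 0)
		vp->flags |= SK_VERROR;
}

/* Reports the remembered failure even though the buffer is long gone. */
int
sk_fsync(struct sk_vnode *vp)
{
	assert(vp != NULL);
	return (vp->flags & SK_VERROR) ? EIO : 0;
}
```

Because the failure is recorded on the vnode rather than on the discarded
buffer, a later sk_fsync() call still returns an error instead of success.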


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules are no longer supported, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@
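
The guard described in 1.249 can be sketched as follows. The structure and function names are hypothetical and cut down for illustration; this is not the real struct vfsops, and the error code returned for a missing handler is an assumption of this sketch.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, cut-down operations table; not the real struct vfsops. */
struct fs_ops {
	int (*fs_sysctl)(int name, int *valuep);	/* optional, may be NULL */
};

/* Dispatch to the filesystem's handler only if it provides one. */
int
fs_sysctl_dispatch(const struct fs_ops *ops, int name, int *valuep)
{
	if (ops == NULL || ops->fs_sysctl == NULL)
		return EOPNOTSUPP;	/* assumed error code for this sketch */
	return ops->fs_sysctl(name, valuep);
}

/* Example handler used to exercise the dispatch path. */
int
demo_sysctl(int name, int *valuep)
{
	*valuep = name * 2;
	return 0;
}
```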


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb
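
A rough userland sketch of a separately allocated clone bitmap, as described above. Only the limit of 1024 comes from the commit message; the structure, field and function names are hypothetical, and the kernel code differs in allocation and locking.

```c
#include <limits.h>
#include <stdlib.h>

#define CLONE_MAX	1024	/* limit quoted in the commit message */
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define CLONE_WORDS	((CLONE_MAX + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Hypothetical device node: it carries only a pointer to the bitmap. */
struct dev_node {
	unsigned long *clone_map;	/* NULL until the first clone */
};

/* Claim the lowest free instance number, or return -1 if none is left. */
int
clone_alloc(struct dev_node *dn)
{
	size_t i;

	if (dn->clone_map == NULL) {
		/* Allocated on demand so plain device nodes stay small. */
		dn->clone_map = calloc(CLONE_WORDS, sizeof(*dn->clone_map));
		if (dn->clone_map == NULL)
			return -1;
	}
	for (i = 0; i < CLONE_MAX; i++) {
		unsigned long bit = 1UL << (i % BITS_PER_WORD);

		if ((dn->clone_map[i / BITS_PER_WORD] & bit) == 0) {
			dn->clone_map[i / BITS_PER_WORD] |= bit;
			return (int)i;
		}
	}
	return -1;
}

/* Release an instance number so it can be handed out again. */
void
clone_free(struct dev_node *dn, int n)
{
	if (dn->clone_map != NULL && n >= 0 && n < CLONE_MAX)
		dn->clone_map[n / BITS_PER_WORD] &=
		    ~(1UL << (n % BITS_PER_WORD));
}
```

Keeping only a pointer in the node is what avoids growing every device node by 128 bytes of bitmap.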


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
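
To illustrate why chained assignments across mixed types are fragile, here is a small standalone sketch. It is not the vattr_null() code; the variable names and types are chosen only to make the effect visible.

```c
#include <stdint.h>

/* Chained: the value passed along is that of `narrow` after conversion. */
int64_t
chained_assign(void)
{
	int64_t wide;
	uint16_t narrow;

	wide = narrow = -1;	/* narrow becomes 65535, and so does wide */
	return wide;
}

/* Separate statements: each lvalue is converted from the original value. */
int64_t
separate_assign(void)
{
	int64_t wide;
	uint16_t narrow;

	narrow = -1;
	wide = -1;
	(void)narrow;
	return wide;
}
```

With the chain, the wide variable ends up as 65535 instead of -1, because the value of an assignment expression is the value stored in its left operand.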


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in bytes, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and dies.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@
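
The range check described in 1.229 might look roughly like this. The structure is a hypothetical stand-in for the credentials copied in from userland, and the fallback definition of NGROUPS_MAX exists only so the sketch builds on systems whose limits.h does not provide it.

```c
#include <errno.h>
#include <limits.h>

#ifndef NGROUPS_MAX
#define NGROUPS_MAX 16	/* fallback so the sketch builds everywhere */
#endif

/* Hypothetical stand-in for the credentials passed in from userland. */
struct export_cred {
	int ngroups;
};

/* Reject group counts that would index outside the groups array. */
int
export_cred_check(const struct export_cred *cr)
{
	if (cr->ngroups < 0 || cr->ngroups > NGROUPS_MAX)
		return EINVAL;
	return 0;
}
```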


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free
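
A short userland sketch of the sizeof(*ptr) idiom, using calloc(3) in place of the kernel malloc(9). The structure and function names are hypothetical.

```c
#include <stdlib.h>

struct mount_entry {	/* hypothetical record type */
	int id;
	char name[32];
};

struct mount_entry *
mount_entry_new(int id)
{
	struct mount_entry *me;

	/*
	 * The size follows the pointer's type, so retyping `me` cannot
	 * leave a stale sizeof(struct ...) behind.
	 */
	me = calloc(1, sizeof(*me));
	if (me != NULL)
		me->id = id;
	return me;
}
```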


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too
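
The format string fix in 1.207 follows a common portable pattern, sketched here in userland with snprintf(3). The function name is hypothetical.

```c
#include <stdio.h>
#include <time.h>

/*
 * Format a time_t portably: its width differs between platforms, so
 * cast to long long and print with %lld. Returns snprintf's result.
 */
int
format_time(char *buf, size_t len, time_t t)
{
	return snprintf(buf, len, "%lld", (long long)t);
}
```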


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continuing performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__)))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@
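
The behaviour described in 1.199 can be sketched as below. The credentials structure and function name are hypothetical and much simpler than the kernel's struct ucred; the sketch only shows the effective gid being checked before the supplementary list.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical credentials: effective gid plus supplementary groups. */
struct cred {
	gid_t egid;
	size_t ngroups;
	const gid_t *groups;
};

/* Return 1 if gid is the effective gid or a supplementary group. */
int
cred_groupmember(gid_t gid, const struct cred *cr)
{
	size_t i;

	if (cr->egid == gid)
		return 1;
	for (i = 0; i < cr->ngroups; i++)
		if (cr->groups[i] == gid)
			return 1;
	return 0;
}
```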


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@
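
For reference, an element-count macro of the nitems() kind is usually written as below. The definition here is an assumption for illustration, guarded so that it does not clash with a system header that already provides nitems(); the array and function are hypothetical.

```c
#include <stddef.h>

#ifndef nitems
#define nitems(a)	(sizeof((a)) / sizeof((a)[0]))
#endif

static const char *const type_names[] = { "none", "reg", "dir", "blk" };

/* Number of entries in the table, computed at compile time. */
size_t
type_name_count(void)
{
	return nitems(type_names);
}
```

The macro only works on true arrays; applied to a pointer it silently yields the wrong count.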


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8 code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace an hard coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow to pull out an usb stick with ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into an own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing length checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from our (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches from McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a files owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.307 19-Oct-2021 semarie

vnode: remove VLOCKSWORK and check locking when vop_islocked != nullop

This flag is currently used to mark or unmark a vnode to actively
check vnode locking semantics (when compiled with VFSLCKDEBUG).

Currently, the VLOCKSWORK flag isn't properly set for several FS
implementations which have full locking support. This commit enables
proper checking for them too (cd9660, udf, fuse, msdosfs, tmpfs).

Instead of using a particular flag, it directly checks whether
v_op->vop_islocked is nullop to decide whether to activate the vnode
locking checks.

ok mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.306 31-Aug-2021 claudio

Swap lock flags so that LK_EXCLUSIVE is first like in all other places.


# 1.305 28-Apr-2021 claudio

Introduce a global vnode_mtx and use it to make vn_lock() safe to be called
without the KERNEL_LOCK.
This moves VXLOCK and VXWANT to a mutex protected v_lflag field and also
v_lockcount is protected by this mutex.

The vn_lock() dance is overly complex and all of this should probably be replaced
by a proper lock on the vnode but such a diff is a lot more complex. This
is an intermediate step so that at least some calls can be modified to grab
the KERNEL_LOCK later or not at all.

OK mpi@


Revision tags: OPENBSD_6_9_BASE
# 1.304 29-Jan-2021 claudio

Use NULL instead of 0 to clear v_socket pointer (which actually clears all
of the v_un pointers).
OK jsg@ mvs@


Revision tags: OPENBSD_6_8_BASE
# 1.303 23-Aug-2020 kn

Remove unused debug_syncprt, improve debug sysctl handling

"syncprt" is unused since kern/vfs_syscalls.c r1.147 from 2008.

Adding new debug sysctls is a bit opaque and looking at kern/kern_sysctl.c
the only visible difference between used and stub ctldebug structs in the
debugvars[] array is their extern keyword, indicating that it is defined
elsewhere.

sys/sysctl.h declares all debugN members as extern upfront, but these
declarations are not needed.

Remove the unused debug sysctl, rename the only remaining one to something
meaningful and remove forward declarations from /sys/sysctl.h; this way,
adding new debug sysctls is a matter of adding extern and coming up with a
name, which is nicer to read on its own and better to grep for.

OK mpi


# 1.302 22-Aug-2020 kn

Move sysctl(2) CTL_DEBUG from DEBUG to new DEBUG_SYSCTL

Adding "debug.my-knob" sysctls is really helpful to select different
code paths and/or log on demand during runtime without recompile,
but as this code is under DEBUG, lots of other noise comes with it
which is often undesired, at least when looking at specific subsystems
only.

Adding globals to the kernel and breaking into DDB to change them helps,
but that does not work over SSH, hence the need for debug sysctls.

Introduces DEBUG_SYSCTL to make use of the "debug" MIB without the rest of
DEBUG; it's DEBUG_SYSCTL and not SYSCTL_DEBUG because it's not a general
option for all of sysctl(2).

OK gnezdo


Revision tags: OPENBSD_6_7_BASE
# 1.301 27-Mar-2020 anton

Relax the lockcount assertion in vputonfreelist(). Back when I fixed
several problems with the vnode exclusive lock implementation, I
overlooked the fact that a vnode can be in a state where the usecount is
zero while the holdcount is still positive. There could still be
threads waiting on the vnode lock in uvn_io() as long as the holdcount
is positive.

"go ahead" mpi@

Reported-by: syzbot+767d6deb1a647850a0ca@syzkaller.appspotmail.com


# 1.300 13-Feb-2020 claudio

Move the LK_DRAIN logic from VOP_LOCK() to vclean() the only caller of
VOP_LOCK with LK_DRAIN. This simplifies VOP_LOCK() a fair bit.
OK visa@


# 1.299 20-Jan-2020 claudio

struct vops is not modified during runtime so use const which moves each
into read-only data segment.
OK deraadt@ tedu@


# 1.298 10-Jan-2020 bluhm

Convert the vnode list at the mount point into a tailq. During
unmount this list is traversed and the dirty vnodes are flushed to
disk. Forced unmount expects that the list is empty after flushing,
otherwise the kernel panics with "dangling vnode". As the write
to disk can sleep, new vnodes may be inserted. If softdep is
enabled, resolving the dependencies creates new dirty vnodes and
inserts them to the list. To fix the panic, let insmntque() insert
new vnodes at the tail of the list. Then vflush() will still catch
them while traversing the list in forward direction.
OK tedu@ millert@ visa@


# 1.297 30-Dec-2019 bluhm

In vcount() a safe loop over vnodes was committed to 4.4BSD in 1994.
This is not necessary as the loop is restarted after vgone(). Switch
to SLIST_FOREACH without _SAFE.
OK visa@


# 1.296 27-Dec-2019 bluhm

Convert the speclisth hash buckets into SLIST macros. This makes
the vnode alias code more readable.
OK visa@


# 1.295 26-Dec-2019 bluhm

Fix white spaces.


# 1.294 08-Dec-2019 mpi

Convert infinite sleeps to tsleep_nsec(9).

ok visa@, jca@


Revision tags: OPENBSD_6_6_BASE
# 1.293 26-Aug-2019 anton

When a thread tries to exclusively lock a vnode, the same thread must
ensure that any other thread currently trying to acquire the underlying
vnode lock has observed that the same vnode is about to be exclusively
locked. Such threads must then sleep until the exclusive lock has been
released and then try to acquire the lock again. Otherwise, exclusive
access to the vnode cannot be guaranteed.

Thanks to naddy@ and visa@ for testing; ok visa@

Reported-by: syzbot+374d0e7e2400004957f7@syzkaller.appspotmail.com


# 1.292 25-Jul-2019 cheloha

vinvalbuf(9): tsleep(9) -> tsleep_nsec(9); ok millert@


# 1.291 19-Jul-2019 cheloha

vwaitforio(9): tsleep(9) -> tsleep_nsec(9); ok visa@


# 1.290 28-Jun-2019 visa

Skip VFS barrier lock during normal operation to reduce overhead.
This removes a system-wide serialization point, which might help
finding timing-related bugs.

OK deraadt@ anton@


# 1.289 09-Jun-2019 beck

Add a temporary workaround to make removal of giant files better

mlarkin@ noticed we would freeze while removing enormous files because
of the amount of work done to invalidate buffers on unlink. This adds
a temporary workaround to ensure we give up the lock and yield while
doing this.

The longer term answer will be to move these buffers to another list
and not do the work here.

ok deraadt@


# 1.288 19-Apr-2019 visa

Add a subsystem lock for vfs_lockf.c. This enables calling lf_advlock()
and lf_purgelocks() without the kernel lock.

OK anton@ mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.287 02-Apr-2019 visa

Restrict which filesystems are available for swap. This rules out
obvious misconfigurations that cannot work.

OK mpi@ tedu@


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules are no longer supported, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@
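
The hazard described in this entry can be illustrated with a small
userland sketch. It assumes a hand-rolled singly linked list rather than
the kernel's queue macros, and struct node, remove_after() and both walk
functions are hypothetical names. Removed nodes are only flagged, not
freed, so the demonstration stays well defined.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a mount point on a singly linked list. */
struct node {
	struct node *next;
	int removed;	/* set instead of freeing the node */
};

/* Simulates another process removing prev's successor while we sleep. */
static void
remove_after(struct node *prev)
{
	struct node *victim = prev->next;

	assert(prev != NULL);
	if (victim != NULL) {
		prev->next = victim->next;
		victim->removed = 1;
	}
}

/*
 * Walk the list the way a _SAFE macro does: fetch the next pointer
 * before running the body. Returns how many visited nodes had already
 * been removed from the list.
 */
int
walk_caching_next(struct node *head)
{
	struct node *n, *nn;
	int stale = 0;

	for (n = head; n != NULL; n = nn) {
		nn = n->next;			/* cached before the "sleep" */
		if (n->removed)
			stale++;
		else if (n == head)
			remove_after(n);	/* successor goes away meanwhile */
	}
	return stale;
}

/* Same walk, but the next pointer is read only after the body has run. */
int
walk_rereading_next(struct node *head)
{
	struct node *n;
	int stale = 0;

	for (n = head; n != NULL; n = n->next) {
		if (n->removed)
			stale++;
		else if (n == head)
			remove_after(n);
	}
	return stale;
}
```

With a three-element list, the caching walk visits the removed element
once, while the re-reading walk never does. In a kernel the stale element
could already have been freed, which is the race the entry refers to.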


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
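
A minimal userland sketch of the pattern named in this entry. The enum
and table below are hypothetical and do not reproduce the kernel code.
Casting to an unsigned type lets one comparison reject both negative and
too-large values, without a separate lower-bound test that a compiler may
flag as always true.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical enum standing in for a kernel enumeration. */
enum kind { KIND_NONE, KIND_FILE, KIND_DIR, KIND_MAX };

static const char *const kind_names[KIND_MAX] = { "none", "file", "dir" };

/* Return the name for k, or NULL if k is out of range. */
const char *
kind_name(enum kind k)
{
	/* A negative k becomes a huge unsigned value and fails this test. */
	if ((unsigned int)k >= KIND_MAX)
		return NULL;
	assert(kind_names[k] != NULL);
	return kind_names[k];
}
```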


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb
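
A rough sketch of a separately allocated clone bitmap, assuming a
hypothetical struct dev_demo and allocation on first use (the commit
only states that the bitmap is allocated separately; the lazy
allocation and all names here are assumptions). Only the 1024 limit
comes from the message above.

```c
#include <limits.h>
#include <stdlib.h>

#define DEMO_CLONE_MAX	1024	/* limit named in the commit message */
#define DEMO_WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical device node: it carries a pointer, not the bitmap itself. */
struct dev_demo {
	unsigned long *clonemap;
};

/* Hand out the lowest free clone number, or -1 if none is left. */
int
clone_get(struct dev_demo *d)
{
	size_t i;

	if (d->clonemap == NULL) {
		/* Assumption: allocate lazily, so unused nodes pay nothing. */
		d->clonemap = calloc(DEMO_CLONE_MAX / DEMO_WORD_BITS,
		    sizeof(unsigned long));
		if (d->clonemap == NULL)
			return -1;
	}
	for (i = 0; i < DEMO_CLONE_MAX; i++) {
		unsigned long bit = 1UL << (i % DEMO_WORD_BITS);

		if ((d->clonemap[i / DEMO_WORD_BITS] & bit) == 0) {
			d->clonemap[i / DEMO_WORD_BITS] |= bit;
			return (int)i;
		}
	}
	return -1;
}

/* Count how many clone numbers one node can hand out. */
int
demo_clone_count(void)
{
	struct dev_demo d = { NULL };
	int n = 0;

	while (clone_get(&d) != -1)
		n++;
	free(d.clonemap);
	return n;
}
```

Keeping only a pointer in the node means nodes that never clone stay
small, and two nodes could point at one bitmap if they needed to share
instance numbers.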


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
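
To illustrate why chained assignments across mixed types are fragile,
here is a minimal sketch with a hypothetical two-member struct
(attr_demo and DEMO_NOVAL are illustrative names, not the kernel's
struct vattr). In the chained form the wider member receives the
already truncated value of the narrower one.

```c
#include <stdint.h>

#define DEMO_NOVAL	(-1)	/* hypothetical "no value" marker */

struct attr_demo {
	int64_t  size;
	uint16_t mode;
};

/* Chained: size receives the value of mode after truncation. */
int64_t
chained_size(void)
{
	struct attr_demo a;

	a.size = a.mode = DEMO_NOVAL;	/* mode is 0xffff, size is 65535 */
	return a.size;
}

/* Separate statements: each member is converted from -1 on its own. */
int64_t
separate_size(void)
{
	struct attr_demo a;

	a.mode = DEMO_NOVAL;
	a.size = DEMO_NOVAL;		/* size is -1 */
	return a.size;
}
```

Swapping the two members in the chained statement would happen to give
the intended result, which is exactly the order dependence the commit
message describes.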


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in bytes, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and dies.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@
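
A hedged sketch of this kind of validation, using hypothetical
structures (xcred_demo, cred_demo and DEMO_NGROUPS_MAX stand in for the
real credential types and NGROUPS_MAX). The count is checked before it
is used as a copy length.

```c
#include <errno.h>
#include <string.h>

#define DEMO_NGROUPS_MAX 16	/* hypothetical, stands in for NGROUPS_MAX */

struct xcred_demo {		/* as supplied by the caller */
	int ngroups;
	unsigned int groups[DEMO_NGROUPS_MAX];
};

struct cred_demo {		/* internal copy */
	int ngroups;
	unsigned int groups[DEMO_NGROUPS_MAX];
};

int
cred_import(struct cred_demo *dst, const struct xcred_demo *src)
{
	/* Reject the count before it is used as a length. */
	if (src->ngroups < 0 || src->ngroups > DEMO_NGROUPS_MAX)
		return EINVAL;

	dst->ngroups = src->ngroups;
	memcpy(dst->groups, src->groups,
	    (size_t)src->ngroups * sizeof(src->groups[0]));
	return 0;
}

/* Convenience wrapper: try an import with the given group count. */
int
cred_import_count(int ngroups)
{
	struct xcred_demo src = { 0 };
	struct cred_demo dst;

	src.ngroups = ngroups;
	return cred_import(&dst, &src);
}
```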


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free
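
A userland sketch of the idiom, assuming a hypothetical struct
item_demo. Writing the size as sizeof(*ptr) keeps the allocation and
any later size argument correct even if the pointer's type changes.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical structure, for illustration only. */
struct item_demo {
	long id;
	char name[24];
};

/* Allocate, clear and release one item; returns 1 on success. */
int
item_roundtrip(long id)
{
	struct item_demo *ip;
	int ok;

	/* The size follows the pointer's type automatically. */
	ip = malloc(sizeof(*ip));
	if (ip == NULL)
		return 0;
	memset(ip, 0, sizeof(*ip));

	ip->id = id;
	ok = (ip->id == id && ip->name[0] == '\0');

	/*
	 * Userland free() takes no size. The kernel's free(9) does, and
	 * the same sizeof(*ip) expression is what would be passed there.
	 */
	free(ip);
	return ok;
}
```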


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too
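
A short sketch of the format string fix, with a hypothetical helper
(format_ctime is not a kernel function). Since the width of time_t
varies between platforms, the value is cast to long long and printed
with %lld.

```c
#include <stdio.h>
#include <time.h>

/*
 * Format a timestamp portably: the cast makes the argument match
 * %lld regardless of how wide time_t is on the platform.
 * Returns the number of characters the full string needs.
 */
int
format_ctime(char *buf, size_t len, time_t t)
{
	return snprintf(buf, len, "ctime=%lld", (long long)t);
}
```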


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continuing performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@
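
A minimal sketch of a membership test that also considers the
effective gid, assuming a hypothetical credential layout (cred_demo is
illustrative and does not mirror struct ucred).

```c
#include <stddef.h>

#define DEMO_NGROUPS 16		/* hypothetical limit */

/* Hypothetical credential layout, for illustration only. */
struct cred_demo {
	unsigned int egid;
	int ngroups;
	unsigned int groups[DEMO_NGROUPS];
};

/* Returns 1 if gid is the effective gid or a supplementary group. */
int
groupmember_demo(unsigned int gid, const struct cred_demo *cr)
{
	int i;

	/* Check the effective gid first; it need not be in groups[]. */
	if (cr->egid == gid)
		return 1;
	for (i = 0; i < cr->ngroups; i++)
		if (cr->groups[i] == gid)
			return 1;
	return 0;
}
```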


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.
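
A simplified sketch of the "struct full of function pointers"
approach, with hypothetical names throughout (vops_demo, vnode_demo
and demofs are illustrative; the real operation set is much larger).
Each filesystem fills in its own table directly and callers dispatch
through the vnode's pointer.

```c
#include <errno.h>
#include <stddef.h>

struct vnode_demo;

/* One table of operations per filesystem; unset entries stay NULL. */
struct vops_demo {
	int (*vop_open)(struct vnode_demo *);
	int (*vop_fsync)(struct vnode_demo *);
};

struct vnode_demo {
	const struct vops_demo *v_op;
	int opened;
};

static int
demofs_open(struct vnode_demo *vp)
{
	vp->opened = 1;
	return 0;
}

/* The hypothetical filesystem sets only what it implements. */
static const struct vops_demo demofs_vops = {
	.vop_open = demofs_open,
};

/* Dispatch wrappers: a missing operation is reported, not called. */
int
vop_open_demo(struct vnode_demo *vp)
{
	if (vp->v_op->vop_open == NULL)
		return EOPNOTSUPP;
	return vp->v_op->vop_open(vp);
}

int
vop_fsync_demo(struct vnode_demo *vp)
{
	if (vp->v_op->vop_fsync == NULL)
		return EOPNOTSUPP;
	return vp->v_op->vop_fsync(vp);
}

/* Open succeeds, fsync is not provided by demofs. */
int
demo_dispatch(void)
{
	struct vnode_demo vn = { &demofs_vops, 0 };

	if (vop_open_demo(&vn) != 0 || !vn.opened)
		return -1;
	return vop_fsync_demo(&vn);
}
```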


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used in checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro
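
A toy sketch of the restart-only-if-slept pattern, assuming a
hypothetical process table and a release function whose return value
says whether it slept (proc_demo, release_demo and retarget_all are
illustrative, not the kernel's checkdirs() or vrele()). The walk
starts over only when the release reports that it may have slept.

```c
/* Hypothetical process entry: cwd is just an integer handle here. */
struct proc_demo {
	int cwd;
};

/* Stand-in for a release that reports whether it had to sleep. */
static int
release_demo(int slept)
{
	return slept;
}

/*
 * Point every entry that uses oldv at newv. The table may change
 * while we sleep, so restart the walk only when the release says it
 * slept. Returns the number of restarts.
 */
int
retarget_all(struct proc_demo *procs, int n, int oldv, int newv, int slept)
{
	int i, restarts = 0;

again:
	for (i = 0; i < n; i++) {
		if (procs[i].cwd != oldv)
			continue;
		procs[i].cwd = newv;
		if (release_demo(slept) != 0) {
			restarts++;
			goto again;
		}
	}
	return restarts;
}

/* Three entries, all using the old handle. */
int
demo_restarts(int slept)
{
	struct proc_demo procs[3] = { { 1 }, { 1 }, { 1 } };

	return retarget_all(procs, 3, 1, 2, slept);
}
```

With a release that never sleeps the table is walked once; with one
that always sleeps, every replaced entry costs a full restart.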


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''; thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8 code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow pulling out a usb stick with ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into an own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Oops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing length checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from ours (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches for McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a file's owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.306 31-Aug-2021 claudio

Swap lock flags so that LK_EXCLUSIVE is first like in all other places.


# 1.305 28-Apr-2021 claudio

Introduce a global vnode_mtx and use it to make vn_lock() safe to be called
without the KERNEL_LOCK.
This moves VXLOCK and VXWANT to a mutex protected v_lflag field and also
v_lockcount is protected by this mutex.

The vn_lock() dance is overly complex and all of this should probably be replaced
by a proper lock on the vnode but such a diff is a lot more complex. This
is an intermediate step so that at least some calls can be modified to grab
the KERNEL_LOCK later or not at all.

OK mpi@


Revision tags: OPENBSD_6_9_BASE
# 1.304 29-Jan-2021 claudio

Use NULL instead of 0 to clear v_socket pointer (which actually clears all
of the v_un pointers).
OK jsg@ mvs@


Revision tags: OPENBSD_6_8_BASE
# 1.303 23-Aug-2020 kn

Remove unused debug_syncprt, improve debug sysctl handling

"syncprt" is unused since kern/vfs_syscalls.c r1.147 from 2008.

Adding new debug sysctls is a bit opaque and looking at kern/kern_sysctl.c
the only visible difference between used and stub ctldebug structs in the
debugvars[] array is their extern keyword, indicating that it is defined
elsewhere.

sys/sysctl.h declares all debugN members as extern upfront, but these
declarations are not needed.

Remove the unused debug sysctl, rename the only remaining one to something
meaningful and remove forward declarations from /sys/sysctl.h; this way,
adding new debug sysctls is a matter of adding extern and coming up with a
name, which is nicer to read on its own and better to grep for.

OK mpi


# 1.302 22-Aug-2020 kn

Move sysctl(2) CTL_DEBUG from DEBUG to new DEBUG_SYSCTL

Adding "debug.my-knob" sysctls is really helpful to select different
code paths and/or log on demand during runtime without recompile,
but as this code is under DEBUG, lots of other noise comes with it
which is often undesired, at least when looking at specific subsystems
only.

Adding globals to the kernel and breaking into DDB to change them helps,
but that does not work over SSH, hence the need for debug sysctls.

Introduces DEBUG_SYSCTL to make use of the "debug" MIB without the rest of
DEBUG; it's DEBUG_SYSCTL and not SYSCTL_DEBUG because it's not a general
option for all of sysctl(2).

OK gnezdo


Revision tags: OPENBSD_6_7_BASE
# 1.301 27-Mar-2020 anton

Relax the lockcount assertion in vputonfreelist(). Back when I fixed
several problems with the vnode exclusive lock implementation, I
overlooked the fact that a vnode can be in a state where the usecount is
zero while the holdcount is still positive. There could still be
threads waiting on the vnode lock in uvn_io() as long as the holdcount
is positive.

"go ahead" mpi@

Reported-by: syzbot+767d6deb1a647850a0ca@syzkaller.appspotmail.com


# 1.300 13-Feb-2020 claudio

Move the LK_DRAIN logic from VOP_LOCK() to vclean() the only caller of
VOP_LOCK with LK_DRAIN. This simplifies VOP_LOCK() a fair bit.
OK visa@


# 1.299 20-Jan-2020 claudio

struct vops is not modified during runtime so use const which moves each
into read-only data segment.
OK deraadt@ tedu@


# 1.298 10-Jan-2020 bluhm

Convert the vnode list at the mount point into a tailq. During
unmount this list is traversed and the dirty vnodes are flushed to
disk. Forced unmount expects that the list is empty after flushing,
otherwise the kernel panics with "dangling vnode". As the write
to disk can sleep, new vnodes may be inserted. If softdep is
enabled, resolving the dependencies creates new dirty vnodes and
inserts them to the list. To fix the panic, let insmntque() insert
new vnodes at the tail of the list. Then vflush() will still catch
them while traversing the list in forward direction.
OK tedu@ millert@ visa@
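
The tail-insertion argument above can be sketched in userland C with the
<sys/queue.h> TAILQ macros. This is only an illustration, not the kernel
code: struct demo_vnode, demo_flush_all() and the spawn_id parameter are
hypothetical names, and the "flush" is reduced to clearing a flag. It shows
that a node appended with TAILQ_INSERT_TAIL() during a forward
TAILQ_FOREACH() walk is still reached by that same walk.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a vnode hanging off a mount point. */
struct demo_vnode {
	int id;
	int dirty;
	TAILQ_ENTRY(demo_vnode) entries;
};
TAILQ_HEAD(demo_vnode_list, demo_vnode);

/*
 * Walk the list front to back and "flush" every node by clearing its
 * dirty flag.  While handling the node whose id equals spawn_id, append
 * extra (if non-NULL) to the tail, imitating a flush that creates a new
 * dirty vnode.  Returns the number of nodes visited.
 */
static int
demo_flush_all(struct demo_vnode_list *list, struct demo_vnode *extra,
    int spawn_id)
{
	struct demo_vnode *vp;
	int visited = 0;

	TAILQ_FOREACH(vp, list, entries) {
		vp->dirty = 0;
		visited++;
		if (extra != NULL && vp->id == spawn_id) {
			/* New nodes go to the tail, ahead of the cursor. */
			TAILQ_INSERT_TAIL(list, extra, entries);
			extra = NULL;
		}
	}
	return visited;
}
```

Had the new node been inserted at the head instead, the forward walk would
already be past it and the node would be left behind dirty.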


# 1.297 30-Dec-2019 bluhm

In vcount() a safe loop over vnodes was committed to 4.4BSD in 1994.
This is not necessary as the loop is restarted after vgone(). Switch
to SLIST_FOREACH without _SAFE.
OK visa@


# 1.296 27-Dec-2019 bluhm

Convert the speclisth hash buckets into SLIST macros. This makes
the vnode alias code more readable.
OK visa@


# 1.295 26-Dec-2019 bluhm

Fix white spaces.


# 1.294 08-Dec-2019 mpi

Convert infinite sleeps to tsleep_nsec(9).

ok visa@, jca@


Revision tags: OPENBSD_6_6_BASE
# 1.293 26-Aug-2019 anton

When a thread tries to exclusively lock a vnode, the same thread must
ensure that any other thread currently trying to acquire the underlying
vnode lock has observed that the same vnode is about to be exclusively
locked. Such threads must then sleep until the exclusive lock has been
released and then try to acquire the lock again. Otherwise, exclusive
access to the vnode cannot be guaranteed.

Thanks to naddy@ and visa@ for testing; ok visa@

Reported-by: syzbot+374d0e7e2400004957f7@syzkaller.appspotmail.com


# 1.292 25-Jul-2019 cheloha

vinvalbuf(9): tsleep(9) -> tsleep_nsec(9); ok millert@


# 1.291 19-Jul-2019 cheloha

vwaitforio(9): tsleep(9) -> tsleep_nsec(9); ok visa@


# 1.290 28-Jun-2019 visa

Skip VFS barrier lock during normal operation to reduce overhead.
This removes a system-wide serialization point, which might help
finding timing-related bugs.

OK deraadt@ anton@


# 1.289 09-Jun-2019 beck

Add a temporary workaround to make removal of giant files better

mlarkin@ noticed we would freeze while removing enormous files because
of the amount of work done to invalidate buffers on unlink. This adds
a temporary workaround to ensure we give up the lock and yield while
doing this.

The longer term answer will be to move these buffers to another list
and not do the work here.

ok deraadt@


# 1.288 19-Apr-2019 visa

Add a subsystem lock for vfs_lockf.c. This enables calling lf_advlock()
and lf_purgelocks() without the kernel lock.

OK anton@ mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.287 02-Apr-2019 visa

Restrict which filesystems are available for swap. This rules out
obvious misconfigurations that cannot work.

OK mpi@ tedu@


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa
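
A minimal sketch of the idea described above, assuming a per-vnode flag
word. The names demo_vnode, DEMO_VBIOERROR, demo_write_done() and
demo_fsync() are hypothetical, and returning EIO is an assumption made for
illustration; the commit message only says that an error is returned.

```c
#include <errno.h>

/* Hypothetical flag bit recording that an earlier write failed. */
#define DEMO_VBIOERROR	0x01

/* Hypothetical stand-in for the relevant part of a vnode. */
struct demo_vnode {
	int bioflag;
};

/*
 * Called when a write completes.  The failed buffer itself is thrown
 * away, so the only lasting record of the failure is the flag.
 */
static void
demo_write_done(struct demo_vnode *vp, int error)
{
	if (error != 0)
		vp->bioflag |= DEMO_VBIOERROR;
}

/*
 * A later fsync reports the remembered failure instead of success.
 * EIO is an assumed error code for this sketch.
 */
static int
demo_fsync(const struct demo_vnode *vp)
{
	if (vp->bioflag & DEMO_VBIOERROR)
		return EIO;
	return 0;
}
```

The flag is never cleared in this sketch, so later successful writes do
not hide the earlier failure.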


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@
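
As a rough userland analogue of this change, the sketch below keeps an
unsigned reference count and updates it with C11 <stdatomic.h> operations.
The kernel uses its own atomic primitives, so the C11 calls here are a
stand-in, and demo_vfsconf, demo_ref() and demo_unref() are hypothetical
names.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for a filesystem type with a reference count. */
struct demo_vfsconf {
	atomic_uint refcount;
};

/* Take a reference; the increment is a single atomic read-modify-write. */
static void
demo_ref(struct demo_vfsconf *vfc)
{
	atomic_fetch_add(&vfc->refcount, 1);
}

/* Drop a reference and return the number of references that remain. */
static unsigned int
demo_unref(struct demo_vfsconf *vfc)
{
	/* atomic_fetch_sub() returns the value before the subtraction. */
	return atomic_fetch_sub(&vfc->refcount, 1) - 1;
}
```

With atomic updates, concurrent callers cannot lose an increment or
decrement even when no lock is held around the counter.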


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules are no longer, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
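
The cast can be illustrated with a small, hypothetical example; it is not
the code touched by this commit. enum demo_type and demo_type_name() are
made-up names. Converting the enum value to unsigned int turns any negative
value into a large positive one, so a single upper-bound comparison covers
both ends of the range.

```c
#include <stddef.h>

/* Hypothetical enum; DEMO_NTYPES counts the valid values. */
enum demo_type { DEMO_NONE, DEMO_REG, DEMO_DIR, DEMO_NTYPES };

static const char *const demo_names[DEMO_NTYPES] = {
	"none", "reg", "dir"
};

/*
 * Return the name for t, or NULL when t is outside the table.
 * The cast folds negative values into large unsigned ones, so no
 * separate lower-bound test is needed.
 */
static const char *
demo_type_name(enum demo_type t)
{
	if ((unsigned int)t >= DEMO_NTYPES)
		return NULL;
	return demo_names[t];
}
```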


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb
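
A rough userland sketch of a separately allocated clone bitmap follows. Only
the limit of 1024 comes from the message above; the demo_* names, the use of
calloc() and the lowest-free-bit policy are assumptions made for illustration
and are not taken from the kernel source.

```c
#include <limits.h>
#include <stdlib.h>

#define DEMO_NCLONES	1024	/* limit named in the commit message */
#define DEMO_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define DEMO_WORDS	((DEMO_NCLONES + DEMO_BITS - 1) / DEMO_BITS)

/* Allocate a zeroed bitmap on its own, not embedded in every device node. */
unsigned long *
demo_bitmap_alloc(void)
{
	return calloc(DEMO_WORDS, sizeof(unsigned long));
}

/* Claim the lowest free clone number, or return -1 if all are in use. */
int
demo_clone_get(unsigned long *map)
{
	size_t i;

	for (i = 0; i < DEMO_NCLONES; i++) {
		unsigned long bit = 1UL << (i % DEMO_BITS);

		if ((map[i / DEMO_BITS] & bit) == 0) {
			map[i / DEMO_BITS] |= bit;
			return (int)i;
		}
	}
	return -1;
}

/* Release a previously claimed clone number (n must be in range). */
void
demo_clone_put(unsigned long *map, int n)
{
	map[(size_t)n / DEMO_BITS] &= ~(1UL << ((size_t)n % DEMO_BITS));
}
```

Allocating the map separately means only nodes that actually clone pay for
the 1024 bits; everything else carries at most a pointer.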


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
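
To illustrate why chained assignments are fragile, here is a small sketch
with a hypothetical two-member struct (the real struct vattr has more members
and other types). In C the value of an assignment expression has the type of
its left operand, so the outer lvalue receives the already-converted value.

```c
#include <stdint.h>

/* Hypothetical attribute struct with one narrow and one wide member. */
struct demo_attr {
	uint16_t small;	/* narrow, unsigned */
	int64_t big;	/* wide, signed */
};

/* Chained: big receives the value after conversion to uint16_t. */
void
demo_null_chained(struct demo_attr *a)
{
	a->big = a->small = -1;
}

/* Separate statements: each member is converted from -1 independently. */
void
demo_null_separate(struct demo_attr *a)
{
	a->small = -1;
	a->big = -1;
}
```

With the chained form big ends up as 65535 rather than -1; swapping the
order of the two lvalues would hide the problem, which is the "safe order"
the message refers to.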


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in bytes, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and dies.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@
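
A minimal sketch of the range check described above. The struct and function
names are hypothetical stand-ins; only the cr_ngroups field name, the
NGROUPS_MAX bound and the EINVAL result come from the message.

```c
#include <errno.h>
#include <limits.h>

#ifndef NGROUPS_MAX
#define NGROUPS_MAX 16	/* fallback for this sketch only */
#endif

/* Hypothetical stand-in for the credentials supplied with an export. */
struct demo_cred {
	int cr_ngroups;
};

/* Reject group counts that could not index a valid group array. */
int
demo_check_export_cred(const struct demo_cred *cr)
{
	if (cr->cr_ngroups < 0 || cr->cr_ngroups > NGROUPS_MAX)
		return EINVAL;
	return 0;
}
```

Validating the count before it is used as a loop bound or copy length keeps
user-supplied data from walking off the end of the group array.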


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free
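
A userland sketch of the idiom, using a hypothetical struct demo_node and the
standard malloc() instead of the kernel allocator:

```c
#include <stdlib.h>

/* Hypothetical list node used only for this example. */
struct demo_node {
	int id;
	struct demo_node *next;
};

struct demo_node *
demo_node_alloc(int id)
{
	struct demo_node *n;

	/* sizeof(*n) follows n's type even if the declaration changes later. */
	n = malloc(sizeof(*n));
	if (n == NULL)
		return NULL;
	n->id = id;
	n->next = NULL;
	return n;
}
```

Writing sizeof(struct demo_node) instead would compile just as well, but it
silently goes stale if the pointer's type is ever changed.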


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too
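
The cast-to-long-long pattern can be sketched in userland as follows;
demo_format_time() is a hypothetical helper, not a function from the tree.

```c
#include <stdio.h>
#include <time.h>

/*
 * The width of time_t varies between platforms, so cast explicitly
 * and use a format specifier that matches the cast.
 */
int
demo_format_time(char *buf, size_t len, time_t t)
{
	return snprintf(buf, len, "%lld", (long long)t);
}
```

The explicit cast keeps the argument and the %lld specifier in agreement
regardless of how wide time_t is on a given architecture.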


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continuing performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__)))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@
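
A simplified sketch of a membership test that also considers the effective
gid. The demo_ucred layout and the demo_groupmember() name are assumptions
made for this example and do not mirror the kernel's struct ucred exactly.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, simplified credential structure. */
struct demo_ucred {
	gid_t cr_gid;		/* effective gid */
	size_t cr_ngroups;	/* number of supplementary groups */
	const gid_t *cr_groups;	/* supplementary groups */
};

/* Return 1 if gid is the effective gid or a supplementary group. */
int
demo_groupmember(gid_t gid, const struct demo_ucred *cr)
{
	size_t i;

	if (cr->cr_gid == gid)
		return 1;
	for (i = 0; i < cr->cr_ngroups; i++)
		if (cr->cr_groups[i] == gid)
			return 1;
	return 0;
}
```

Without the first test, a process whose effective gid is not repeated in its
supplementary list would be treated as a non-member of its own group.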


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.
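
As a rough illustration of "a simple struct full of function pointers that
get set directly by each of the filesystems", consider the sketch below. All
demo_* names are hypothetical, and the real operations table has far more
members.

```c
#include <stddef.h>

struct demo_vnode;

/* One plain table of function pointers per filesystem. */
struct demo_vops {
	int (*vop_open)(struct demo_vnode *);
	int (*vop_close)(struct demo_vnode *);
};

struct demo_vnode {
	const struct demo_vops *v_op;	/* operations for this vnode */
	int v_opencount;
};

int
demo_fs_open(struct demo_vnode *vp)
{
	vp->v_opencount++;
	return 0;
}

int
demo_fs_close(struct demo_vnode *vp)
{
	vp->v_opencount--;
	return 0;
}

/* The filesystem fills in its table directly, with no vector setup pass. */
const struct demo_vops demo_fs_vops = {
	.vop_open = demo_fs_open,
	.vop_close = demo_fs_close,
};

/* Wrapper: a single indirect call through the vnode's table. */
int
demo_VOP_OPEN(struct demo_vnode *vp)
{
	return vp->v_op->vop_open(vp);
}
```

In such a layout each filesystem has to list its spec and fifo operations
itself, which matches the downside the message mentions.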


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8 code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard-coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow to pull out an usb stick with ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information about the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return whether it actually succeeded in freeing some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into an own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing length checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from ours (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches from McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a file's owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.305 28-Apr-2021 claudio

Introduce a global vnode_mtx and use it to make vn_lock() safe to be called
without the KERNEL_LOCK.
This moves VXLOCK and VXWANT to a mutex protected v_lflag field and also
v_lockcount is protected by this mutex.

The vn_lock() dance is overly complex and all of this should probably be replaced
by a proper lock on the vnode but such a diff is a lot more complex. This
is an intermediate step so that at least some calls can be modified to grab
the KERNEL_LOCK later or not at all.

OK mpi@


Revision tags: OPENBSD_6_9_BASE
# 1.304 29-Jan-2021 claudio

Use NULL instead of 0 to clear v_socket pointer (which actually clears all
of the v_un pointers).
OK jsg@ mvs@


Revision tags: OPENBSD_6_8_BASE
# 1.303 23-Aug-2020 kn

Remove unused debug_syncprt, improve debug sysctl handling

"syncprt" is unused since kern/vfs_syscalls.c r1.147 from 2008.

Adding new debug sysctls is a bit opaque and looking at kern/kern_sysctl.c
the only visible difference between used and stub ctldebug structs in the
debugvars[] array is their extern keyword, indicating that it is defined
elsewhere.

sys/sysctl.h declares all debugN members as extern upfront, but these
declarations are not needed.

Remove the unused debug sysctl, rename the only remaining one to something
meaningful and remove forward declarations from /sys/sysctl.h; this way,
adding new debug sysctls is a matter of adding extern and coming up with a
name, which is nicer to read on its own and better to grep for.

OK mpi


# 1.302 22-Aug-2020 kn

Move sysctl(2) CTL_DEBUG from DEBUG to new DEBUG_SYSCTL

Adding "debug.my-knob" sysctls is really helpful to select different
code paths and/or log on demand during runtime without recompile,
but as this code is under DEBUG, lots of other noise comes with it
which is often undesired, at least when looking at specific subsystems
only.

Adding globals to the kernel and breaking into DDB to change them helps,
but that does not work over SSH, hence the need for debug sysctls.

Introduces DEBUG_SYSCTL to make use of the "debug" MIB without the rest of
DEBUG; it's DEBUG_SYSCTL and not SYSCTL_DEBUG because it's not a general
option for all of sysctl(2).

OK gnezdo


Revision tags: OPENBSD_6_7_BASE
# 1.301 27-Mar-2020 anton

Relax the lockcount assertion in vputonfreelist(). Back when I fixed
several problems with the vnode exclusive lock implementation, I
overlooked the fact that a vnode can be in a state where the usecount is
zero while the holdcount is still positive. There could still be
threads waiting on the vnode lock in uvn_io() as long as the holdcount
is positive.

"go ahead" mpi@

Reported-by: syzbot+767d6deb1a647850a0ca@syzkaller.appspotmail.com


# 1.300 13-Feb-2020 claudio

Move the LK_DRAIN logic from VOP_LOCK() to vclean() the only caller of
VOP_LOCK with LK_DRAIN. This simplifies VOP_LOCK() a fair bit.
OK visa@


# 1.299 20-Jan-2020 claudio

struct vops is not modified during runtime so use const which moves each
into read-only data segment.
OK deraadt@ tedu@


# 1.298 10-Jan-2020 bluhm

Convert the vnode list at the mount point into a tailq. During
unmount this list is traversed and the dirty vnodes are flushed to
disk. Forced unmount expects that the list is empty after flushing,
otherwise the kernel panics with "dangling vnode". As the write
to disk can sleep, new vnodes may be inserted. If softdep is
enabled, resolving the dependencies creates new dirty vnodes and
inserts them to the list. To fix the panic, let insmntque() insert
new vnodes at the tail of the list. Then vflush() will still catch
them while traversing the list in forward direction.
OK tedu@ millert@ visa@


# 1.297 30-Dec-2019 bluhm

In vcount() a safe loop over vnodes was committed to 4.4BSD in 1994.
This is not necessary as the loop is restarted after vgone(). Switch
to SLIST_FOREACH without _SAFE.
OK visa@


# 1.296 27-Dec-2019 bluhm

Convert the speclisth hash buckets into SLIST macros. This makes
the vnode alias code more readable.
OK visa@


# 1.295 26-Dec-2019 bluhm

Fix white spaces.


# 1.294 08-Dec-2019 mpi

Convert infinite sleeps to tsleep_nsec(9).

ok visa@, jca@


Revision tags: OPENBSD_6_6_BASE
# 1.293 26-Aug-2019 anton

When a thread tries to exclusively lock a vnode, the same thread must
ensure that any other thread currently trying to acquire the underlying
vnode lock has observed that the same vnode is about to be exclusively
locked. Such threads must then sleep until the exclusive lock has been
released and then try to acquire the lock again. Otherwise, exclusive
access to the vnode cannot be guaranteed.

Thanks to naddy@ and visa@ for testing; ok visa@

Reported-by: syzbot+374d0e7e2400004957f7@syzkaller.appspotmail.com


# 1.292 25-Jul-2019 cheloha

vinvalbuf(9): tsleep(9) -> tsleep_nsec(9); ok millert@


# 1.291 19-Jul-2019 cheloha

vwaitforio(9): tsleep(9) -> tsleep_nsec(9); ok visa@


# 1.290 28-Jun-2019 visa

Skip VFS barrier lock during normal operation to reduce overhead.
This removes a system-wide serialization point, which might help
finding timing-related bugs.

OK deraadt@ anton@


# 1.289 09-Jun-2019 beck

Add a temporary workaround to make removal of giant files better

mlarkin@ noticed we would freeze while removing enormous files because
of the amount of work done to invalidate buffers on unlink. This adds
a temporary workaround to ensure we give up the lock and yield while
doing this.

The longer term answer will be to move these buffers to another list
and not do the work here.

ok deraadt@


# 1.288 19-Apr-2019 visa

Add a subsystem lock for vfs_lockf.c. This enables calling lf_advlock()
and lf_purgelocks() without the kernel lock.

OK anton@ mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.287 02-Apr-2019 visa

Restrict which filesystems are available for swap. This rules out
obvious misconfigurations that cannot work.

OK mpi@ tedu@


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules no longer exist, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
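
The commit message for r1.254 does not quote the expression that was changed, so the following is only a sketch of the general pattern, under stated assumptions. enum kind, kind_names[] and kind_in_bounds() are hypothetical names, not identifiers from vfs_subr.c. Casting the enum value to unsigned int before the comparison lets one test reject both negative and too-large values, and gives the compiler an ordinary integer comparison instead of a comparison against the enum's own range.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical enum used to index a table of names. */
enum kind { K_NONE, K_REG, K_DIR };

static const char *const kind_names[] = { "none", "reg", "dir" };

#define NKINDS (sizeof(kind_names) / sizeof(kind_names[0]))

/*
 * Return 1 if k is a valid index into kind_names[], 0 otherwise.
 * The cast to unsigned int makes a negative value wrap to a large
 * one, so a single upper-bound comparison covers both directions.
 */
int
kind_in_bounds(enum kind k)
{
	return (unsigned int)k < NKINDS;
}

/* Look up a name, falling back to "unknown" for out-of-range values. */
const char *
kind_name(enum kind k)
{
	if (!kind_in_bounds(k))
		return "unknown";
	return kind_names[k];
}
```

A value that arrives from a corrupted structure or an untrusted source, such as (enum kind)99, is then reported as "unknown" instead of indexing past the end of the table.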


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb
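
A minimal userland sketch of the idea described above, not the committed kernel code: the device node keeps only a pointer to a separately allocated 1024-bit map, so nodes that never clone pay for one pointer. The names ex_devnode, ex_clone_alloc and ex_clone_free are hypothetical; only the 1024 limit comes from the commit message.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define EX_CLONE_MAX	1024	/* limit named in the commit message */
#define EX_WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define EX_CLONE_WORDS	((EX_CLONE_MAX + EX_WORD_BITS - 1) / EX_WORD_BITS)

/* Hypothetical device node: holds only a pointer to the bitmap. */
struct ex_devnode {
	unsigned long *clonemap;	/* NULL until the first clone */
};

/* Return the lowest free clone number and mark it used, or -1 if none. */
int
ex_clone_alloc(struct ex_devnode *dn)
{
	size_t i;

	assert(dn != NULL);
	if (dn->clonemap == NULL) {
		/* Allocate lazily and zeroed, so every clone starts free. */
		dn->clonemap = calloc(EX_CLONE_WORDS, sizeof(*dn->clonemap));
		if (dn->clonemap == NULL)
			return -1;
	}
	for (i = 0; i < EX_CLONE_MAX; i++) {
		unsigned long bit = 1UL << (i % EX_WORD_BITS);

		if ((dn->clonemap[i / EX_WORD_BITS] & bit) == 0) {
			dn->clonemap[i / EX_WORD_BITS] |= bit;
			return (int)i;
		}
	}
	return -1;
}

/* Mark clone number n as free again; out-of-range values are ignored. */
void
ex_clone_free(struct ex_devnode *dn, int n)
{
	assert(dn != NULL);
	if (dn->clonemap != NULL && n >= 0 && n < EX_CLONE_MAX)
		dn->clonemap[n / EX_WORD_BITS] &= ~(1UL << (n % EX_WORD_BITS));
}
```

Because the map lives behind a pointer, several nodes could in principle point at the same allocation.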


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in bytes, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and can die.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too
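
For illustration, the cast pattern looks like this in userland C. ex_format_time is a hypothetical helper; the kernel code uses its own printf, but the format and cast are the same idea: time_t has no dedicated conversion specifier, so it is widened to long long and printed with %lld.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Format a time_t portably, whatever its underlying width. */
int
ex_format_time(char *buf, size_t len, time_t t)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%lld", (long long)t);
}
```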


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continuing performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

Add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8 code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow pulling out a usb stick with ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into an own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing lengths checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from our (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches from McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a file's owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.304 29-Jan-2021 claudio

Use NULL instead of 0 to clear v_socket pointer (which actually clears all
of the v_un pointers).
OK jsg@ mvs@


Revision tags: OPENBSD_6_8_BASE
# 1.303 23-Aug-2020 kn

Remove unused debug_syncprt, improve debug sysctl handling

"syncprt" is unused since kern/vfs_syscalls.c r1.147 from 2008.

Adding new debug sysctls is a bit opaque and looking at kern/kern_sysctl.c
the only visible difference between used and stub ctldebug structs in the
debugvars[] array is their extern keyword, indicating that it is defined
elsewhere.

sys/sysctl.h declares all debugN members as extern upfront, but these
declarations are not needed.

Remove the unused debug sysctl, rename the only remaining one to something
meaningful and remove forward declarations from /sys/sysctl.h; this way,
adding new debug sysctls is a matter of adding extern and coming up with a
name, which is nicer to read on its own and better to grep for.

OK mpi


# 1.302 22-Aug-2020 kn

Move sysctl(2) CTL_DEBUG from DEBUG to new DEBUG_SYSCTL

Adding "debug.my-knob" sysctls is really helpful to select different
code paths and/or log on demand during runtime without recompile,
but as this code is under DEBUG, lots of other noise comes with it
which is often undesired, at least when looking at specific subsystems
only.

Adding globals to the kernel and breaking into DDB to change them helps,
but that does not work over SSH, hence the need for debug sysctls.

Introduces DEBUG_SYSCTL to make use of the "debug" MIB without the rest of
DEBUG; it's DEBUG_SYSCTL and not SYSCTL_DEBUG because it's not a general
option for all of sysctl(2).

OK gnezdo


Revision tags: OPENBSD_6_7_BASE
# 1.301 27-Mar-2020 anton

Relax the lockcount assertion in vputonfreelist(). Back when I fixed
several problems with the vnode exclusive lock implementation, I
overlooked the fact that a vnode can be in a state where the usecount is
zero while the holdcount is still positive. There could still be
threads waiting on the vnode lock in uvn_io() as long as the holdcount
is positive.

"go ahead" mpi@

Reported-by: syzbot+767d6deb1a647850a0ca@syzkaller.appspotmail.com


# 1.300 13-Feb-2020 claudio

Move the LK_DRAIN logic from VOP_LOCK() to vclean() the only caller of
VOP_LOCK with LK_DRAIN. This simplifies VOP_LOCK() a fair bit.
OK visa@


# 1.299 20-Jan-2020 claudio

struct vops is not modified during runtime so use const which moves each
into read-only data segment.
OK deraadt@ tedu@


# 1.298 10-Jan-2020 bluhm

Convert the vnode list at the mount point into a tailq. During
unmount this list is traversed and the dirty vnodes are flushed to
disk. Forced unmount expects that the list is empty after flushing,
otherwise the kernel panics with "dangling vnode". As the write
to disk can sleep, new vnodes may be inserted. If softdep is
enabled, resolving the dependencies creates new dirty vnodes and
inserts them to the list. To fix the panic, let insmntque() insert
new vnodes at the tail of the list. Then vflush() will still catch
them while traversing the list in forward direction.
OK tedu@ millert@ visa@
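
The effect of inserting at the tail can be sketched in userland with the
<sys/queue.h> TAILQ macros. This is only an illustration of the traversal
property described above, not the kernel code: struct node, visit_count()
and the single "new vnode" added during the walk are hypothetical.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a vnode on a mount point's list. */
struct node {
	int id;
	TAILQ_ENTRY(node) link;
};
TAILQ_HEAD(nodelist, node);

/*
 * Walk the list forward. While visiting the first node, add one new
 * node (simulating a vnode created while the flush sleeps), either
 * at the tail or at the head. Return how many nodes were visited.
 */
int
visit_count(int insert_at_tail)
{
	struct nodelist list;
	struct node a = { .id = 1 }, b = { .id = 2 }, extra = { .id = 3 };
	struct node *n;
	int visited = 0, added = 0;

	TAILQ_INIT(&list);
	TAILQ_INSERT_TAIL(&list, &a, link);
	TAILQ_INSERT_TAIL(&list, &b, link);

	TAILQ_FOREACH(n, &list, link) {
		visited++;
		if (!added) {
			added = 1;
			if (insert_at_tail)
				TAILQ_INSERT_TAIL(&list, &extra, link);
			else
				TAILQ_INSERT_HEAD(&list, &extra, link);
		}
	}
	return visited;
}
```

With tail insertion the forward walk still reaches the late arrival; with
head insertion the new node lands behind the cursor and is missed.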


# 1.297 30-Dec-2019 bluhm

In vcount() a safe loop over vnodes was committed to 4.4BSD in 1994.
This is not necessary as the loop is restarted after vgone(). Switch
to SLIST_FOREACH without _SAFE.
OK visa@


# 1.296 27-Dec-2019 bluhm

Convert the speclisth hash buckets into SLIST macros. This makes
the vnode alias code more readable.
OK visa@


# 1.295 26-Dec-2019 bluhm

Fix white spaces.


# 1.294 08-Dec-2019 mpi

Convert infinite sleeps to tsleep_nsec(9).

ok visa@, jca@


Revision tags: OPENBSD_6_6_BASE
# 1.293 26-Aug-2019 anton

When a thread tries to exclusively lock a vnode, the same thread must
ensure that any other thread currently trying to acquire the underlying
vnode lock has observed that the same vnode is about to be exclusively
locked. Such threads must then sleep until the exclusive lock has been
released and then try to acquire the lock again. Otherwise, exclusive
access to the vnode cannot be guaranteed.

Thanks to naddy@ and visa@ for testing; ok visa@

Reported-by: syzbot+374d0e7e2400004957f7@syzkaller.appspotmail.com


# 1.292 25-Jul-2019 cheloha

vinvalbuf(9): tsleep(9) -> tsleep_nsec(9); ok millert@


# 1.291 19-Jul-2019 cheloha

vwaitforio(9): tsleep(9) -> tsleep_nsec(9); ok visa@


# 1.290 28-Jun-2019 visa

Skip VFS barrier lock during normal operation to reduce overhead.
This removes a system-wide serialization point, which might help
finding timing-related bugs.

OK deraadt@ anton@


# 1.289 09-Jun-2019 beck

Add a temporary workaround to make removal of giant files better

mlarkin@ noticed we would freeze while removing enormous files because
of the amount of work done to invalidate buffers on unlink. This adds
a temporary workaround to ensure we give up the lock and yield while
doing this.

The longer term answer will be to move these buffers to another list
and not do the work here.

ok deraadt@


# 1.288 19-Apr-2019 visa

Add a subsystem lock for vfs_lockf.c. This enables calling lf_advlock()
and lf_purgelocks() without the kernel lock.

OK anton@ mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.287 02-Apr-2019 visa

Restrict which filesystems are available for swap. This rules out
obvious misconfigurations that cannot work.

OK mpi@ tedu@


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa
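
A minimal userland sketch of the sticky-error idea follows. The flag name
XV_WRITE_ERROR, struct xvnode and both functions are hypothetical, and the
choice of EIO as the reported error is an assumption; the message above only
says that an error is returned.

```c
#include <errno.h>

/* Hypothetical flag and structure; not the kernel's real names. */
#define XV_WRITE_ERROR	0x01

struct xvnode {
	int flags;
};

/* Called when a write fails and the buffer is thrown away. */
void
xvnode_write_failed(struct xvnode *vp)
{
	vp->flags |= XV_WRITE_ERROR;
}

/* A later fsync reports the earlier failure instead of success. */
int
xvnode_fsync(const struct xvnode *vp)
{
	if (vp->flags & XV_WRITE_ERROR)
		return EIO;	/* assumed errno for this sketch */
	return 0;
}
```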


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@
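
For readers outside the kernel, the same pattern can be sketched with C11
<stdatomic.h>. This is an assumption-laden illustration: the kernel uses its
own atomic primitives, and struct xvfsconf, xvfs_ref() and xvfs_unref() are
hypothetical names.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for struct vfsconf; only the counter is modeled. */
struct xvfsconf {
	atomic_uint refcount;
};

/* Take a reference; returns the new count. */
unsigned int
xvfs_ref(struct xvfsconf *vfc)
{
	return atomic_fetch_add(&vfc->refcount, 1) + 1;
}

/* Drop a reference; returns the new count. */
unsigned int
xvfs_unref(struct xvfsconf *vfc)
{
	return atomic_fetch_sub(&vfc->refcount, 1) - 1;
}
```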


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules are no longer, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@
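
Why the _SAFE form is unsafe here can be sketched in userland. The idiom
fetches the next pointer before the loop body runs, so if the next element
goes away during the body the loop continues with a stale pointer. The macro
X_FOREACH_SAFE is a local spelling of that idiom, and struct mnt and
stale_visit() are hypothetical; the "other process" is simulated by removing
the next element from inside the body.

```c
#include <stddef.h>
#include <sys/queue.h>

struct mnt {
	int removed;
	TAILQ_ENTRY(mnt) link;
};
TAILQ_HEAD(mntlist, mnt);

/* Local spelling of the usual _SAFE idiom: fetch next before the body. */
#define X_FOREACH_SAFE(var, head, field, tvar)				\
	for ((var) = TAILQ_FIRST(head);					\
	    (var) != NULL && ((tvar) = TAILQ_NEXT(var, field), 1);	\
	    (var) = (tvar))

/* Loop body: returns 1 if we were handed an already removed element. */
static int
body(struct mntlist *list, struct mnt *mp, struct mnt *first,
    struct mnt *victim)
{
	if (mp->removed)
		return 1;
	if (mp == first) {
		/* Simulate another process unmounting the next entry. */
		TAILQ_REMOVE(list, victim, link);
		victim->removed = 1;
	}
	return 0;
}

/* Returns 1 if the traversal visited the removed element. */
int
stale_visit(int use_safe)
{
	struct mntlist list;
	struct mnt a = { 0 }, b = { 0 }, c = { 0 };
	struct mnt *mp, *nmp;
	int stale = 0;

	TAILQ_INIT(&list);
	TAILQ_INSERT_TAIL(&list, &a, link);
	TAILQ_INSERT_TAIL(&list, &b, link);
	TAILQ_INSERT_TAIL(&list, &c, link);

	if (use_safe) {
		X_FOREACH_SAFE(mp, &list, link, nmp) {
			if ((stale = body(&list, mp, &a, &b)) != 0)
				break;
		}
	} else {
		TAILQ_FOREACH(mp, &list, link) {
			if ((stale = body(&list, mp, &a, &b)) != 0)
				break;
		}
	}
	return stale;
}
```

In the kernel the removed mount point would also be freed, so the stale
visit becomes a use-after-free rather than a harmless extra iteration.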


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
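
The cast can be illustrated with a small table lookup. The enum, the name
table and xtype_name() are hypothetical; the sketch only shows that one
unsigned comparison rejects both negative and too-large values.

```c
#include <stddef.h>

/* Hypothetical enum; XT_COUNT is the number of valid values. */
enum xtype { XT_NONE, XT_REG, XT_DIR, XT_COUNT };

static const char *const xtype_names[XT_COUNT] = { "none", "reg", "dir" };

const char *
xtype_name(enum xtype t)
{
	/* After the cast, a negative value becomes a large unsigned one. */
	if ((unsigned int)t >= XT_COUNT)
		return "unknown";
	return xtype_names[t];
}
```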


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@
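
The guard amounts to checking the function pointer before calling through
it. A minimal sketch, assuming a hypothetical ops structure with a single
optional hook; returning EOPNOTSUPP for a missing hook is an assumption of
this sketch, not something the message above states.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical ops table with one optional hook. */
struct xvfsops {
	int (*sysctl)(int name);
};

/* Example hook for a filesystem that does provide one. */
int
xsample_sysctl(int name)
{
	return name == 1 ? 0 : EINVAL;
}

int
xvfs_sysctl(const struct xvfsops *ops, int name)
{
	/* Filesystems may leave the hook unset; never call through NULL. */
	if (ops->sysctl == NULL)
		return EOPNOTSUPP;	/* assumed errno for this sketch */
	return ops->sysctl(name);
}
```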


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb
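
A separately allocated 1024-bit clone bitmap could look roughly like the
sketch below. Only the 1024 limit comes from the message above; the names
(XCLONE_MAX, xclone_get() and friends) and the lowest-free-bit policy are
assumptions made for illustration.

```c
#include <limits.h>
#include <stdlib.h>

#define XCLONE_MAX	1024	/* limit named in the commit message */
#define XBITS		(sizeof(unsigned long) * CHAR_BIT)
#define XWORDS		((XCLONE_MAX + XBITS - 1) / XBITS)

/* Allocate the bitmap on its own, so device nodes only carry a pointer. */
unsigned long *
xclone_bitmap_alloc(void)
{
	return calloc(XWORDS, sizeof(unsigned long));
}

/* Claim the lowest free instance number, or return -1 if all are taken. */
int
xclone_get(unsigned long *map)
{
	size_t i;

	for (i = 0; i < XCLONE_MAX; i++) {
		if ((map[i / XBITS] & (1UL << (i % XBITS))) == 0) {
			map[i / XBITS] |= 1UL << (i % XBITS);
			return (int)i;
		}
	}
	return -1;
}

/* Release instance number n so that it can be handed out again. */
void
xclone_put(unsigned long *map, int n)
{
	size_t i = (size_t)n;

	map[i / XBITS] &= ~(1UL << (i % XBITS));
}
```

Keeping the bitmap behind a pointer is what lets aliased nodes share one
bitmap and keeps nodes that never clone from paying for 128 bytes each.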


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
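
The fragility of chained assignments can be shown in a few lines. In C the
value of "b = c" is the value stored in b after conversion to b's type, so
"a = b = c" hands a the converted value. The struct and the XNOVAL marker
below are hypothetical stand-ins, not the real struct vattr.

```c
#include <stdint.h>

#define XNOVAL	(-1)	/* hypothetical "no value" marker */

struct xattr {
	int64_t  wide;
	uint16_t narrow;
};

/* Chained: wide receives the value of narrow *after* conversion. */
int64_t
chained_narrow_first(void)
{
	struct xattr a;

	a.wide = a.narrow = XNOVAL;
	return a.wide;		/* 65535, not -1 */
}

/* Separate statements: each field converts the marker for itself. */
int64_t
separate_assignments(void)
{
	struct xattr a;

	a.narrow = XNOVAL;
	a.wide = XNOVAL;
	return a.wide;		/* -1 */
}
```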


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in bytes, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and dies.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialize the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that an RNF_ROOT node was not returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@
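
The range check described above amounts to something like the following sketch. The function name is hypothetical and the code is userland C, not the kernel export code; the fallback NGROUPS_MAX value is an assumption used only if <limits.h> does not provide one.

```c
#include <errno.h>
#include <limits.h>

#ifndef NGROUPS_MAX
#define NGROUPS_MAX 16	/* assumption: fallback for illustration only */
#endif

/*
 * Reject a group count that is negative or larger than NGROUPS_MAX
 * before it is used to index or copy a group array.
 */
static int
demo_check_ngroups(int ngroups)
{
	if (ngroups < 0 || ngroups > NGROUPS_MAX)
		return (EINVAL);
	return (0);
}
```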


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free
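
A userland sketch of the idiom, using calloc(3) rather than the kernel allocator; struct demo_node and demo_node_alloc() are hypothetical names. Taking the size from the pointer keeps the allocation correct even if the pointer's type is changed later.

```c
#include <stdlib.h>

/* Hypothetical structure used only for this illustration. */
struct demo_node {
	int	id;
	char	name[32];
};

static struct demo_node *
demo_node_alloc(void)
{
	struct demo_node *n;

	/* sizeof(*n) follows the declared type of n automatically. */
	n = calloc(1, sizeof(*n));
	return (n);
}
```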


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@
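
As an illustration of what such a loop looks like once the *_END macro is gone, here is a small userland sketch using <sys/queue.h>. The struct and function names are hypothetical; only the TAILQ macros themselves are real.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical list element for the illustration. */
struct demo_item {
	int			value;
	TAILQ_ENTRY(demo_item)	entries;
};
TAILQ_HEAD(demo_list, demo_item);

/* Walk the list and terminate on NULL instead of TAILQ_END(head). */
static int
demo_sum(struct demo_list *head)
{
	struct demo_item *it;
	int sum = 0;

	for (it = TAILQ_FIRST(head); it != NULL;
	    it = TAILQ_NEXT(it, entries))
		sum += it->value;
	return (sum);
}
```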


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too
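
The portable pattern behind this kind of fix can be sketched as follows; demo_format_time() is a hypothetical helper, not code from the tree. Since the width of time_t varies between platforms, the value is cast to long long and printed with %lld.

```c
#include <stdio.h>
#include <time.h>

/*
 * Format a time_t as a decimal string. The cast makes the argument
 * match the %lld conversion regardless of how wide time_t is.
 */
static int
demo_format_time(char *buf, size_t len, time_t t)
{
	return (snprintf(buf, len, "%lld", (long long)t));
}
```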


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continuing performance issue with NFS,
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__((__format__(__kprintf__))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@
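
The membership test described above can be sketched like this. The credential structure is a simplified, hypothetical stand-in and not the kernel's struct ucred; the sketch only shows the effective gid being checked in addition to the supplementary list.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, simplified credential for illustration only. */
struct demo_cred {
	gid_t	egid;		/* effective group id */
	size_t	ngroups;	/* number of supplementary groups */
	gid_t	groups[16];	/* supplementary group ids */
};

/* Return 1 if gid is the effective gid or a supplementary group. */
static int
demo_groupmember(gid_t gid, const struct demo_cred *cr)
{
	size_t i;

	if (cr->egid == gid)
		return (1);
	for (i = 0; i < cr->ngroups; i++)
		if (cr->groups[i] == gid)
			return (1);
	return (0);
}
```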


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell, never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature, so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs; this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

Back out the change that made buf_acquire() do the bremfree() (made because
all callers were doing bremfree() before calling buf_acquire()).

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8; the code is much older.


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace an hard coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow pulling out a usb stick with an ffs filesystem while it is mounted
and a file is being written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and clean up the filesystems which implement VOP_LOCK()
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks of intense debugging and we need -current
to be stable, back out everything to the state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into an own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing length checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from our (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches for McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a files owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.303 23-Aug-2020 kn

Remove unused debug_syncprt, improve debug sysctl handling

"syncprt" is unused since kern/vfs_syscalls.c r1.147 from 2008.

Adding new debug sysctls is a bit opaque and looking at kern/kern_sysctl.c
the only visible difference between used and stub ctldebug structs in the
debugvars[] array is their extern keyword, indicating that it is defined
elsewhere.

sys/sysctl.h declares all debugN members as extern upfront, but these
declarations are not needed.

Remove the unused debug sysctl, rename the only remaining one to something
meaningful and remove forward declarations from /sys/sysctl.h; this way,
adding new debug sysctls is a matter of adding extern and coming up with a
name, which is nicer to read on its own and better to grep for.

OK mpi


# 1.302 22-Aug-2020 kn

Move sysctl(2) CTL_DEBUG from DEBUG to new DEBUG_SYSCTL

Adding "debug.my-knob" sysctls is really helpful to select different
code paths and/or log on demand during runtime without recompile,
but as this code is under DEBUG, lots of other noise comes with it
which is often undesired, at least when looking at specific subsystems
only.

Adding globals to the kernel and breaking into DDB to change them helps,
but that does not work over SSH, hence the need for debug sysctls.

Introduces DEBUG_SYSCTL to make use of the "debug" MIB without the rest of
DEBUG; it's DEBUG_SYSCTL and not SYSCTL_DEBUG because it's not a general
option for all of sysctl(2).

OK gnezdo


Revision tags: OPENBSD_6_7_BASE
# 1.301 27-Mar-2020 anton

Relax the lockcount assertion in vputonfreelist(). Back when I fixed
several problems with the vnode exclusive lock implementation, I
overlooked the fact that a vnode can be in a state where the usecount is
zero while the holdcount is still positive. There could still be
threads waiting on the vnode lock in uvn_io() as long as the holdcount
is positive.

"go ahead" mpi@

Reported-by: syzbot+767d6deb1a647850a0ca@syzkaller.appspotmail.com


# 1.300 13-Feb-2020 claudio

Move the LK_DRAIN logic from VOP_LOCK() to vclean() the only caller of
VOP_LOCK with LK_DRAIN. This simplifies VOP_LOCK() a fair bit.
OK visa@


# 1.299 20-Jan-2020 claudio

struct vops is not modified during runtime so use const which moves each
into read-only data segment.
OK deraadt@ tedu@


# 1.298 10-Jan-2020 bluhm

Convert the vnode list at the mount point into a tailq. During
unmount this list is traversed and the dirty vnodes are flushed to
disk. Forced unmount expects that the list is empty after flushing,
otherwise the kernel panics with "dangling vnode". As the write
to disk can sleep, new vnodes may be inserted. If softdep is
enabled, resolving the dependencies creates new dirty vnodes and
inserts them to the list. To fix the panic, let insmntque() insert
new vnodes at the tail of the list. Then vflush() will still catch
them while traversing the list in forward direction.
OK tedu@ millert@ visa@
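
A minimal user-space sketch of why tail insertion matters here, using the
<sys/queue.h> TAILQ macros. The `node` type, `flush_all()` and the
`spawn_on` trigger are hypothetical stand-ins for vnodes, vflush() and a
dependency that appears while a write sleeps. A forward traversal visits an
entry appended at the tail during the walk, but misses one inserted at the
head:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

struct node {
	int			id;
	int			dirty;
	TAILQ_ENTRY(node)	link;
};
TAILQ_HEAD(nodelist, node);

/*
 * Flush every dirty node in forward order.  While node `spawn_on` is
 * being handled, `extra` is queued on the list, either at the tail or
 * at the head.  Returns the number of nodes flushed in this pass.
 */
static int
flush_all(struct nodelist *list, int spawn_on, struct node *extra,
    int at_tail)
{
	struct node *n;
	int flushed = 0;

	assert(list != NULL);
	TAILQ_FOREACH(n, list, link) {
		if (n->id == spawn_on && extra != NULL) {
			if (at_tail)
				TAILQ_INSERT_TAIL(list, extra, link);
			else
				TAILQ_INSERT_HEAD(list, extra, link);
			extra = NULL;	/* insert only once */
		}
		if (n->dirty) {
			n->dirty = 0;
			flushed++;
		}
	}
	return flushed;
}
```

With head insertion the late node stays dirty after the pass, which is the
user-space analogue of the "dangling vnode" situation the commit describes.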


# 1.297 30-Dec-2019 bluhm

In vcount() a safe loop over vnodes was committed to 4.4BSD in 1994.
This is not necessary as the loop is restarted after vgone(). Switch
to SLIST_FOREACH without _SAFE.
OK visa@


# 1.296 27-Dec-2019 bluhm

Convert the speclisth hash buckets into SLIST macros. This makes
the vnode alias code more readable.
OK visa@


# 1.295 26-Dec-2019 bluhm

Fix white spaces.


# 1.294 08-Dec-2019 mpi

Convert infinite sleeps to tsleep_nsec(9).

ok visa@, jca@


Revision tags: OPENBSD_6_6_BASE
# 1.293 26-Aug-2019 anton

When a thread tries to exclusively lock a vnode, the same thread must
ensure that any other thread currently trying to acquire the underlying
vnode lock has observed that the same vnode is about to be exclusively
locked. Such threads must then sleep until the exclusive lock has been
released and then try to acquire the lock again. Otherwise, exclusive
access to the vnode cannot be guaranteed.

Thanks to naddy@ and visa@ for testing; ok visa@

Reported-by: syzbot+374d0e7e2400004957f7@syzkaller.appspotmail.com


# 1.292 25-Jul-2019 cheloha

vinvalbuf(9): tsleep(9) -> tsleep_nsec(9); ok millert@


# 1.291 19-Jul-2019 cheloha

vwaitforio(9): tsleep(9) -> tsleep_nsec(9); ok visa@


# 1.290 28-Jun-2019 visa

Skip VFS barrier lock during normal operation to reduce overhead.
This removes a system-wide serialization point, which might help
finding timing-related bugs.

OK deraadt@ anton@


# 1.289 09-Jun-2019 beck

Add a temporary workaround to make removal of giant files better

mlarkin@ noticed we would freeze while removing enormous files because
of the amount of work done to invalidate buffers on unlink. This adds
a temporary workaround to ensure we give up the lock and yield while
doing this.

The longer term answer will be to move these buffers to another list
and not do the work here.

ok deraadt@


# 1.288 19-Apr-2019 visa

Add a subsystem lock for vfs_lockf.c. This enables calling lf_advlock()
and lf_purgelocks() without the kernel lock.

OK anton@ mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.287 02-Apr-2019 visa

Restrict which filesystems are available for swap. This rules out
obvious misconfigurations that cannot work.

OK mpi@ tedu@


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa
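
A minimal sketch of the sticky-error idea, assuming a hypothetical
`fake_vnode` with a boolean flag; the names below are illustrative and are
not the kernel's. A failed write discards its buffer but leaves a mark, so
a later fsync cannot report success:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a vnode. */
struct fake_vnode {
	bool	write_error;	/* sticky: some past write has failed */
	int	dirty_bufs;	/* buffers still waiting to be written */
};

/* Complete one buffer write; the buffer is dropped either way. */
static void
fake_write_done(struct fake_vnode *vp, int error)
{
	assert(vp->dirty_bufs > 0);
	vp->dirty_bufs--;
	if (error != 0)
		vp->write_error = true;	/* remember the failure */
}

/* Return EIO if any earlier write failed, 0 otherwise. */
static int
fake_fsync(const struct fake_vnode *vp)
{
	return vp->write_error ? EIO : 0;
}
```

Without the flag, `fake_fsync()` would see zero dirty buffers after the
failed write and return success, which is the lost-error case described
above.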


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@
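
A minimal sketch of an atomically updated unsigned reference count. It uses
C11 <stdatomic.h> as a portable stand-in; the kernel has its own atomic
primitives, and `fs_type` with its helpers is hypothetical:

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a filesystem type entry. */
struct fs_type {
	atomic_uint	refcount;	/* number of active users */
};

/* Take a reference; safe against concurrent callers. */
static void
fs_type_ref(struct fs_type *t)
{
	atomic_fetch_add(&t->refcount, 1);
}

/* Drop a reference and return the count that remains. */
static unsigned int
fs_type_unref(struct fs_type *t)
{
	unsigned int old = atomic_fetch_sub(&t->refcount, 1);

	assert(old > 0);	/* an unsigned count must never underflow */
	return old - 1;
}
```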


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules are no longer, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting files systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.
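
A minimal user-space sketch of the same precaution, with calloc() standing
in for a zeroing kernel allocation. `ex_conf` and `ex_conf_export()` are
hypothetical. Zeroing the whole record first means that any bytes the
fill-in code does not touch hold zeroes rather than stale memory when the
record is copied out:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record that will be copied out to another domain. */
struct ex_conf {
	char	name[8];
	int	typenum;
};

/* Build a record for export, starting from all-zero memory. */
static struct ex_conf *
ex_conf_export(const char *name, int typenum)
{
	struct ex_conf *c;
	size_t len;

	assert(name != NULL);
	c = calloc(1, sizeof(*c));
	if (c == NULL)
		return NULL;
	len = strlen(name);
	if (len > sizeof(c->name) - 1)
		len = sizeof(c->name) - 1;
	memcpy(c->name, name, len);	/* the tail stays zero from calloc() */
	c->typenum = typenum;
	return c;
}
```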


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
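
A minimal sketch of the pattern, assuming a hypothetical `kind` enum and
name table. Casting to an unsigned type makes a single comparison reject
both out-of-range and negative values, and avoids the compiler concluding
that the check is always true for the enum's declared range:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical enum and lookup table. */
enum kind { K_NONE, K_REG, K_DIR, K_COUNT };
static const char *const kind_names[] = { "none", "reg", "dir" };

static_assert(sizeof(kind_names) / sizeof(kind_names[0]) == K_COUNT,
    "table must cover every enumerator");

/* Look up a name, treating any value outside the table as unknown. */
static const char *
kind_name(enum kind k)
{
	if ((unsigned int)k >= sizeof(kind_names) / sizeof(kind_names[0]))
		return "unknown";
	return kind_names[k];
}
```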


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb
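
A minimal user-space sketch of a separately allocated 1024-bit clone
bitmap. Only the 1024 limit comes from the message above; the helper names
and the lowest-free-bit policy are assumptions made for illustration:

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define CLONE_MAX	1024	/* limit named in the commit message */
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define CLONE_WORDS	((CLONE_MAX + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Allocate a zeroed bitmap, kept outside the (hypothetical) device node. */
static unsigned long *
clone_map_alloc(void)
{
	return calloc(CLONE_WORDS, sizeof(unsigned long));
}

/* Claim the lowest free instance number, or return -1 if all are taken. */
static int
clone_get(unsigned long *map)
{
	for (int i = 0; i < CLONE_MAX; i++) {
		unsigned long bit = 1UL << (i % BITS_PER_WORD);

		if ((map[i / BITS_PER_WORD] & bit) == 0) {
			map[i / BITS_PER_WORD] |= bit;
			return i;
		}
	}
	return -1;
}

/* Release an instance number so it can be handed out again. */
static void
clone_put(unsigned long *map, int i)
{
	assert(i >= 0 && i < CLONE_MAX);
	map[i / BITS_PER_WORD] &= ~(1UL << (i % BITS_PER_WORD));
}
```

Keeping only a pointer in each node means nodes that never clone pay for
one pointer instead of 128 bytes of bitmap.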


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
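
A minimal sketch of the fragility described above, assuming a hypothetical
two-field struct and a -1 "no value" marker. In a chained assignment the
value stored on the left has already been converted to the type of the
operand on its right, so an unsigned narrow field in the middle changes
what the wide field receives:

```c
#include <assert.h>
#include <stdint.h>

#define NOVAL	(-1)	/* stand-in for a "no value" marker */

struct attrs {
	int64_t		wide;	/* signed 64-bit field */
	uint16_t	narrow;	/* unsigned 16-bit field */
};

static_assert(sizeof(uint16_t) < sizeof(int64_t),
    "the example needs a narrower middle operand");

/* Chained: `wide` receives the value after it passed through `narrow`. */
static void
attrs_null_chained(struct attrs *a)
{
	a->wide = a->narrow = NOVAL;
}

/* Separate statements: each field is converted from NOVAL directly. */
static void
attrs_null_separate(struct attrs *a)
{
	a->narrow = NOVAL;
	a->wide = NOVAL;
}
```

The chained form leaves `wide` at 65535, while the separate form leaves it
at -1, which is why one statement per field is the more robust choice.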


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in bytes, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and can die.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@
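
A minimal sketch of the range check, assuming a hypothetical credential
struct and an illustrative limit of 16 in place of the system's
NGROUPS_MAX. The count is validated before it is ever used as a loop bound
or copy length:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define EX_NGROUPS_MAX	16	/* illustrative; the kernel uses NGROUPS_MAX */

/* Hypothetical credential record supplied by an untrusted caller. */
struct ex_cred {
	int		ngroups;
	unsigned int	groups[EX_NGROUPS_MAX];
};

/* Return EINVAL for a group count outside [0, EX_NGROUPS_MAX], else 0. */
static int
ex_cred_check(const struct ex_cred *c)
{
	assert(c != NULL);
	if (c->ngroups < 0 || c->ngroups > EX_NGROUPS_MAX)
		return EINVAL;
	return 0;
}
```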


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert
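
A conversion like this has to swap the first two arguments, since
bcopy() takes (src, dst, len) and memcpy() takes (dst, src, len). A
minimal userland sketch, with copy_name() as a hypothetical helper that
is not taken from the kernel source:

```c
#include <string.h>

/*
 * Copy len bytes from src to dst.
 * Old form: bcopy(src, dst, len);
 * New form: memcpy(dst, src, len);
 * memcpy() requires that the two regions do not overlap.
 */
static void
copy_name(char *dst, const char *src, size_t len)
{
	memcpy(dst, src, len);
}
```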


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free
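
The idiom can be illustrated with the userland allocator. This is a
sketch only: struct item and item_alloc() are hypothetical, and the
kernel malloc(9) takes additional type and flag arguments not shown
here.

```c
#include <stdlib.h>

/* Hypothetical structure, for illustration only. */
struct item {
	int	id;
	long	weight;
};

/*
 * sizeof(*ip) follows the pointer's type, so the allocation size stays
 * correct even if the declaration of ip later changes to another type.
 */
static struct item *
item_alloc(int id)
{
	struct item *ip;

	ip = malloc(sizeof(*ip));
	if (ip == NULL)
		return (NULL);
	ip->id = id;
	ip->weight = 0;
	return (ip);
}
```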


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too
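
The width of time_t varies between platforms, so printing it portably
means casting to a fixed type that matches the format specifier. A
small userland sketch, with format_time() as a hypothetical helper:

```c
#include <stdio.h>
#include <time.h>

/*
 * Format a time_t as a decimal number. The cast to long long matches
 * the %lld specifier regardless of how wide time_t is on this platform.
 */
static int
format_time(char *buf, size_t len, time_t t)
{
	return (snprintf(buf, len, "%lld", (long long)t));
}
```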


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continued performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__)))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@
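
The check described above can be sketched as follows. The structure and
function names are hypothetical simplifications, not the kernel's
struct ucred or groupmember(), and the array size is arbitrary.

```c
#include <sys/types.h>

/* Hypothetical, simplified credential structure. */
struct sketch_cred {
	gid_t	cr_gid;		/* effective gid */
	int	cr_ngroups;	/* number of supplementary groups */
	gid_t	cr_groups[16];	/* supplementary groups */
};

/*
 * Return 1 if gid is the effective gid or one of the supplementary
 * groups, 0 otherwise.
 */
static int
sketch_groupmember(gid_t gid, const struct sketch_cred *cr)
{
	int i;

	if (cr->cr_gid == gid)
		return (1);
	for (i = 0; i < cr->cr_ngroups; i++)
		if (cr->cr_groups[i] == gid)
			return (1);
	return (0);
}
```

Testing the effective gid first covers the case where it is not also
listed among the supplementary groups.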


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8 code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow to pull out a usb stick with ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into an own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing length checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from ours (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches for McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a file's owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.301 27-Mar-2020 anton

Relax the lockcount assertion in vputonfreelist(). Back when I fixed
several problems with the vnode exclusive lock implementation, I
overlooked the fact that a vnode can be in a state where the usecount is
zero while the holdcount is still positive. There could still be
threads waiting on the vnode lock in uvn_io() as long as the holdcount
is positive.

"go ahead" mpi@

Reported-by: syzbot+767d6deb1a647850a0ca@syzkaller.appspotmail.com


# 1.300 13-Feb-2020 claudio

Move the LK_DRAIN logic from VOP_LOCK() to vclean() the only caller of
VOP_LOCK with LK_DRAIN. This simplifies VOP_LOCK() a fair bit.
OK visa@


# 1.299 20-Jan-2020 claudio

struct vops is not modified during runtime so use const which moves each
into read-only data segment.
OK deraadt@ tedu@


# 1.298 10-Jan-2020 bluhm

Convert the vnode list at the mount point into a tailq. During
unmount this list is traversed and the dirty vnodes are flushed to
disk. Forced unmount expects that the list is empty after flushing,
otherwise the kernel panics with "dangling vnode". As the write
to disk can sleep, new vnodes may be inserted. If softdep is
enabled, resolving the dependencies creates new dirty vnodes and
inserts them to the list. To fix the panic, let insmntque() insert
new vnodes at the tail of the list. Then vflush() will still catch
them while traversing the list in forward direction.
OK tedu@ millert@ visa@
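
The effect of tail insertion on a forward walk can be sketched in userland
with the <sys/queue.h> TAILQ macros. This is only an illustration, not the
kernel code: struct node, flush_all() and demo_flush() are hypothetical
names, and the "new dirty vnode" is simulated by appending one extra node
while the walk is in progress.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a vnode linked on a mount point's list. */
struct node {
	int id;
	int dirty;
	TAILQ_ENTRY(node) link;
};
TAILQ_HEAD(nodelist, node);

/*
 * Flush every dirty node in forward order. Flushing the node whose id
 * is trigger_id appends `extra` to the tail, standing in for a
 * dependency that creates a new dirty vnode while the flush sleeps.
 * Returns the number of nodes flushed.
 */
static int
flush_all(struct nodelist *list, int trigger_id, struct node *extra)
{
	struct node *np;
	int flushed = 0;

	assert(list != NULL);
	TAILQ_FOREACH(np, list, link) {
		if (!np->dirty)
			continue;
		np->dirty = 0;
		flushed++;
		if (np->id == trigger_id && extra != NULL) {
			/* Tail insertion lands ahead of the forward walk. */
			TAILQ_INSERT_TAIL(list, extra, link);
			extra = NULL;
		}
	}
	return flushed;
}

/* Two nodes are queued up front; a third appears during the flush. */
static int
demo_flush(void)
{
	struct nodelist list = TAILQ_HEAD_INITIALIZER(list);
	struct node a = { .id = 1, .dirty = 1 };
	struct node b = { .id = 2, .dirty = 1 };
	struct node late = { .id = 3, .dirty = 1 };

	TAILQ_INSERT_TAIL(&list, &a, link);
	TAILQ_INSERT_TAIL(&list, &b, link);
	return flush_all(&list, 2, &late);
}
```

In this sketch demo_flush() counts three flushed nodes, because the late
node is appended behind the current position. Had it been inserted at the
head, the same walk would have finished with it still dirty.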


# 1.297 30-Dec-2019 bluhm

In vcount() a safe loop over vnodes was committed to 4.4BSD in 1994.
This is not necessary as the loop is restarted after vgone(). Switch
to SLIST_FOREACH without _SAFE.
OK visa@


# 1.296 27-Dec-2019 bluhm

Convert the speclisth hash buckets into SLIST macros. This makes
the vnode alias code more readable.
OK visa@


# 1.295 26-Dec-2019 bluhm

Fix white spaces.


# 1.294 08-Dec-2019 mpi

Convert infinite sleeps to tsleep_nsec(9).

ok visa@, jca@


Revision tags: OPENBSD_6_6_BASE
# 1.293 26-Aug-2019 anton

When a thread tries to exclusively lock a vnode, the same thread must
ensure that any other thread currently trying to acquire the underlying
vnode lock has observed that the same vnode is about to be exclusively
locked. Such threads must then sleep until the exclusive lock has been
released and then try to acquire the lock again. Otherwise, exclusive
access to the vnode cannot be guaranteed.

Thanks to naddy@ and visa@ for testing; ok visa@

Reported-by: syzbot+374d0e7e2400004957f7@syzkaller.appspotmail.com


# 1.292 25-Jul-2019 cheloha

vinvalbuf(9): tsleep(9) -> tsleep_nsec(9); ok millert@


# 1.291 19-Jul-2019 cheloha

vwaitforio(9): tsleep(9) -> tsleep_nsec(9); ok visa@


# 1.290 28-Jun-2019 visa

Skip VFS barrier lock during normal operation to reduce overhead.
This removes a system-wide serialization point, which might help
finding timing-related bugs.

OK deraadt@ anton@


# 1.289 09-Jun-2019 beck

Add a temporary workaround to make removal of giant files better

mlarkin@ noticed we would freeze while removing enormous files because
of the amount of work done to invalidate buffers on unlink. This adds
a temporary workaround to ensure we give up the lock and yield while
doing this.

The longer term answer will be to move these buffers to another list
and not do the work here.

ok deraadt@


# 1.288 19-Apr-2019 visa

Add a subsystem lock for vfs_lockf.c. This enables calling lf_advlock()
and lf_purgelocks() without the kernel lock.

OK anton@ mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.287 02-Apr-2019 visa

Restrict which filesystems are available for swap. This rules out
obvious misconfigurations that cannot work.

OK mpi@ tedu@


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa
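
A minimal sketch of the idea, assuming a per-vnode flag word. The names
VD_WRITEERR, struct vnode_demo, write_failed() and demo_fsync() are
hypothetical and do not mirror the kernel's identifiers. Whether and when
the real flag is cleared is not stated in the message, so the sketch simply
keeps it set.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define VD_WRITEERR	0x01	/* hypothetical flag name */

/* Hypothetical stand-in for the per-vnode state. */
struct vnode_demo {
	unsigned int flags;
};

/* Called when a write fails and its buffer is thrown away. */
static void
write_failed(struct vnode_demo *vp)
{
	assert(vp != NULL);
	vp->flags |= VD_WRITEERR;	/* remember the lost write */
}

/* Reports EIO if an earlier write was lost, even with nothing left to flush. */
static int
demo_fsync(struct vnode_demo *vp)
{
	assert(vp != NULL);
	if (vp->flags & VD_WRITEERR)
		return EIO;
	return 0;
}

/* A write fails first, then fsync is asked about the file. */
static int
fsync_after_lost_write(void)
{
	struct vnode_demo vn = { 0 };

	write_failed(&vn);
	return demo_fsync(&vn);
}

/* No write ever failed on this vnode. */
static int
fsync_clean(void)
{
	struct vnode_demo vn = { 0 };

	return demo_fsync(&vn);
}
```

Without the flag, demo_fsync() would have no record of the discarded
buffer and would report success.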


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@
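
A userland sketch of an atomically updated unsigned reference count. It
uses C11 <stdatomic.h> as a stand-in rather than the kernel's own atomic
primitives, and struct fsconf_demo, fsconf_ref() and fsconf_unref() are
hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a filesystem configuration entry. */
struct fsconf_demo {
	atomic_uint refcount;
};

/* Takes one reference. */
static void
fsconf_ref(struct fsconf_demo *c)
{
	atomic_fetch_add(&c->refcount, 1);
}

/* Drops one reference and returns the remaining count. */
static unsigned int
fsconf_unref(struct fsconf_demo *c)
{
	unsigned int old = atomic_fetch_sub(&c->refcount, 1);

	assert(old > 0);	/* an unsigned count must not wrap below zero */
	return old - 1;
}

/* Two references taken, one dropped. */
static unsigned int
demo_refcount(void)
{
	struct fsconf_demo c;

	atomic_init(&c.refcount, 0);
	fsconf_ref(&c);
	fsconf_ref(&c);
	return fsconf_unref(&c);
}
```

Each update is a single read-modify-write, so concurrent callers cannot
lose an increment the way a plain `refcount++` could.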


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules are no longer, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@
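
The hazard can be reproduced in a single-threaded userland sketch. The
names struct mnt, visit() and stale_visits() are hypothetical, the "other
process" is simulated by removing an entry from inside the loop body, and
MNT_FOREACH_SAFE is a local copy of the _SAFE idiom because not every
<sys/queue.h> provides TAILQ_FOREACH_SAFE. It also assumes a TAILQ_REMOVE
that leaves the removed entry's own link pointers intact.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a mount point on the mount list. */
struct mnt {
	int id;
	int removed;
	TAILQ_ENTRY(mnt) link;
};
TAILQ_HEAD(mntlist, mnt);

/* Local copy of the _SAFE idiom: the next pointer is cached up front. */
#define MNT_FOREACH_SAFE(var, head, tmp)				\
	for ((var) = TAILQ_FIRST(head);					\
	    (var) != NULL && ((tmp) = TAILQ_NEXT(var, link), 1);	\
	    (var) = (tmp))

/* Loop body: while "sleeping" on entry 1, the victim entry is unmounted. */
static int
visit(struct mntlist *list, struct mnt *mp, struct mnt *victim)
{
	int stale = mp->removed;

	if (mp->id == 1) {
		TAILQ_REMOVE(list, victim, link);
		victim->removed = 1;
	}
	return stale;
}

/* Returns how many already-removed entries the walk still touched. */
static int
stale_visits(int use_safe)
{
	struct mntlist list = TAILQ_HEAD_INITIALIZER(list);
	struct mnt m[3] = { { .id = 1 }, { .id = 2 }, { .id = 3 } };
	struct mnt *mp, *nmp;
	int i, stale = 0;

	for (i = 0; i < 3; i++)
		TAILQ_INSERT_TAIL(&list, &m[i], link);

	if (use_safe) {
		/* nmp is cached before the body runs, so it can go stale. */
		MNT_FOREACH_SAFE(mp, &list, nmp)
			stale += visit(&list, mp, &m[1]);
	} else {
		/* The next pointer is read only after the body returns. */
		TAILQ_FOREACH(mp, &list, link)
			stale += visit(&list, mp, &m[1]);
	}
	assert(stale <= 1);
	return stale;
}
```

The _SAFE walk steps onto the entry that was removed behind its back,
while the plain walk never sees it. In the kernel that entry could already
have been freed, which turns the stale visit into a use-after-free.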


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
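
A small sketch of the cast-before-compare idiom, assuming a hypothetical
enum vtype_demo and name table rather than the kernel's real types.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical enum; VD_COUNT is the number of valid values. */
enum vtype_demo { VD_NON, VD_REG, VD_DIR, VD_COUNT };

static const char *const vtype_names[VD_COUNT] = { "none", "reg", "dir" };

/* Maps a type to its name, rejecting anything outside the table. */
static const char *
vtype_name(enum vtype_demo t)
{
	/*
	 * The cast makes one unsigned comparison cover both negative and
	 * too-large values, whatever integer type the compiler picked
	 * for the enum.
	 */
	if ((unsigned int)t >= VD_COUNT)
		return "unknown";
	assert(vtype_names[t] != NULL);
	return vtype_names[t];
}
```

A two-sided test such as `t < 0 || t >= VD_COUNT` is the kind of check a
compiler may flag when it has given the enum an unsigned type, since half
of the comparison is then constant.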


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
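
Why the order of a chained assignment matters can be shown with two
fields of different widths. The struct and DEMO_NOVAL below are
hypothetical stand-ins, not the real struct vattr or VNOVAL.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NOVAL	(-1)	/* plays the role of VNOVAL */

/* Hypothetical attribute struct with fields of different widths. */
struct attr_demo {
	uint64_t wide;
	uint16_t narrow;
};

/* Chained: the value passes through the 16-bit field before reaching wide. */
static uint64_t
chained_wide(void)
{
	struct attr_demo a;

	a.wide = a.narrow = DEMO_NOVAL;
	assert(a.narrow == UINT16_MAX);
	return a.wide;
}

/* Separate statements: each field converts the constant on its own. */
static uint64_t
separate_wide(void)
{
	struct attr_demo a;

	a.narrow = DEMO_NOVAL;
	a.wide = DEMO_NOVAL;
	assert(a.narrow == UINT16_MAX);
	return a.wide;
}
```

In the chained form the wide field ends up as 0xffff instead of all ones,
because the value of an assignment expression is the value stored in its
left operand. Writing one assignment per field removes the dependency on
field order.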


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in byte, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and dies.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@
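
As a userland sketch of why the _SAFE iterators matter (the names
mount_sketch, mntlist_sketch and drop_even are hypothetical and not taken
from vfs_subr.c): the safe variant saves the next pointer before the loop
body runs, so the current element can be unlinked and freed mid-walk. The
fallback macro is only an assumption for libcs whose <sys/queue.h> lacks
the _SAFE variants.

```c
#include <stdlib.h>
#include <sys/queue.h>

/* Fallback for libcs without the _SAFE variants; follows the BSD form. */
#ifndef TAILQ_FOREACH_SAFE
#define TAILQ_FOREACH_SAFE(var, head, field, tvar)			\
	for ((var) = TAILQ_FIRST(head);					\
	    (var) != NULL && ((tvar) = TAILQ_NEXT(var, field), 1);	\
	    (var) = (tvar))
#endif

/* Hypothetical stand-in for struct mount. */
struct mount_sketch {
	int	id;
	TAILQ_ENTRY(mount_sketch) entries;
};
TAILQ_HEAD(mntlist_sketch, mount_sketch);

/* Remove and free every element with an even id; return how many went. */
static int
drop_even(struct mntlist_sketch *list)
{
	struct mount_sketch *mp, *nmp;
	int removed = 0;

	/* nmp is fetched before the body, so freeing mp is harmless. */
	TAILQ_FOREACH_SAFE(mp, list, entries, nmp) {
		if (mp->id % 2 == 0) {
			TAILQ_REMOVE(list, mp, entries);
			free(mp);
			removed++;
		}
	}
	return removed;
}
```

With plain TAILQ_FOREACH the same loop would read mp's link field after
free(mp), which is the kind of bug the commit is being paranoid about.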


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too
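
A minimal userland sketch of the cast described above (format_time is a
hypothetical helper, not a function from vfs_subr.c): time_t has no
portable printf conversion, so the value is cast to long long and printed
with %lld regardless of how wide time_t is on the platform.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: format a time_t portably, whatever its width. */
static int
format_time(char *buf, size_t len, time_t t)
{
	/* The cast makes the argument match %lld on every platform. */
	return snprintf(buf, len, "%lld", (long long)t);
}
```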


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes, and it is now suspect in a continuing performance issue with NFS,
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__((__format__(__kprintf__))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@
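
The check described in 1.199 can be sketched in userland as follows. This
is a simplified model under stated assumptions: cred_sketch and
groupmember_sketch are hypothetical names, and the struct is a reduced
stand-in rather than the kernel's struct ucred.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, simplified stand-in for the kernel credential struct. */
struct cred_sketch {
	gid_t	egid;		/* effective group id */
	size_t	ngroups;	/* number of supplementary groups in use */
	gid_t	groups[16];	/* supplementary group ids */
};

/* Return 1 if gid is the effective gid or a supplementary group, else 0. */
static int
groupmember_sketch(gid_t gid, const struct cred_sketch *cred)
{
	size_t i;

	/* The effective gid counts even if it is not listed in groups[]. */
	if (cred->egid == gid)
		return 1;
	for (i = 0; i < cred->ngroups; i++)
		if (cred->groups[i] == gid)
			return 1;
	return 0;
}
```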


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@
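
For reference, a small sketch of what an nitems()-style macro does. The
array vtypes_sketch and the function vtypes_count are hypothetical, and
the #ifndef guard is only there in case the system headers already
provide the macro.

```c
#include <stddef.h>

/* Element count of a true array (not a pointer). */
#ifndef nitems
#define nitems(_a)	(sizeof((_a)) / sizeof((_a)[0]))
#endif

/* Hypothetical table; the count is derived rather than hard coded. */
static const char *const vtypes_sketch[] = { "VNON", "VREG", "VDIR" };

static size_t
vtypes_count(void)
{
	return nitems(vtypes_sketch);
}
```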


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell, never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.
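
A minimal sketch of the "struct full of function pointers" idea, under
stated assumptions: every name here (vops_sketch, vnode_sketch, demo_open,
vop_open_sketch and so on) is hypothetical, and the table is reduced to
two operations. It illustrates the dispatch style only, not the actual
OpenBSD definitions.

```c
#include <errno.h>
#include <stddef.h>

struct vnode_sketch;

/* Hypothetical, heavily reduced operations table: one pointer per op. */
struct vops_sketch {
	int	(*vop_open)(struct vnode_sketch *);
	int	(*vop_close)(struct vnode_sketch *);
};

struct vnode_sketch {
	const struct vops_sketch *v_op;	/* set directly by the filesystem */
	int	v_opencount;
};

static int
demo_open(struct vnode_sketch *vp)
{
	vp->v_opencount++;
	return 0;
}

static int
demo_close(struct vnode_sketch *vp)
{
	vp->v_opencount--;
	return 0;
}

/* Each filesystem fills in its own table; unset entries stay NULL. */
static const struct vops_sketch demo_vops = {
	.vop_open = demo_open,
	.vop_close = demo_close,
};

/* Dispatch is a plain indirect call through the table. */
static int
vop_open_sketch(struct vnode_sketch *vp)
{
	if (vp->v_op->vop_open == NULL)
		return EOPNOTSUPP;
	return vp->v_op->vop_open(vp);
}
```

The trade-off named in the commit shows up here as well: with no
fall-through mechanism, a filesystem that wants the default behaviour for
an operation has to point its table entry at the shared routine itself.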


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in filesystems ported from freebsd. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

Back out the change that made buf_acquire() do the bremfree(), made since
all callers were doing bremfree() before calling buf_acquire().

This is causing us headaches pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro
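
The pattern behind 1.174 can be sketched in userland as follows. All names
(proc_sketch, release_sketch, checkdirs_sketch) are hypothetical, and the
slept_on_release field merely simulates whether the release call might
have slept. The list walk restarts only when the release reports that it
may have slept, since the list could have changed in the meantime.

```c
#include <stddef.h>

/* Hypothetical list node standing in for an allproc entry. */
struct proc_sketch {
	struct proc_sketch *next;
	int	needs_release;		/* holds a reference to drop */
	int	slept_on_release;	/* simulated: would the drop sleep? */
};

/* Drop the reference; return 0 only if we are sure we did not sleep. */
static int
release_sketch(struct proc_sketch *p)
{
	p->needs_release = 0;
	return p->slept_on_release;
}

/* Walk the list; restart from the head only after a possible sleep. */
static int
checkdirs_sketch(struct proc_sketch *head)
{
	struct proc_sketch *p;
	int restarts = 0;

again:
	for (p = head; p != NULL; p = p->next) {
		if (!p->needs_release)
			continue;
		if (release_sketch(p)) {
			/* The list may have changed while we slept. */
			restarts++;
			goto again;
		}
	}
	return restarts;
}
```

Because needs_release is cleared before any restart, each restart makes
progress and the walk terminates.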


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8; the code is much older.


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard-coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow pulling out a usb stick with an ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into an own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing lengths checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from ours (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches for McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a file's owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.300 13-Feb-2020 claudio

Move the LK_DRAIN logic from VOP_LOCK() to vclean() the only caller of
VOP_LOCK with LK_DRAIN. This simplifies VOP_LOCK() a fair bit.
OK visa@


# 1.299 20-Jan-2020 claudio

struct vops is not modified during runtime so use const which moves each
into read-only data segment.
OK deraadt@ tedu@


# 1.298 10-Jan-2020 bluhm

Convert the vnode list at the mount point into a tailq. During
unmount this list is traversed and the dirty vnodes are flushed to
disk. Forced unmount expects that the list is empty after flushing,
otherwise the kernel panics with "dangling vnode". As the write
to disk can sleep, new vnodes may be inserted. If softdep is
enabled, resolving the dependencies creates new dirty vnodes and
inserts them to the list. To fix the panic, let insmntque() insert
new vnodes at the tail of the list. Then vflush() will still catch
them while traversing the list in forward direction.
OK tedu@ millert@ visa@


# 1.297 30-Dec-2019 bluhm

In vcount() a safe loop over vnodes was committed to 4.4BSD in 1994.
This is not necessary as the loop is restarted after vgone(). Switch
to SLIST_FOREACH without _SAFE.
OK visa@


# 1.296 27-Dec-2019 bluhm

Convert the speclisth hash buckets into SLIST macros. This makes
the vnode alias code more readable.
OK visa@


# 1.295 26-Dec-2019 bluhm

Fix white spaces.


# 1.294 08-Dec-2019 mpi

Convert infinite sleeps to tsleep_nsec(9).

ok visa@, jca@


Revision tags: OPENBSD_6_6_BASE
# 1.293 26-Aug-2019 anton

When a thread tries to exclusively lock a vnode, the same thread must
ensure that any other thread currently trying to acquire the underlying
vnode lock has observed that the same vnode is about to be exclusively
locked. Such threads must then sleep until the exclusive lock has been
released and then try to acquire the lock again. Otherwise, exclusive
access to the vnode cannot be guaranteed.

Thanks to naddy@ and visa@ for testing; ok visa@

Reported-by: syzbot+374d0e7e2400004957f7@syzkaller.appspotmail.com


# 1.292 25-Jul-2019 cheloha

vinvalbuf(9): tsleep -> tsleep_nsec(9); ok millert@


# 1.291 19-Jul-2019 cheloha

vwaitforio(9): tsleep(9) -> tsleep_nsec(9); ok visa@


# 1.290 28-Jun-2019 visa

Skip VFS barrier lock during normal operation to reduce overhead.
This removes a system-wide serialization point, which might help
finding timing-related bugs.

OK deraadt@ anton@


# 1.289 09-Jun-2019 beck

Add a temporary workaround to make removal of giant files better

mlarkin@ noticed we would freeze while removing enormous files because
of the amount of work done to invalidate buffers on unlink. This adds
a temporary workaround to ensure we give up the lock and yield while
doing this.

The longer term answer will be to move these buffers to another list
and not do the work here.

ok deraadt@


# 1.288 19-Apr-2019 visa

Add a subsystem lock for vfs_lockf.c. This enables calling lf_advlock()
and lf_purgelocks() without the kernel lock.

OK anton@ mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.287 02-Apr-2019 visa

Restrict which filesystems are available for swap. This rules out
obvious misconfigurations that cannot work.

OK mpi@ tedu@


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules are no longer, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting files systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in byte, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and can die.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continued performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.
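
A "simple struct full of function pointers" might look roughly like the sketch
below. All names here (vops_sketch, vnode_sketch, examplefs_*) are hypothetical
and cut down to two operations; the real table and its error handling may well
differ.

```c
#include <errno.h>
#include <stddef.h>

struct vnode_sketch;

/* Hypothetical, cut-down operations table: one function pointer per operation. */
struct vops_sketch {
	int	(*vop_open)(struct vnode_sketch *);
	int	(*vop_close)(struct vnode_sketch *);
};

struct vnode_sketch {
	const struct vops_sketch *v_op;	/* set directly by the filesystem */
	int	v_opencount;
};

static int
examplefs_open(struct vnode_sketch *vp)
{
	vp->v_opencount++;
	return 0;
}

/* A filesystem fills in only what it implements; vop_close stays NULL here. */
static const struct vops_sketch examplefs_vops = {
	.vop_open = examplefs_open,
};

/* Wrappers dispatch through the table and reject missing operations. */
static int
vop_open_sketch(struct vnode_sketch *vp)
{
	if (vp->v_op->vop_open == NULL)
		return EOPNOTSUPP;
	return vp->v_op->vop_open(vp);
}

static int
vop_close_sketch(struct vnode_sketch *vp)
{
	if (vp->v_op->vop_close == NULL)
		return EOPNOTSUPP;
	return vp->v_op->vop_close(vp);
}
```

Returning EOPNOTSUPP for an unset pointer is an assumption made for the sketch;
the point is only that dispatch becomes a plain indirect call.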


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@
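
The grow-but-never-shrink policy can be pictured as tracking a high-water mark.
The helper below is a hypothetical sketch, not the kernel's code, and its name
and parameters are invented for illustration.

```c
/*
 * Hypothetical helper: return the new vnode target given the current
 * target and the current number of buffers in the buffer cache.
 * The target follows the buffer count upwards but is never lowered.
 */
static long
grow_vnode_target(long cur_vnodes, long nbufs)
{
	return nbufs > cur_vnodes ? nbufs : cur_vnodes;
}
```

For example, growing from 100 to 250 buffers raises the target to 250, and a
later drop to 120 buffers leaves it at 250.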


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in filesystems ported from freebsd. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

back out the "buf_acquire() does the bremfree()" change, made since all
callers were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@
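
Read literally, the rule for the superuser could be sketched as below. The
function name and the is_dir parameter are hypothetical, and treating
directories as always searchable by root is an assumption drawn from the
"non-directories" wording above rather than from the code itself.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Sketch of the X_OK decision for the superuser only:
 * directories pass, other files need at least one execute bit.
 */
static bool
root_x_ok_sketch(bool is_dir, mode_t mode)
{
	if (is_dir)
		return true;
	return (mode & (S_IXUSR | S_IXGRP | S_IXOTH)) != 0;
}
```

Under this sketch a regular file with mode 0644 is reported as not executable
even for root, while mode 0641 is accepted because the "other" execute bit is
set.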


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8 code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard-coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow to pull out an usb stick with ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@
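
The mismatch between a packed byte stream and a padded C structure can be
illustrated with a generic example. The structure and helper below are
hypothetical stand-ins, not the actual KERN_VNODE layout.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record; the compiler may insert padding after 'tag'. */
struct rec_sketch {
	char	tag;	/* 1 byte, usually followed by padding */
	void	*ptr;	/* pointer-aligned member */
};

/* Size of the same fields laid end to end with no padding. */
#define PACKED_SIZE	(sizeof(char) + sizeof(void *))

/* Serialize field by field, the way a "packed" producer would. */
static size_t
pack_rec(unsigned char *buf, const struct rec_sketch *r)
{
	size_t off = 0;

	memcpy(buf + off, &r->tag, sizeof(r->tag));
	off += sizeof(r->tag);
	memcpy(buf + off, &r->ptr, sizeof(r->ptr));
	off += sizeof(r->ptr);
	return off;
}
```

A consumer that casts the packed bytes to struct rec_sketch reads ptr from
offsetof(struct rec_sketch, ptr), which on most ABIs is larger than the packed
offset of 1. Sharing one structure definition between producer and consumer
avoids this class of bug.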


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into an own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing length checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from our (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches for McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a files owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.299 20-Jan-2020 claudio

struct vops is not modified during runtime so use const which moves each
into read-only data segment.
OK deraadt@ tedu@


# 1.298 10-Jan-2020 bluhm

Convert the vnode list at the mount point into a tailq. During
unmount this list is traversed and the dirty vnodes are flushed to
disk. Forced unmount expects that the list is empty after flushing,
otherwise the kernel panics with "dangling vnode". As the write
to disk can sleep, new vnodes may be inserted. If softdep is
enabled, resolving the dependencies creates new dirty vnodes and
inserts them to the list. To fix the panic, let insmntque() insert
new vnodes at the tail of the list. Then vflush() will still catch
them while traversing the list in forward direction.
OK tedu@ millert@ visa@
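
The ordering argument above can be illustrated with a small userland sketch.
It is not the kernel code: struct xnode, xflush() and xdemo() are hypothetical
names, and only the <sys/queue.h> TAILQ macros are real. A node that appears
while the list is being flushed is still visited by a forward traversal if it
was inserted at the tail, and is missed if it was inserted at the head.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a vnode on a mount point's list. */
struct xnode {
	int dirty;
	TAILQ_ENTRY(xnode) link;
};
TAILQ_HEAD(xnodelist, xnode);

/*
 * Flush every dirty node in forward order.  Flushing the first dirty
 * node "creates" one more dirty node (extra), which is inserted either
 * at the tail or at the head.  Returns the number of nodes flushed.
 */
static int
xflush(struct xnodelist *list, struct xnode *extra, int at_tail)
{
	struct xnode *np;
	int flushed = 0;

	assert(list != NULL);
	TAILQ_FOREACH(np, list, link) {
		if (!np->dirty)
			continue;
		np->dirty = 0;
		flushed++;
		if (extra != NULL) {
			if (at_tail)
				TAILQ_INSERT_TAIL(list, extra, link);
			else
				TAILQ_INSERT_HEAD(list, extra, link);
			extra = NULL;
		}
	}
	return flushed;
}

/* Build a two-node list, flush it, and report how many nodes were flushed. */
static int
xdemo(int at_tail)
{
	struct xnodelist list;
	struct xnode a = { .dirty = 1 }, b = { .dirty = 1 };
	struct xnode late = { .dirty = 1 };

	TAILQ_INIT(&list);
	TAILQ_INSERT_TAIL(&list, &a, link);
	TAILQ_INSERT_TAIL(&list, &b, link);
	return xflush(&list, &late, at_tail);
}
```

With tail insertion the late node is flushed in the same pass; with head
insertion it is left behind, which corresponds to the "dangling vnode" case.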


# 1.297 30-Dec-2019 bluhm

In vcount() a safe loop over vnodes was committed to 4.4BSD in 1994.
This is not necessary as the loop is restarted after vgone(). Switch
to SLIST_FOREACH without _SAFE.
OK visa@


# 1.296 27-Dec-2019 bluhm

Convert the speclisth hash buckets into SLIST macros. This makes
the vnode alias code more readable.
OK visa@


# 1.295 26-Dec-2019 bluhm

Fix white spaces.


# 1.294 08-Dec-2019 mpi

Convert infinite sleeps to tsleep_nsec(9).

ok visa@, jca@


Revision tags: OPENBSD_6_6_BASE
# 1.293 26-Aug-2019 anton

When a thread tries to exclusively lock a vnode, the same thread must
ensure that any other thread currently trying to acquire the underlying
vnode lock has observed that the same vnode is about to be exclusively
locked. Such threads must then sleep until the exclusive lock has been
released and then try to acquire the lock again. Otherwise, exclusive
access to the vnode cannot be guaranteed.

Thanks to naddy@ and visa@ for testing; ok visa@

Reported-by: syzbot+374d0e7e2400004957f7@syzkaller.appspotmail.com


# 1.292 25-Jul-2019 cheloha

vinvalbuf(9): tsleep(9) -> tsleep_nsec(9); ok millert@


# 1.291 19-Jul-2019 cheloha

vwaitforio(9): tsleep(9) -> tsleep_nsec(9); ok visa@


# 1.290 28-Jun-2019 visa

Skip VFS barrier lock during normal operation to reduce overhead.
This removes a system-wide serialization point, which might help
finding timing-related bugs.

OK deraadt@ anton@


# 1.289 09-Jun-2019 beck

Add a temporary workaround to make removal of giant files better

mlarkin@ noticed we would freeze while removing enormous files because
of the amount of work done to invalidate buffers on unlink. This adds
a temporary workaround to ensure we give up the lock and yield while
doing this.

The longer term answer will be to move these buffers to another list
and not do the work here.

ok deraadt@


# 1.288 19-Apr-2019 visa

Add a subsystem lock for vfs_lockf.c. This enables calling lf_advlock()
and lf_purgelocks() without the kernel lock.

OK anton@ mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.287 02-Apr-2019 visa

Restrict which filesystems are available for swap. This rules out
obvious misconfigurations that cannot work.

OK mpi@ tedu@


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa
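
A minimal userland sketch of the sticky error flag idea described above. All
names here (struct xvnode, XV_IOERROR, xv_write_done(), xv_fsync()) are
hypothetical and are not the kernel's identifiers; the sketch only assumes
that a failed write is remembered on the vnode and reported by later syncs.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define XV_IOERROR	0x01	/* hypothetical: a past write failed */

/* Hypothetical stand-in for the per-vnode state. */
struct xvnode {
	unsigned int flags;
};

/* Called when a write completes; remember any failure on the vnode. */
static void
xv_write_done(struct xvnode *vp, int error)
{
	assert(vp != NULL);
	if (error != 0)
		vp->flags |= XV_IOERROR;
}

/* A later sync reports the remembered failure instead of success. */
static int
xv_fsync(const struct xvnode *vp)
{
	assert(vp != NULL);
	return (vp->flags & XV_IOERROR) ? EIO : 0;
}
```

The flag is never cleared in this sketch, so every subsequent xv_fsync() call
keeps returning EIO even though the failed buffer itself was discarded.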


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules are no more, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
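
A short sketch of the bounds-check pattern mentioned above. The enum and the
table are hypothetical (enum xtype and xnames are not taken from the kernel);
the point is only that casting to unsigned int makes a single comparison
reject both negative and too-large values without relying on the enum's
underlying type.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical enum; XT_MAX is one past the last valid value. */
enum xtype { XT_NONE, XT_REG, XT_DIR, XT_MAX };

static const char *const xnames[] = { "none", "reg", "dir" };

/* Map an enum value to a name, rejecting out-of-range values. */
static const char *
xtype_name(enum xtype t)
{
	/* The table must cover every valid enum value. */
	assert(sizeof(xnames) / sizeof(xnames[0]) == XT_MAX);

	/* One unsigned comparison covers both "negative" and "too big". */
	if ((unsigned int)t >= XT_MAX)
		return "unknown";
	return xnames[t];
}
```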


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
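
Why chained assignments across mixed types are fragile can be shown with a
small sketch. struct xattr, XNOVAL and both helper functions are hypothetical
names used only for illustration. The value of an assignment expression is the
value of its left operand after conversion, so a narrower unsigned field in
the middle of the chain changes what the outer field receives.

```c
#include <assert.h>
#include <stddef.h>

#define XNOVAL	(-1)	/* hypothetical "no value" marker */

/* Hypothetical attribute structure with fields of different types. */
struct xattr {
	long size;
	unsigned short mode;
};

/* Chained: size receives the already-truncated unsigned value of mode. */
static void
xattr_null_chained(struct xattr *xa)
{
	assert(xa != NULL);
	xa->size = xa->mode = XNOVAL;
}

/* Separate statements: each field is converted from XNOVAL directly. */
static void
xattr_null_separate(struct xattr *xa)
{
	assert(xa != NULL);
	xa->mode = XNOVAL;
	xa->size = XNOVAL;
}
```

After the chained version, size holds the largest unsigned short value rather
than -1; after the separate version, size is -1 as intended.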


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in bytes, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and can die.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@
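
The range check described above can be sketched as follows. This is an
illustration only: struct xcred, XNGROUPS_MAX and xcred_check() are
hypothetical stand-ins, not the kernel's types or the actual export code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define XNGROUPS_MAX	16	/* hypothetical stand-in for NGROUPS_MAX */

/* Hypothetical credential structure supplied by the caller. */
struct xcred {
	int ngroups;
	unsigned int groups[XNGROUPS_MAX];
};

/* Reject group counts that would index outside the groups array. */
static int
xcred_check(const struct xcred *cr)
{
	assert(cr != NULL);
	if (cr->ngroups < 0 || cr->ngroups > XNGROUPS_MAX)
		return EINVAL;
	return 0;
}
```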


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continuing performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8; the code is much older.


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stata
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow to pull out an usb stick with ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock noone
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, noone uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into an own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing lengths checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from our (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches for McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a file's owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.298 10-Jan-2020 bluhm

Convert the vnode list at the mount point into a tailq. During
unmount this list is traversed and the dirty vnodes are flushed to
disk. Forced unmount expects that the list is empty after flushing,
otherwise the kernel panics with "dangling vnode". As the write
to disk can sleep, new vnodes may be inserted. If softdep is
enabled, resolving the dependencies creates new dirty vnodes and
inserts them to the list. To fix the panic, let insmntque() insert
new vnodes at the tail of the list. Then vflush() will still catch
them while traversing the list in forward direction.
OK tedu@ millert@ visa@


# 1.297 30-Dec-2019 bluhm

In vcount() a safe loop over vnodes was committed to 4.4BSD in 1994.
This is not necessary as the loop is restarted after vgone(). Switch
to SLIST_FOREACH without _SAFE.
OK visa@


# 1.296 27-Dec-2019 bluhm

Convert the speclisth hash buckets into SLIST macros. This makes
the vnode alias code more readable.
OK visa@


# 1.295 26-Dec-2019 bluhm

Fix white spaces.


# 1.294 08-Dec-2019 mpi

Convert infinite sleeps to tsleep_nsec(9).

ok visa@, jca@


Revision tags: OPENBSD_6_6_BASE
# 1.293 26-Aug-2019 anton

When a thread tries to exclusively lock a vnode, the same thread must
ensure that any other thread currently trying to acquire the underlying
vnode lock has observed that the same vnode is about to be exclusively
locked. Such threads must then sleep until the exclusive lock has been
released and then try to acquire the lock again. Otherwise, exclusive
access to the vnode cannot be guaranteed.

Thanks to naddy@ and visa@ for testing; ok visa@

Reported-by: syzbot+374d0e7e2400004957f7@syzkaller.appspotmail.com


# 1.292 25-Jul-2019 cheloha

vinvalbuf(9): tsleep(9) -> tsleep_nsec(9); ok millert@


# 1.291 19-Jul-2019 cheloha

vwaitforio(9): tsleep(9) -> tsleep_nsec(9); ok visa@


# 1.290 28-Jun-2019 visa

Skip VFS barrier lock during normal operation to reduce overhead.
This removes a system-wide serialization point, which might help
finding timing-related bugs.

OK deraadt@ anton@


# 1.289 09-Jun-2019 beck

Add a temporary workaround to make removal of giant files better

mlarkin@ noticed we would freeze while removing enormous files because
of the amount of work done to invalidate buffers on unlink. This adds
a temporary workaround to ensure we give up the lock and yield while
doing this.

The longer term answer will be to move these buffers to another list
and not do the work here.

ok deraadt@


# 1.288 19-Apr-2019 visa

Add a subsystem lock for vfs_lockf.c. This enables calling lf_advlock()
and lf_purgelocks() without the kernel lock.

OK anton@ mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.287 02-Apr-2019 visa

Restrict which filesystems are available for swap. This rules out
obvious misconfigurations that cannot work.

OK mpi@ tedu@


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa
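
A minimal sketch of the sticky-error idea described above, assuming a hypothetical `struct file_state` and flag name `EX_PASTERR` (neither is the kernel's actual vnode field or flag). It only models the behaviour stated in the message: once a write has failed, later fsync calls report an error instead of success.

```c
#include <errno.h>

#define EX_PASTERR	0x1	/* hypothetical flag: a past write failed */

/* Hypothetical per-file state; stands in for the flag kept on the vnode. */
struct file_state {
	int flags;
};

/* Called when a write fails and its buffer is thrown away. */
static void
ex_write_failed(struct file_state *f)
{
	f->flags |= EX_PASTERR;
}

/* fsync reports the remembered failure instead of returning success. */
static int
ex_fsync(const struct file_state *f)
{
	if (f->flags & EX_PASTERR)
		return EIO;
	return 0;
}
```

In this sketch the flag is never cleared, so every later call fails; whether and when the real flag is reset is not stated in the message and is left out here.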


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules no longer exist, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
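
As an illustration of the bounds-check pattern described above, here is a sketch using a hypothetical `enum ex_kind` and name table (not the actual enum or table touched by this revision). Casting to `unsigned int` makes the check cover both too-large values and values that would be negative if the enum were signed, without asking the compiler to compare an enum against a bound it believes can never be exceeded.

```c
#include <stddef.h>

/* Hypothetical enum and lookup table, for illustration only. */
enum ex_kind { EX_NONE, EX_REG, EX_DIR };

static const char *const ex_kind_names[] = { "none", "reg", "dir" };

static const char *
ex_kind_name(enum ex_kind k)
{
	/* Cast to unsigned so out-of-range values on either side fail. */
	if ((unsigned int)k >=
	    sizeof(ex_kind_names) / sizeof(ex_kind_names[0]))
		return "unknown";
	return ex_kind_names[k];
}
```

A caller can then index by a value read from an untrusted structure, for example `ex_kind_name((enum ex_kind)99)`, and get the fallback string instead of an out-of-bounds read.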


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
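
The fragility of chained assignments can be shown with a two-field sketch. The struct and function names are hypothetical and much smaller than struct vattr; the point is only that in `a = b = -1` the value stored in `a` is the value of `b` after conversion to the type of `b`.

```c
#include <stdint.h>

/* Hypothetical attribute struct with one wide and one narrow field. */
struct ex_attrs {
	int64_t  size;	/* wide, signed */
	uint16_t mode;	/* narrow, unsigned */
};

/* Chained: `size` receives the already-narrowed value of `mode`. */
static void
ex_attrs_null_chained(struct ex_attrs *a)
{
	a->size = a->mode = -1;
}

/* Separate statements: each field is converted from -1 on its own. */
static void
ex_attrs_null_separate(struct ex_attrs *a)
{
	a->mode = -1;
	a->size = -1;
}
```

With the chained form `size` ends up as 65535, because -1 is first converted to `uint16_t`; with separate statements `size` is -1. Both forms leave `mode` at 65535, so the bug is only visible in the wider field.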


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in bytes, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and dies.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@
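
A sketch of the range check described above, assuming a hypothetical `struct ex_cred` and a made-up limit `EX_NGROUPS_MAX` in place of the real credential structure and NGROUPS_MAX. It shows the shape of the validation, not the actual export code.

```c
#include <errno.h>

#define EX_NGROUPS_MAX	16	/* hypothetical limit for this sketch */

/* Hypothetical credential as supplied by the caller. */
struct ex_cred {
	int ngroups;
};

/* Reject group counts that would later index outside the groups array. */
static int
ex_check_cred(const struct ex_cred *cr)
{
	if (cr->ngroups < 0 || cr->ngroups > EX_NGROUPS_MAX)
		return EINVAL;
	return 0;
}
```

Checking both ends matters because the count is a signed value coming from outside the kernel: a negative count is as dangerous as one that is too large.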


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continued performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__)))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@
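
The membership rule described above might look like the following sketch. The `struct ex_cred` layout and the function name are assumptions made for illustration and are not the kernel's struct ucred or its groupmember() signature.

```c
#include <stddef.h>

/* Hypothetical, simplified credential. */
struct ex_cred {
	unsigned int egid;		/* effective group id */
	unsigned int groups[16];	/* supplementary groups */
	size_t ngroups;			/* entries used in groups[] */
};

/* Return 1 if gid is the effective gid or one of the supplementary groups. */
static int
ex_groupmember(unsigned int gid, const struct ex_cred *cr)
{
	size_t i;

	if (cr->egid == gid)
		return 1;
	for (i = 0; i < cr->ngroups; i++)
		if (cr->groups[i] == gid)
			return 1;
	return 0;
}
```

Without the first comparison, a process whose effective gid is not repeated in its supplementary list would be treated as a non-member of its own effective group.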


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@
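
The X_OK rule for the superuser described above can be sketched as a small predicate. The function name and the split into an `is_dir` flag plus permission bits are assumptions for illustration; the real check operates on vnode attributes inside the kernel.

```c
/*
 * Hypothetical helper: would an X_OK check succeed for the superuser?
 * `mode` holds the usual permission bits (e.g. 0755), `is_dir` is non-zero
 * for directories.
 */
static int
ex_root_x_ok(int is_dir, unsigned int mode)
{
	/* Directories stay searchable for the superuser. */
	if (is_dir)
		return 1;
	/* Other files need at least one execute bit (user, group or other). */
	return (mode & 0111) != 0;
}
```

So a regular file with mode 0644 is reported as not executable even to root, while 0744 or 0641 passes because one execute bit is set.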


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8 code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard-coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow pulling out a USB stick with an ffs filesystem while it is mounted
and a file is being written to the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and clean up the filesystems which implement VOP_LOCK()
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@
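
The kind of mismatch described in this entry can be illustrated with a small sketch. The record below is hypothetical and far simpler than the real KERN_VNODE data; it only shows why copying fields out back to back and reading them through a C structure can disagree about sizes and offsets.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record; not the structure the kernel actually exports. */
struct demo_rec {
	char	tag;	/* 1 byte */
	void	*ptr;	/* usually aligned to the pointer size */
};

/* Size a producer uses if it copies the fields out back to back. */
static size_t
demo_packed_size(void)
{
	return sizeof(char) + sizeof(void *);
}

/* Size a consumer sees when it uses the C structure definition. */
static size_t
demo_struct_size(void)
{
	/* The compiler may insert padding in front of ptr and at the end. */
	assert(offsetof(struct demo_rec, ptr) >= sizeof(char));
	return sizeof(struct demo_rec);
}
```

On most ABIs demo_struct_size() is larger than demo_packed_size(), and the exact difference depends on the architecture. Declaring the structure once in a shared header, as the commit does, makes both sides agree regardless of the padding rules.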


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into its own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing lengths checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from our (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches for McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a file's owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.297 30-Dec-2019 bluhm

In vcount() a safe loop over vnodes was committed to 4.4BSD in 1994.
This is not necessary as the loop is restarted after vgone(). Switch
to SLIST_FOREACH without _SAFE.
OK visa@
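
The reasoning in this entry can be sketched with the SLIST macros from <sys/queue.h>. The node type, field names and the "stale" flag are hypothetical stand-ins, not the vnode alias code itself; the point is only that restarting the walk after a removal makes the _SAFE variant unnecessary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical list element standing in for an aliased vnode. */
struct demo_node {
	int	usecount;
	bool	stale;			/* entry that has to be dropped */
	SLIST_ENTRY(demo_node) link;
};
SLIST_HEAD(demo_list, demo_node);

/*
 * Sum the use counts and drop stale entries on the way. After a removal
 * the walk starts over from the head, so a "next" pointer saved before
 * the removal is never followed and plain SLIST_FOREACH is sufficient.
 */
static int
demo_count(struct demo_list *head)
{
	struct demo_node *n;
	int count;

	assert(head != NULL);
restart:
	count = 0;
	SLIST_FOREACH(n, head, link) {
		if (n->stale) {
			SLIST_REMOVE(head, n, demo_node, link);
			goto restart;
		}
		count += n->usecount;
	}
	return count;
}
```

The restart costs an extra pass for every removed entry, which is acceptable when removals are rare.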


# 1.296 27-Dec-2019 bluhm

Convert the speclisth hash buckets into SLIST macros. This makes
the vnode alias code more readable.
OK visa@


# 1.295 26-Dec-2019 bluhm

Fix white spaces.


# 1.294 08-Dec-2019 mpi

Convert infinite sleeps to tsleep_nsec(9).

ok visa@, jca@


Revision tags: OPENBSD_6_6_BASE
# 1.293 26-Aug-2019 anton

When a thread tries to exclusively lock a vnode, the same thread must
ensure that any other thread currently trying to acquire the underlying
vnode lock has observed that the same vnode is about to be exclusively
locked. Such threads must then sleep until the exclusive lock has been
released and then try to acquire the lock again. Otherwise, exclusive
access to the vnode cannot be guaranteed.

Thanks to naddy@ and visa@ for testing; ok visa@

Reported-by: syzbot+374d0e7e2400004957f7@syzkaller.appspotmail.com


# 1.292 25-Jul-2019 cheloha

vinvalbuf(9): tsleep -> tsleep_nsec(9); ok millert@


# 1.291 19-Jul-2019 cheloha

vwaitforio(9): tsleep(9) -> tsleep_nsec(9); ok visa@


# 1.290 28-Jun-2019 visa

Skip VFS barrier lock during normal operation to reduce overhead.
This removes a system-wide serialization point, which might help
finding timing-related bugs.

OK deraadt@ anton@


# 1.289 09-Jun-2019 beck

Add a temporary workaround to make removal of giant files better

mlarkin@ noticed we would freeze while removing enormous files because
of the amount of work done to invalidate buffers on unlink. This adds
a temporary workaround to ensure we give up the lock and yield while
doing this.

The longer term answer will be to move these buffers to another list
and not do the work here.

ok deraadt@


# 1.288 19-Apr-2019 visa

Add a subsystem lock for vfs_lockf.c. This enables calling lf_advlock()
and lf_purgelocks() without the kernel lock.

OK anton@ mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.287 02-Apr-2019 visa

Restrict which filesystems are available for swap. This rules out
obvious misconfigurations that cannot work.

OK mpi@ tedu@


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa
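
A minimal model of the idea in this entry, under stated assumptions: the structure, the field name write_error_seen and the choice of EIO are hypothetical and are not taken from the commit. The sketch only shows a failed write being remembered on the vnode so that later fsync calls report it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a vnode; only the error flag is modeled. */
struct demo_vnode {
	bool	write_error_seen;	/* set once a buffer write has failed */
};

/* Called when a buffer write completes; err is 0 on success. */
static void
demo_write_done(struct demo_vnode *vp, int err)
{
	assert(vp != NULL);
	/* The failed buffer is thrown away, but the failure is remembered. */
	if (err != 0)
		vp->write_error_seen = true;
}

/* Model of fsync: report a remembered failure instead of success. */
static int
demo_fsync(const struct demo_vnode *vp)
{
	assert(vp != NULL);
	return vp->write_error_seen ? EIO : 0;
}
```

In this model a later successful write does not clear the flag, which matches the "future fsync calls" wording of the entry; whether and when the real flag is cleared is not stated here.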


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@
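
The pattern can be sketched in portable C11 with <stdatomic.h>. This swaps in the standard C atomics for whatever kernel primitives the commit uses, and the structure and function names are hypothetical; only the unsigned counter updated atomically comes from the entry.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for struct vfsconf; only the counter is modeled. */
struct demo_vfsconf {
	atomic_uint	refcount;	/* unsigned, updated atomically */
};

/* Take a reference without holding any lock. */
static void
demo_vfs_ref(struct demo_vfsconf *vfc)
{
	atomic_fetch_add(&vfc->refcount, 1);
}

/* Drop a reference; the previous value must have been non-zero. */
static void
demo_vfs_unref(struct demo_vfsconf *vfc)
{
	unsigned int old = atomic_fetch_sub(&vfc->refcount, 1);

	assert(old > 0);	/* dropping a reference never taken is a bug */
	(void)old;		/* keep builds with NDEBUG warning-free */
}

/* Read the current number of references. */
static unsigned int
demo_vfs_refs(struct demo_vfsconf *vfc)
{
	return atomic_load(&vfc->refcount);
}
```

Because the counter is unsigned, an unbalanced decrement would wrap around to a very large value instead of going negative, which is why the sketch asserts on the previous value.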


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules are no longer, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@
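
The hazard described above can be illustrated outside the kernel. The following is only a sketch under stated assumptions: struct demo_mount and all demo_* functions are hypothetical, the "sleep" is simulated by unlinking the next element from inside the loop body, and removed elements are flagged instead of freed so the demonstration stays well defined.

```c
#include <stddef.h>

/* Hypothetical stand-in for a mount point on a singly linked list. */
struct demo_mount {
	struct demo_mount *next;
	int removed;		/* set when "unmounted"; real code would free it */
};

/* Unlink the element after mp, as a concurrent unmount might while we sleep. */
static void
demo_unmount_next(struct demo_mount *mp)
{
	struct demo_mount *victim = mp->next;

	if (victim != NULL) {
		mp->next = victim->next;
		victim->removed = 1;
	}
}

/* _SAFE-style walk: the next pointer is cached before the body runs. */
int
demo_walk_cached_next(struct demo_mount *head)
{
	struct demo_mount *mp, *nmp;
	int stale = 0, first = 1;

	for (mp = head; mp != NULL; mp = nmp) {
		nmp = mp->next;			/* cached too early */
		if (mp->removed)
			stale++;		/* visited a removed element */
		if (first) {
			demo_unmount_next(mp);	/* body "sleeps" once */
			first = 0;
		}
	}
	return stale;
}

/* Plain walk: the next pointer is read only after the body has finished. */
int
demo_walk_reread_next(struct demo_mount *head)
{
	struct demo_mount *mp;
	int stale = 0, first = 1;

	for (mp = head; mp != NULL; mp = mp->next) {
		if (mp->removed)
			stale++;
		if (first) {
			demo_unmount_next(mp);
			first = 0;
		}
	}
	return stale;
}
```

On a three-element list the cached walk visits the removed element once, while the plain walk never does. In the kernel the removed element would have been freed, so the cached walk would be a use-after-free.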


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
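
A minimal sketch of the pattern, assuming a hypothetical enum and name table rather than the real vnode types: casting to an unsigned type turns the range check into one comparison that also rejects negative values, and it avoids a compiler deciding that a comparison on the enum type itself is always true.

```c
#include <stddef.h>

/* Hypothetical enum and name table, for illustration only. */
enum demo_vtype { DEMO_VNON, DEMO_VREG, DEMO_VDIR };

static const char *const demo_typename[] = { "VNON", "VREG", "VDIR" };

#define DEMO_NITEMS(a)	(sizeof(a) / sizeof((a)[0]))

/* Return a printable name, or "unknown" for out-of-range values. */
const char *
demo_vtype_name(enum demo_vtype t)
{
	/* The cast makes negative values compare as very large ones. */
	if ((unsigned int)t >= DEMO_NITEMS(demo_typename))
		return "unknown";
	return demo_typename[t];
}
```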


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
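
To see why chained assignments are fragile, consider this sketch. The struct and field names are hypothetical and much simpler than struct vattr. The value of an assignment expression has the type of its left operand, so a narrow unsigned field in the middle of a chain changes what the wider field receives.

```c
/* Hypothetical two-field structure mixing widths and signedness. */
struct demo_attr {
	long		a_size;		/* wide signed field */
	unsigned short	a_mode;		/* narrow unsigned field */
};

/* Chained form: a_size receives the already-truncated a_mode value. */
void
demo_null_chained(struct demo_attr *ap)
{
	ap->a_size = ap->a_mode = -1;
}

/* Separate statements: each field is converted from -1 on its own. */
void
demo_null_separate(struct demo_attr *ap)
{
	ap->a_mode = -1;
	ap->a_size = -1;
}
```

With the chained form a_size ends up holding the largest unsigned short value instead of -1. With separate statements both fields get the intended "no value" marker.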


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in bytes, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and dies.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@
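
The check itself is small. A sketch, with DEMO_NGROUPS_MAX as an assumed placeholder for the system's NGROUPS_MAX and demo_check_ngroups() as a hypothetical helper name:

```c
#include <errno.h>

/* Assumed limit for illustration; the real value comes from NGROUPS_MAX. */
#define DEMO_NGROUPS_MAX	16

/* Return 0 if the group count is usable, EINVAL otherwise. */
int
demo_check_ngroups(int ngroups)
{
	if (ngroups < 0 || ngroups > DEMO_NGROUPS_MAX)
		return EINVAL;
	return 0;
}
```

Rejecting the value up front means later code can index a fixed-size group array with it without further checks.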


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free
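
A userland sketch of the idiom, combined with a sized free. All demo_* names are hypothetical, and the byte counter only stands in for whatever accounting a sized free enables. Using sizeof(*ptr) keeps the allocation and release sizes tied to the pointer's type, so they stay in step if that type changes.

```c
#include <stdlib.h>

/* Hypothetical structure and allocator wrappers, for illustration only. */
struct demo_cred {
	int	uid;
	int	ngroups;
};

static size_t demo_bytes_in_use;

/* Allocate zeroed memory and account for it. */
void *
demo_alloc(size_t size)
{
	void *p = calloc(1, size);

	if (p != NULL)
		demo_bytes_in_use += size;
	return p;
}

/* Release memory; the caller states how large the allocation was. */
void
demo_free(void *p, size_t size)
{
	if (p == NULL)
		return;
	demo_bytes_in_use -= size;
	free(p);
}

size_t
demo_in_use(void)
{
	return demo_bytes_in_use;
}

struct demo_cred *
demo_cred_new(void)
{
	struct demo_cred *cr;

	cr = demo_alloc(sizeof(*cr));	/* size follows the pointer's type */
	return cr;
}

void
demo_cred_delete(struct demo_cred *cr)
{
	demo_free(cr, sizeof(*cr));	/* same expression, same size */
}
```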


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continuing performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__)))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@
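
A sketch of the resulting membership test, assuming a simplified credential structure. struct demo_ucred and demo_groupmember() are hypothetical names and the array size is arbitrary. The effective gid is compared first, then the supplementary list.

```c
/* Hypothetical, simplified credential structure. */
struct demo_ucred {
	unsigned int	cr_gid;		/* effective group id */
	int		cr_ngroups;	/* entries used in cr_groups */
	unsigned int	cr_groups[16];	/* supplementary groups */
};

/* Return 1 if gid is the effective gid or a supplementary group, else 0. */
int
demo_groupmember(unsigned int gid, const struct demo_ucred *cred)
{
	int i;

	if (cred->cr_gid == gid)
		return 1;
	for (i = 0; i < cred->cr_ngroups; i++) {
		if (cred->cr_groups[i] == gid)
			return 1;
	}
	return 0;
}
```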


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used to let checkdirs() avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@
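
The rule can be written down as a small predicate. This is only a sketch: demo_root_x_ok() and DEMO_ANY_EXEC are hypothetical names, and the real check sits inside the kernel's access handling, not in a standalone helper.

```c
/* Execute bits for user, group and other. */
#define DEMO_ANY_EXEC	0111

/*
 * Return 1 if an X_OK check made by the superuser should succeed
 * for a file with the given permission bits, else 0.
 */
int
demo_root_x_ok(unsigned int mode, int is_dir)
{
	if (is_dir)
		return 1;	/* the rule applies to non-directories only */
	return (mode & DEMO_ANY_EXEC) != 0;
}
```

For example, a regular file with mode 0644 fails the check even for root, while 0744 passes.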


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8 code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow to pull out a usb stick with ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@
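
The mismatch described above, a back-to-back (packed) byte layout on one side and the compiler's padded structure layout on the other, can be illustrated with a small sketch. The record below is hypothetical and is not the structure used by the KERN_VNODE sysctl; it only shows that the two layouts generally differ in size and member offset, which is why sharing one definition in a header keeps producer and consumer in agreement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for illustration, not the real kernel structure. */
struct demo_rec {
	char		tag;	/* one byte */
	long long	body;	/* usually needs stricter alignment than char */
};

/* Size of the two members written back to back, with no padding. */
static size_t
demo_packed_size(void)
{
	return sizeof(char) + sizeof(long long);
}

/* Size the compiler gives the structure, padding included. */
static size_t
demo_struct_size(void)
{
	return sizeof(struct demo_rec);
}

/* Offset at which a reader using the structure expects to find body. */
static size_t
demo_body_offset(void)
{
	return offsetof(struct demo_rec, body);
}
```

A writer that emits demo_packed_size() bytes per record and a reader that steps by demo_struct_size() bytes will drift apart whenever the ABI inserts padding, which depends on the architecture.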


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into its own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing lengths checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from ours (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches for McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a files owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.296 27-Dec-2019 bluhm

Convert the speclisth hash buckets into SLIST macros. This makes
the vnode alias code more readable.
OK visa@


# 1.295 26-Dec-2019 bluhm

Fix white spaces.


# 1.294 08-Dec-2019 mpi

Convert infinite sleeps to tsleep_nsec(9).

ok visa@, jca@


Revision tags: OPENBSD_6_6_BASE
# 1.293 26-Aug-2019 anton

When a thread tries to exclusively lock a vnode, the same thread must
ensure that any other thread currently trying to acquire the underlying
vnode lock has observed that the same vnode is about to be exclusively
locked. Such threads must then sleep until the exclusive lock has been
released and then try to acquire the lock again. Otherwise, exclusive
access to the vnode cannot be guaranteed.

Thanks to naddy@ and visa@ for testing; ok visa@

Reported-by: syzbot+374d0e7e2400004957f7@syzkaller.appspotmail.com


# 1.292 25-Jul-2019 cheloha

vinvalbuf(9): tsleep(9) -> tsleep_nsec(9); ok millert@


# 1.291 19-Jul-2019 cheloha

vwaitforio(9): tsleep(9) -> tsleep_nsec(9); ok visa@


# 1.290 28-Jun-2019 visa

Skip VFS barrier lock during normal operation to reduce overhead.
This removes a system-wide serialization point, which might help
finding timing-related bugs.

OK deraadt@ anton@


# 1.289 09-Jun-2019 beck

Add a temporary workaround to make removal of giant files better

mlarkin@ noticed we would freeze while removing enormous files because
of the amount of work done to invalidate buffers on unlink. This adds
a temporary workaround to ensure we give up the lock and yield while
doing this.

The longer term answer will be to move these buffers to another list
and not do the work here.

ok deraadt@


# 1.288 19-Apr-2019 visa

Add a subsystem lock for vfs_lockf.c. This enables calling lf_advlock()
and lf_purgelocks() without the kernel lock.

OK anton@ mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.287 02-Apr-2019 visa

Restrict which filesystems are available for swap. This rules out
obvious misconfigurations that cannot work.

OK mpi@ tedu@


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa
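
The idea in this commit, remembering a failed write on the vnode so that a later fsync cannot report success, can be sketched as a toy model. All names below (demo_vnode, DEMO_VERROR, demo_write_failed, demo_fsync) are hypothetical and are not the kernel's identifiers; whether and when the real flag is cleared is not stated in the commit message, so this sketch simply leaves it set.

```c
#include <assert.h>
#include <errno.h>

#define DEMO_VERROR	0x01	/* hypothetical "a past write failed" flag */

/* Hypothetical stand-in for a vnode; only the flag word matters here. */
struct demo_vnode {
	unsigned int	flags;
};

/* Called when a write fails and its buffer is thrown away. */
static void
demo_write_failed(struct demo_vnode *vp)
{
	vp->flags |= DEMO_VERROR;
}

/*
 * Report the remembered failure instead of success.
 * Assumption: the flag stays set, so every later call fails too.
 */
static int
demo_fsync(struct demo_vnode *vp)
{
	if (vp->flags & DEMO_VERROR)
		return EIO;
	return 0;
}
```

Without the flag, the discarded buffer leaves nothing dirty behind, so the fsync path would have no way to know that data was lost.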


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules are no longer, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@
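
The hazard described above can be reproduced in a single-threaded sketch. It assumes a TAILQ from <sys/queue.h>, defines its own DEMO_FOREACH_SAFE (a local equivalent of the _SAFE iterators, since not every libc ships them), and uses a hypothetical sleep_and_lose() helper to stand in for another thread unmounting the next entry while the loop body sleeps. The entry is only marked as gone rather than freed, so the program stays well defined; in the kernel the stale pointer would refer to freed memory.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

struct mnt {
	TAILQ_ENTRY(mnt) link;
	int gone;	/* set once the entry has been "unmounted" */
};
TAILQ_HEAD(mntlist, mnt);

/* Local _SAFE-style iterator: caches the next pointer before the body runs. */
#define DEMO_FOREACH_SAFE(var, head, field, tvar)			\
	for ((var) = TAILQ_FIRST(head);					\
	    (var) != NULL && ((tvar) = TAILQ_NEXT(var, field), 1);	\
	    (var) = (tvar))

/* Stand-in for a sleep during which another thread removes `victim`. */
static void
sleep_and_lose(struct mntlist *head, struct mnt *victim)
{
	if (!victim->gone) {
		TAILQ_REMOVE(head, victim, link);
		victim->gone = 1;
	}
}

/* Walk a three-entry list and count visits to entries already removed. */
static int
count_stale_visits(int use_safe)
{
	struct mntlist head;
	struct mnt a = { .gone = 0 }, b = { .gone = 0 }, c = { .gone = 0 };
	struct mnt *mp, *nmp;
	int stale = 0;

	TAILQ_INIT(&head);
	TAILQ_INSERT_TAIL(&head, &a, link);
	TAILQ_INSERT_TAIL(&head, &b, link);
	TAILQ_INSERT_TAIL(&head, &c, link);

	if (use_safe) {
		DEMO_FOREACH_SAFE(mp, &head, link, nmp) {
			if (mp->gone) {
				stale++;	/* reached via the cached pointer */
				continue;
			}
			if (mp == &a)
				sleep_and_lose(&head, &b);
		}
	} else {
		TAILQ_FOREACH(mp, &head, link) {
			if (mp->gone) {
				stale++;
				continue;
			}
			if (mp == &a)
				sleep_and_lose(&head, &b);
		}
	}
	return stale;
}
```

The plain iterator reads the next pointer only after the body has finished, so it steps from the current, still-busy entry to whatever follows it at that moment. The _SAFE form protects against removing the current entry, which is a different problem from the one in this commit.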


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@
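
Illustration (not part of the commit): a minimal userland sketch of what
the _SAFE iteration form does and does not protect against. The list
type and the LIST_WALK macros are simplified stand-ins invented for this
example; they are not the kernel's struct mount or the queue(3) macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a mount list entry. */
struct node {
	int		 id;
	struct node	*next;
};

/* Plain walk: reads var->next after the body has run. */
#define LIST_WALK(var, head) \
	for ((var) = (head); (var) != NULL; (var) = (var)->next)

/* _SAFE walk: caches the next pointer in tvar before the body runs. */
#define LIST_WALK_SAFE(var, head, tvar) \
	for ((var) = (head); \
	    (var) != NULL && ((tvar) = (var)->next, 1); \
	    (var) = (tvar))

/* Push a node on the front of the list and return the new head. */
struct node *
push(struct node *head, int id)
{
	struct node *n = malloc(sizeof(*n));

	assert(n != NULL);
	n->id = id;
	n->next = head;
	return n;
}

/* Read-only walk: nothing is freed, so the plain form is enough. */
int
count(struct node *head)
{
	struct node *n;
	int c = 0;

	LIST_WALK(n, head)
		c++;
	return c;
}

/*
 * The body frees the current element, so the _SAFE form is needed.
 * It only guards against that: the cached tvar is itself an unlocked
 * pointer, and would dangle if something else freed that element
 * while the body was sleeping.
 */
int
free_all(struct node *head)
{
	struct node *n, *tmp;
	int freed = 0;

	LIST_WALK_SAFE(n, head, tmp) {
		free(n);
		freed++;
	}
	return freed;
}
```

In a single-threaded program the cached pointer can never go stale, so
this only shows the mechanics; the race described above needs a second
context that can free the next element.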


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
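
Illustration (not part of the commit): a small sketch of the cast
pattern. The enum, the table and the function name are made up for the
example and are not taken from vfs_subr.c.

```c
#include <assert.h>
#include <stddef.h>

#ifndef nitems
#define nitems(a)	(sizeof(a) / sizeof((a)[0]))
#endif

/* Hypothetical enum and name table, for illustration only. */
enum kind { K_NONE, K_REG, K_DIR, K_COUNT };

static const char *const kind_names[] = { "none", "reg", "dir" };

const char *
kind_name(enum kind k)
{
	/* The table is expected to have one entry per enumerator. */
	assert(nitems(kind_names) == K_COUNT);

	/*
	 * Compare as unsigned: one test rejects values past the end of
	 * the table as well as negative values, and the comparison no
	 * longer involves the enum type directly.
	 */
	if ((unsigned int)k >= nitems(kind_names))
		return "invalid";
	return kind_names[k];
}
```

A value such as (enum kind)42, which can arrive from untrusted input,
takes the "invalid" branch instead of indexing past the table.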


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb
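
Illustration (not part of the commit): a userland sketch of a 1024-bit
allocation bitmap kept in its own allocation instead of inside every
node. All names here (EX_NCLONES, clonemap_*) are invented for the
example; the kernel code differs.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

#define EX_NCLONES	1024	/* limit quoted in the commit message */
#define EX_BITS		(sizeof(unsigned long) * CHAR_BIT)
#define EX_WORDS	((EX_NCLONES + EX_BITS - 1) / EX_BITS)

/* Allocate a zeroed bitmap separately from whatever owns it. */
unsigned long *
clonemap_alloc(void)
{
	return calloc(EX_WORDS, sizeof(unsigned long));
}

/* Claim the lowest free instance number, or return -1 if all are used. */
int
clonemap_get(unsigned long *map)
{
	int i;

	assert(map != NULL);
	for (i = 0; i < EX_NCLONES; i++) {
		unsigned long bit = 1UL << (i % EX_BITS);

		if ((map[i / EX_BITS] & bit) == 0) {
			map[i / EX_BITS] |= bit;
			return i;
		}
	}
	return -1;
}

/* Release a previously claimed instance number. */
void
clonemap_put(unsigned long *map, int i)
{
	assert(map != NULL && i >= 0 && i < EX_NCLONES);
	map[i / EX_BITS] &= ~(1UL << (i % EX_BITS));
}
```

Keeping only a pointer in the owning structure means nodes that never
clone pay for one pointer rather than 128 bytes of bitmap.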


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
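
Illustration (not part of the commit): why chained assignments are
fragile when the lvalues differ in size and signedness. The struct is a
two-field stand-in invented for the example, not struct vattr.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical attribute struct with mixed field types. */
struct ex_attrs {
	long long	size;	/* wide and signed */
	unsigned short	mode;	/* narrow and unsigned */
};

/*
 * Chained form: the value of (a->mode = -1) is the value stored in
 * mode, i.e. USHRT_MAX, and that is what size receives.
 */
void
set_chained(struct ex_attrs *a)
{
	assert(a != NULL);
	a->size = a->mode = -1;
}

/* Separate statements: each field gets -1 converted to its own type. */
void
set_separate(struct ex_attrs *a)
{
	assert(a != NULL);
	a->mode = -1;
	a->size = -1;
}
```

After set_chained() size holds USHRT_MAX rather than -1; swapping the
order of the two lvalues in the chain would hide the problem, which is
the fragility the message describes.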


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in bytes, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and dies.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims have no
directly used symbols. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@
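
Illustration (not part of the commit): the shape of the range check
described above. EX_NGROUPS_MAX and struct ex_cred are stand-ins chosen
for the example; the real structure and limit live in kernel headers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define EX_NGROUPS_MAX	16	/* assumed limit, for illustration */

/* Hypothetical credential as supplied by the caller. */
struct ex_cred {
	int		cr_ngroups;
	unsigned int	cr_groups[EX_NGROUPS_MAX];
};

/*
 * Reject a group count that is negative or larger than the array
 * before anything uses it as a loop bound or copy length.
 */
int
ex_check_cred(const struct ex_cred *c)
{
	assert(c != NULL);
	if (c->cr_ngroups < 0 || c->cr_ngroups > EX_NGROUPS_MAX)
		return EINVAL;
	return 0;
}
```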


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too
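
Illustration (not part of the commit): the portable way to print a
time_t whose width is not known to the format string. The function name
is invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

/*
 * time_t may be 32 or 64 bits depending on the platform, so cast to
 * long long and use %lld instead of guessing with %d or %ld.
 * Returns what snprintf() returns.
 */
int
ex_format_time(char *buf, size_t len, time_t t)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%lld", (long long)t);
}
```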


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continuing performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__)))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@
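
Illustration (not part of the commit): a sketch of a membership test
that looks at the effective gid as well as the supplementary list. The
struct and function are simplified stand-ins, not the kernel's struct
ucred or groupmember().

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

#define EX_NGROUPS	16	/* assumed array size, for illustration */

/* Hypothetical credential. */
struct ex_ucred {
	gid_t	cr_gid;			/* effective gid */
	short	cr_ngroups;		/* entries used in cr_groups */
	gid_t	cr_groups[EX_NGROUPS];	/* supplementary groups */
};

/* Return 1 if gid is the effective gid or a supplementary group. */
int
ex_groupmember(gid_t gid, const struct ex_ucred *cred)
{
	int i;

	assert(cred != NULL);
	if (cred->cr_gid == gid)
		return 1;
	for (i = 0; i < cred->cr_ngroups; i++)
		if (cred->cr_groups[i] == gid)
			return 1;
	return 0;
}
```

Without the first comparison, a process whose effective gid is not also
listed among its supplementary groups would fail the check for its own
group.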


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8; code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow pulling out a usb stick with ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK()
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_unbusy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into an own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing lengths checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from ours (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches for McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a files owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.295 26-Dec-2019 bluhm

Fix white spaces.


# 1.294 08-Dec-2019 mpi

Convert infinite sleeps to tsleep_nsec(9).

ok visa@, jca@


Revision tags: OPENBSD_6_6_BASE
# 1.293 26-Aug-2019 anton

When a thread tries to exclusively lock a vnode, the same thread must
ensure that any other thread currently trying to acquire the underlying
vnode lock has observed that the same vnode is about to be exclusively
locked. Such threads must then sleep until the exclusive lock has been
released and then try to acquire the lock again. Otherwise, exclusive
access to the vnode cannot be guaranteed.

Thanks to naddy@ and visa@ for testing; ok visa@

Reported-by: syzbot+374d0e7e2400004957f7@syzkaller.appspotmail.com


# 1.292 25-Jul-2019 cheloha

vinvalbuf(9): tsleep(9) -> tsleep_nsec(9); ok millert@


# 1.291 19-Jul-2019 cheloha

vwaitforio(9): tsleep(9) -> tsleep_nsec(9); ok visa@


# 1.290 28-Jun-2019 visa

Skip VFS barrier lock during normal operation to reduce overhead.
This removes a system-wide serialization point, which might help
finding timing-related bugs.

OK deraadt@ anton@


# 1.289 09-Jun-2019 beck

Add a temporary workaround to make removal of giant files better

mlarkin@ noticed we would freeze while removing enormous files because
of the amount of work done to invalidate buffers on unlink. This adds
a temporary workaround to ensure we give up the lock and yield while
doing this.

The longer term answer will be to move these buffers to another list
and not do the work here.

ok deraadt@


# 1.288 19-Apr-2019 visa

Add a subsystem lock for vfs_lockf.c. This enables calling lf_advlock()
and lf_purgelocks() without the kernel lock.

OK anton@ mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.287 02-Apr-2019 visa

Restrict which filesystems are available for swap. This rules out
obvious misconfigurations that cannot work.

OK mpi@ tedu@


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules are no longer supported, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing kernel memory to userland.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
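
The chained-assignment hazard described in 1.236 can be illustrated with a small standalone sketch. The structure and function names are hypothetical and the -1 sentinel is an assumption for the example; it is not the vattr_null() code itself.

```c
#include <stdint.h>

/* Hypothetical two-field structure mixing width and signedness. */
struct attrs_sketch {
	int64_t  size;	/* wide, signed */
	uint16_t mode;	/* narrow, unsigned */
};

/*
 * Chained form: the value of (a->mode = -1) is the stored uint16_t,
 * i.e. 65535, and that is what gets assigned to a->size.
 */
static void
attrs_null_chained(struct attrs_sketch *a)
{
	a->size = a->mode = -1;
}

/* Separate statements: each field gets the sentinel in its own type. */
static void
attrs_null_separate(struct attrs_sketch *a)
{
	a->mode = (uint16_t)-1;
	a->size = -1;
}
```

With the chained form `size` ends up as 65535 rather than -1; writing `a->mode = a->size = -1` would happen to work, which is the ordering dependence the commit message calls fragile.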


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in bytes, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and dies.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@
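
The range check described in 1.229 amounts to something like the sketch below. The function name is hypothetical, and the fallback value for `NGROUPS_MAX` is an assumption used only when the system headers do not define it.

```c
#include <errno.h>
#include <limits.h>

#ifndef NGROUPS_MAX
#define NGROUPS_MAX 16	/* assumption: fallback for this sketch only */
#endif

/*
 * Reject a caller-supplied group count before it is used as an
 * array bound. Returns 0 if acceptable, EINVAL otherwise.
 */
static int
check_export_ngroups(int ngroups)
{
	if (ngroups < 0 || ngroups > NGROUPS_MAX)
		return EINVAL;
	return 0;
}
```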


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continuing performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__)))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@
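
The behaviour described in 1.199 can be sketched as a membership test that looks at the effective gid before the supplementary list. The structure, its field names and the fixed array size are hypothetical stand-ins for this example, not the kernel's struct ucred.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical credential structure for illustration. */
struct cred_sketch {
	gid_t  egid;		/* effective group id */
	size_t ngroups;		/* number of valid entries in groups[] */
	gid_t  groups[16];	/* supplementary groups */
};

/* Return 1 if gid is the effective gid or a supplementary group. */
static int
groupmember_sketch(gid_t gid, const struct cred_sketch *cr)
{
	size_t i;

	if (cr->egid == gid)
		return 1;
	for (i = 0; i < cr->ngroups; i++)
		if (cr->groups[i] == gid)
			return 1;
	return 0;
}
```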


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8 code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow to pull out an usb stick with ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return whether it actually succeeded in freeing some
memory, and use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks of intense debugging and we need -current
to be stable, back out everything to the state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into an own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing length checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from ours (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches for McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a file's owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.294 08-Dec-2019 mpi

Convert infinite sleeps to tsleep_nsec(9).

ok visa@, jca@


Revision tags: OPENBSD_6_6_BASE
# 1.293 26-Aug-2019 anton

When a thread tries to exclusively lock a vnode, the same thread must
ensure that any other thread currently trying to acquire the underlying
vnode lock has observed that the same vnode is about to be exclusively
locked. Such threads must then sleep until the exclusive lock has been
released and then try to acquire the lock again. Otherwise, exclusive
access to the vnode cannot be guaranteed.

Thanks to naddy@ and visa@ for testing; ok visa@

Reported-by: syzbot+374d0e7e2400004957f7@syzkaller.appspotmail.com


# 1.292 25-Jul-2019 cheloha

vinvalbuf(9): tsleep(9) -> tsleep_nsec(9); ok millert@


# 1.291 19-Jul-2019 cheloha

vwaitforio(9): tsleep(9) -> tsleep_nsec(9); ok visa@


# 1.290 28-Jun-2019 visa

Skip VFS barrier lock during normal operation to reduce overhead.
This removes a system-wide serialization point, which might help
finding timing-related bugs.

OK deraadt@ anton@


# 1.289 09-Jun-2019 beck

Add a temporary workaround to make removal of giant files better

mlarkin@ noticed we would freeze while removing enormous files because
of the amount of work done to invalidate buffers on unlink. This adds
a temporary workaround to ensure we give up the lock and yield while
doing this.

The longer term answer will be to move these buffers to another list
and not do the work here.

ok deraadt@


# 1.288 19-Apr-2019 visa

Add a subsystem lock for vfs_lockf.c. This enables calling lf_advlock()
and lf_purgelocks() without the kernel lock.

OK anton@ mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.287 02-Apr-2019 visa

Restrict which filesystems are available for swap. This rules out
obvious misconfigurations that cannot work.

OK mpi@ tedu@


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa
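
A minimal sketch of the idea in this entry, assuming a toy vnode with a single flag word. The names `toy_vnode`, `TOY_WRITE_ERROR`, `toy_write_done` and `toy_fsync` are hypothetical and do not reflect the actual kernel code; the sketch only models "remember a failed write so a later fsync cannot report success".

```c
#include <errno.h>

#define TOY_WRITE_ERROR	0x01	/* hypothetical flag name */

struct toy_vnode {
	int	flags;
};

/* Called when a buffer write completes; status is 0 or an errno value. */
void
toy_write_done(struct toy_vnode *vp, int status)
{
	/* The failed buffer is thrown away, so record the failure here. */
	if (status != 0)
		vp->flags |= TOY_WRITE_ERROR;
}

/* fsync reports a remembered failure even if no dirty buffer is left. */
int
toy_fsync(const struct toy_vnode *vp)
{
	return (vp->flags & TOY_WRITE_ERROR) ? EIO : 0;
}
```

Whether and when such a flag should be cleared is a policy question the sketch deliberately leaves out.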


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@
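
As a rough illustration of an atomically updated unsigned reference count, the sketch below uses C11 `<stdatomic.h>` as a stand-in; the kernel has its own atomic primitives, and the struct and function names here are hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for a filesystem configuration entry. */
struct toy_vfsconf {
	atomic_uint	refcount;
};

/* Take a reference without holding any lock. */
void
toy_ref(struct toy_vfsconf *vfc)
{
	atomic_fetch_add(&vfc->refcount, 1);
}

/* Drop a reference and return the count that remains. */
unsigned int
toy_unref(struct toy_vfsconf *vfc)
{
	/* atomic_fetch_sub returns the value held before the subtraction. */
	return atomic_fetch_sub(&vfc->refcount, 1) - 1;
}
```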


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules are no longer supported, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting files systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@
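
The hazard can be sketched with a toy singly linked list. Everything below is hypothetical (it uses neither the real mount list nor the queue(3) macros), and a `gone` flag stands in for freeing so the example stays well defined. One walk caches the next pointer before the loop body runs, as a _SAFE macro does; the other rereads it afterwards.

```c
#include <stddef.h>

struct toy_mount {
	struct toy_mount *next;
	int		  gone;	/* set once the entry has been "unmounted" */
};

/* Stand-in for another process unmounting the successor while we sleep. */
static void
unlink_next(struct toy_mount *mp)
{
	struct toy_mount *victim = mp->next;

	if (victim != NULL) {
		victim->gone = 1;
		mp->next = victim->next;	/* a kernel would free it here */
	}
}

/* _SAFE style: next pointer cached before the body. Returns stale visits. */
int
walk_cached_next(struct toy_mount *head)
{
	struct toy_mount *mp, *nmp;
	int stale = 0, first = 1;

	for (mp = head; mp != NULL; mp = nmp) {
		nmp = mp->next;
		if (mp->gone)
			stale++;	/* would be a use-after-free */
		if (first) {
			unlink_next(mp);
			first = 0;
		}
	}
	return stale;
}

/* Plain style: next pointer reread after the body. Returns stale visits. */
int
walk_reread_next(struct toy_mount *head)
{
	struct toy_mount *mp;
	int stale = 0, first = 1;

	for (mp = head; mp != NULL; mp = mp->next) {
		if (mp->gone)
			stale++;
		if (first) {
			unlink_next(mp);
			first = 0;
		}
	}
	return stale;
}
```

The cached-pointer walk steps onto the entry that disappeared during the body; the plain walk does not, because the current entry is the only one known to be protected.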


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
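
One plausible shape of such a check, as a sketch only: the enum, table and function names are hypothetical, and the actual comparison changed in this revision is not shown in the log. When the compiler picks an unsigned underlying type for an enum, a `< 0` test is always false and draws a warning; casting to unsigned and comparing against the table size covers both ends with one test.

```c
#include <stddef.h>

enum toy_vtype { TOY_VNON, TOY_VREG, TOY_VDIR, TOY_VBAD };

static const char *const toy_typenames[] = { "VNON", "VREG", "VDIR", "VBAD" };

#define TOY_NITEMS(a)	(sizeof(a) / sizeof((a)[0]))

/* Return 1 if t can safely index toy_typenames[], 0 otherwise. */
int
toy_vtype_valid(enum toy_vtype t)
{
	/* A negative value wraps to a huge unsigned one and is rejected too. */
	return (unsigned int)t < TOY_NITEMS(toy_typenames);
}
```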


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb
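
A minimal userland sketch of a separately allocated 1024-bit clone bitmap follows. The names are hypothetical and the allocation uses calloc(3) rather than kernel allocators; only the limit of 1024 comes from the commit message.

```c
#include <limits.h>
#include <stdlib.h>

#define TOY_NCLONES	1024	/* limit quoted in the commit message */
#define TOY_WORDBITS	(sizeof(unsigned long) * CHAR_BIT)
#define TOY_NWORDS	(TOY_NCLONES / TOY_WORDBITS)

/* Allocate the bitmap on its own, so ordinary nodes carry only a pointer. */
unsigned long *
toy_bitmap_alloc(void)
{
	return calloc(TOY_NWORDS, sizeof(unsigned long));
}

/* Claim the lowest free clone number, or return -1 when all are taken. */
int
toy_clone_get(unsigned long *map)
{
	for (int i = 0; i < TOY_NCLONES; i++) {
		unsigned long bit = 1UL << (i % TOY_WORDBITS);

		if ((map[i / TOY_WORDBITS] & bit) == 0) {
			map[i / TOY_WORDBITS] |= bit;
			return i;
		}
	}
	return -1;
}

/* Release a clone number so that it can be handed out again. */
void
toy_clone_put(unsigned long *map, int i)
{
	map[i / TOY_WORDBITS] &= ~(1UL << (i % TOY_WORDBITS));
}
```

Keeping the bitmap behind a pointer means only nodes that actually clone pay for the 128 bytes.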


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
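
The fragility of chained assignment across mixed types can be shown with a two-variable sketch; the function names are hypothetical and the types are chosen only to make the effect visible. In `wide = narrow = -1`, the value stored in `wide` is the value of `narrow` after conversion, not the original -1.

```c
/* Chained: the value passes through the narrow unsigned field first. */
long long
chained_assign(void)
{
	unsigned char narrow;
	long long wide;

	wide = narrow = -1;	/* narrow wraps to its maximum; wide copies that */
	(void)narrow;
	return wide;
}

/* Separate statements: each lvalue is converted from -1 on its own. */
long long
separate_assign(void)
{
	unsigned char narrow;
	long long wide;

	narrow = -1;
	wide = -1;
	(void)narrow;
	return wide;
}
```

Reordering the chain so the wide signed field comes last would hide the problem, which is why one statement per field is the sturdier form.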


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in bytes, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and can die.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@
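
The range check described here amounts to the following sketch. The struct, the function and the `TOY_NGROUPS_MAX` value are hypothetical stand-ins; the real code checks against NGROUPS_MAX.

```c
#include <errno.h>

#define TOY_NGROUPS_MAX	16	/* stand-in for the system's NGROUPS_MAX */

/* Hypothetical credential as supplied by userland. */
struct toy_cred {
	int	cr_ngroups;	/* number of group entries the caller claims */
};

/* Reject counts that would index outside the group array. */
int
toy_check_cred(const struct toy_cred *cr)
{
	if (cr->cr_ngroups < 0 || cr->cr_ngroups > TOY_NGROUPS_MAX)
		return EINVAL;
	return 0;
}
```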


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@
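
The reason for the _SAFE variants can be sketched in userland as follows. This is not the kernel's mountlist code: struct mnt, mnt_add() and mnt_reap() are hypothetical names, and the fallback macro is only there because some libcs ship <sys/queue.h> without TAILQ_FOREACH_SAFE. The safe form saves the next pointer before the loop body runs, so the current element may be removed and freed during the walk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Fallback for systems whose <sys/queue.h> lacks the _SAFE variants. */
#ifndef TAILQ_FOREACH_SAFE
#define TAILQ_FOREACH_SAFE(var, head, field, tvar)			\
	for ((var) = TAILQ_FIRST(head);					\
	    (var) != NULL && ((tvar) = TAILQ_NEXT(var, field), 1);	\
	    (var) = (tvar))
#endif

/* Hypothetical stand-in for a mount list entry. */
struct mnt {
	int			doomed;
	TAILQ_ENTRY(mnt)	link;
};
TAILQ_HEAD(mntlist, mnt);

/* Append a new entry; returns NULL if allocation fails. */
struct mnt *
mnt_add(struct mntlist *head, int doomed)
{
	struct mnt *mp;

	assert(head != NULL);
	mp = calloc(1, sizeof(*mp));
	if (mp == NULL)
		return NULL;
	mp->doomed = doomed;
	TAILQ_INSERT_TAIL(head, mp, link);
	return mp;
}

/* Remove and free every doomed entry; returns how many were removed. */
int
mnt_reap(struct mntlist *head)
{
	struct mnt *mp, *nmp;
	int n = 0;

	TAILQ_FOREACH_SAFE(mp, head, link, nmp) {
		if (!mp->doomed)
			continue;
		TAILQ_REMOVE(head, mp, link);	/* safe: nmp was saved first */
		free(mp);
		n++;
	}
	return n;
}
```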


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too
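
A short sketch of the kind of fix described, written for userland with a hypothetical helper name. Since the width of time_t varies between platforms, the value is cast to long long and printed with %lld so the format string and the argument always agree.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/*
 * Format a time_t as decimal seconds. The cast makes the argument match
 * %lld regardless of how wide time_t is on the target.
 */
int
format_time(char *buf, size_t len, time_t t)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%lld", (long long)t);
}
```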


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continued performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__)))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@
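
The behaviour described can be sketched like this. The credential structure and function name below are simplified, hypothetical stand-ins and not the kernel's struct ucred or groupmember(); the sketch only shows the effective gid being tested before the supplementary group list.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

#define NGROUPS_SKETCH	16

/* Simplified, hypothetical credential record. */
struct cred_sketch {
	gid_t	egid;				/* effective gid */
	size_t	ngroups;			/* entries used in groups[] */
	gid_t	groups[NGROUPS_SKETCH];		/* supplementary groups */
};

/* Return 1 if gid is the effective gid or a supplementary group, else 0. */
int
groupmember_sketch(gid_t gid, const struct cred_sketch *cr)
{
	size_t i;

	assert(cr != NULL && cr->ngroups <= NGROUPS_SKETCH);
	if (cr->egid == gid)
		return 1;
	for (i = 0; i < cr->ngroups; i++)
		if (cr->groups[i] == gid)
			return 1;
	return 0;
}
```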


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.
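
The "simple struct full of function pointers" approach can be illustrated with a toy userland sketch. All names here (vops_sketch, vnode_sketch, toyfs_*) are hypothetical and far smaller than the real vnode operations; the sketch only shows a filesystem filling in a table directly and callers dispatching through it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct vnode_sketch;

/* Hypothetical per-filesystem operations table. */
struct vops_sketch {
	int	(*vop_open)(struct vnode_sketch *);
	int	(*vop_close)(struct vnode_sketch *);
};

/* Hypothetical vnode: a pointer to its operations plus some state. */
struct vnode_sketch {
	const struct vops_sketch	*v_op;
	int				 v_opencount;
};

static int
toyfs_open(struct vnode_sketch *vp)
{
	vp->v_opencount++;
	return 0;
}

static int
toyfs_close(struct vnode_sketch *vp)
{
	vp->v_opencount--;
	return 0;
}

/* The filesystem sets its function pointers directly. */
const struct vops_sketch toyfs_vops = {
	.vop_open	= toyfs_open,
	.vop_close	= toyfs_close,
};

/* Dispatch wrappers: one indirect call, or EOPNOTSUPP if unset. */
int
vop_open_sketch(struct vnode_sketch *vp)
{
	assert(vp != NULL && vp->v_op != NULL);
	if (vp->v_op->vop_open == NULL)
		return EOPNOTSUPP;
	return vp->v_op->vop_open(vp);
}

int
vop_close_sketch(struct vnode_sketch *vp)
{
	assert(vp != NULL && vp->v_op != NULL);
	if (vp->v_op->vop_close == NULL)
		return EOPNOTSUPP;
	return vp->v_op->vop_close(vp);
}
```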


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@
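
The rule stated above can be written down as a small predicate. This is only a sketch with a hypothetical function name and a simplified signature, not the kernel's access check: it answers whether the superuser would be granted X_OK, given whether the file is a directory and its permission bits.

```c
#include <assert.h>

/* Any of the owner, group or other execute bits. */
#define EXEC_BITS_SKETCH	0111u

/*
 * Return 1 if the superuser should be granted X_OK, 0 otherwise.
 * Directories are always searchable by root; other files need at
 * least one execute bit set somewhere in the mode.
 */
int
root_x_ok_sketch(int is_dir, unsigned int mode)
{
	assert((mode & ~07777u) == 0);	/* permission bits only */
	if (is_dir)
		return 1;
	return (mode & EXEC_BITS_SKETCH) != 0;
}
```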


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8; code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow pulling out a usb stick with ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint was being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock noone
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, noone uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into an own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing lengths checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from ours (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches for McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a files owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.293 26-Aug-2019 anton

When a thread tries to exclusively lock a vnode, the same thread must
ensure that any other thread currently trying to acquire the underlying
vnode lock has observed that the same vnode is about to be exclusively
locked. Such threads must then sleep until the exclusive lock has been
released and then try to acquire the lock again. Otherwise, exclusive
access to the vnode cannot be guaranteed.

Thanks to naddy@ and visa@ for testing; ok visa@

Reported-by: syzbot+374d0e7e2400004957f7@syzkaller.appspotmail.com


# 1.292 25-Jul-2019 cheloha

vinvalbuf(9): tsleep(9) -> tsleep_nsec(9); ok millert@


# 1.291 19-Jul-2019 cheloha

vwaitforio(9): tsleep(9) -> tsleep_nsec(9); ok visa@


# 1.290 28-Jun-2019 visa

Skip VFS barrier lock during normal operation to reduce overhead.
This removes a system-wide serialization point, which might help
finding timing-related bugs.

OK deraadt@ anton@


# 1.289 09-Jun-2019 beck

Add a temporary workaround to make removal of giant files better

mlarkin@ noticed we would freeze while removing enormous files because
of the amount of work done to invalidate buffers on unlink. This adds
a temporary workaround to ensure we give up the lock and yield while
doing this.

The longer term answer will be to move these buffers to another list
and not do the work here.

ok deraadt@


# 1.288 19-Apr-2019 visa

Add a subsystem lock for vfs_lockf.c. This enables calling lf_advlock()
and lf_purgelocks() without the kernel lock.

OK anton@ mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.287 02-Apr-2019 visa

Restrict which filesystems are available for swap. This rules out
obvious misconfigurations that cannot work.

OK mpi@ tedu@


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa
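
The idea in this commit can be sketched in userland C as follows. This is a minimal illustration, not the kernel code: `demo_vnode`, `DEMO_VBIOERROR`, `demo_write_done()` and `demo_fsync()` are hypothetical names, and the choice of `EIO` and of leaving the flag set after it has been reported are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag bit; the real flag name is not given in the log. */
#define DEMO_VBIOERROR	0x01

/* Hypothetical stand-in for struct vnode. */
struct demo_vnode {
	int v_bioflag;
};

/* Called when a buffer write completes; remembers any failure. */
void
demo_write_done(struct demo_vnode *vp, int error)
{
	assert(vp != NULL);
	if (error != 0)
		vp->v_bioflag |= DEMO_VBIOERROR;
}

/*
 * The failed buffer was already thrown away, so flushing what is
 * left succeeds. The sticky flag is what turns that into an error.
 */
int
demo_fsync(struct demo_vnode *vp)
{
	int error = 0;	/* pretend the remaining buffers flushed fine */

	assert(vp != NULL);
	if (error == 0 && (vp->v_bioflag & DEMO_VBIOERROR))
		error = EIO;	/* assumed error code */
	return error;
}
```

Without the flag, `demo_fsync()` would return 0 after a failed write, which is the lost-error case the commit describes.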


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@
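
A rough userland analogue of an atomically updated, unsigned reference count is shown below. It uses C11 `<stdatomic.h>` as a stand-in; the kernel uses its own atomic primitives, and `demo_vfsconf`, `demo_vfc_ref()` and `demo_vfc_unref()` are hypothetical names introduced only for this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct vfsconf. */
struct demo_vfsconf {
	const char *vfc_name;
	atomic_uint vfc_refcount;	/* unsigned, updated atomically */
};

/* Take a reference; returns the new count. */
unsigned int
demo_vfc_ref(struct demo_vfsconf *vfc)
{
	assert(vfc != NULL);
	return atomic_fetch_add(&vfc->vfc_refcount, 1) + 1;
}

/* Drop a reference; returns the new count. */
unsigned int
demo_vfc_unref(struct demo_vfsconf *vfc)
{
	unsigned int old;

	assert(vfc != NULL);
	old = atomic_fetch_sub(&vfc->vfc_refcount, 1);
	assert(old > 0);	/* an unsigned count must never wrap */
	return old - 1;
}
```

The read-modify-write happens in a single atomic step, so two threads mounting the same filesystem type at once cannot lose an update.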


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules are no longer, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@
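
The hazard described above can be sketched in userland with `<sys/queue.h>`. The names `demo_mount`, `demo_sync_may_sleep()` and `demo_walk()` are hypothetical, and the "sleep" is simulated by removing the following entry directly. Removed entries are only flagged, never freed, so the sketch can count stale visits without touching freed memory; it also assumes a queue implementation that leaves a removed entry's link untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for struct mount. */
struct demo_mount {
	int id;
	int unmounted;	/* set once the entry has left the list */
	TAILQ_ENTRY(demo_mount) link;
};
TAILQ_HEAD(demo_mntlist, demo_mount);

/* Build a list of n entries with ids 1..n. */
void
demo_build(struct demo_mntlist *list, struct demo_mount *m, int n)
{
	int i;

	TAILQ_INIT(list);
	for (i = 0; i < n; i++) {
		m[i].id = i + 1;
		m[i].unmounted = 0;
		TAILQ_INSERT_TAIL(list, &m[i], link);
	}
}

/* While we "sleep" on entry 1, another thread unmounts its successor. */
static void
demo_sync_may_sleep(struct demo_mntlist *list, struct demo_mount *mp)
{
	struct demo_mount *victim = TAILQ_NEXT(mp, link);

	if (mp->id == 1 && victim != NULL) {
		TAILQ_REMOVE(list, victim, link);
		victim->unmounted = 1;
	}
}

/*
 * Walk the list and return how many already-unmounted entries were
 * visited. cache_next != 0 mimics a _SAFE macro, which reads the
 * successor before the loop body runs and trusts it afterwards.
 */
int
demo_walk(struct demo_mntlist *list, int cache_next)
{
	struct demo_mount *mp, *nmp;
	int stale = 0;

	for (mp = TAILQ_FIRST(list); mp != NULL; mp = nmp) {
		nmp = TAILQ_NEXT(mp, link);	/* read before the sleep */
		if (mp->unmounted) {
			stale++;
			continue;
		}
		demo_sync_may_sleep(list, mp);
		if (!cache_next)
			nmp = TAILQ_NEXT(mp, link);	/* re-read afterwards */
	}
	return stale;
}
```

With three entries, the cached-successor walk lands on the removed entry once, while the walk that re-reads the successor after the call never does. In the kernel the removed mount point would have been freed, which is why the cached pointer is dangerous.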


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
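
A small sketch of the bounds-check pattern named in this commit: the enum value is cast to an unsigned type before comparing it with the table size, so a single comparison rejects both negative and too-large values. The enum, the table and `demo_vtype_name()` are hypothetical and only illustrate the idiom.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical enum and name table, for illustration only. */
enum demo_vtype { DEMO_VNON, DEMO_VREG, DEMO_VDIR, DEMO_VTYPE_COUNT };

static const char *const demo_names[] = { "VNON", "VREG", "VDIR" };

/* Return the name for a type, or NULL if the value is out of range. */
const char *
demo_vtype_name(enum demo_vtype type)
{
	size_t n = sizeof(demo_names) / sizeof(demo_names[0]);

	assert(n == DEMO_VTYPE_COUNT);	/* table and enum must agree */

	/* Unsigned cast: a negative value becomes huge and fails too. */
	if ((unsigned int)type >= n)
		return NULL;
	return demo_names[type];
}
```

Comparing the enum directly can let a compiler conclude the test is always true, because no declared enumerator is out of range; the cast makes the intent explicit.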


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb
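
As a rough illustration of a separately allocated clone bitmap, the sketch below hands out the lowest free instance number from a 1024-bit map. Only the 1024 limit comes from the commit message; the function names, the byte-array layout and the first-fit search starting at bit 0 are assumptions for this example.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_CLONE_BITS	1024	/* limit named in the commit message */

/* Allocate a zeroed bitmap apart from the device node itself. */
unsigned char *
demo_clone_map_alloc(void)
{
	return calloc(DEMO_CLONE_BITS / 8, 1);
}

/* Claim the lowest free instance number, or return -1 if none is left. */
int
demo_clone_get(unsigned char *map)
{
	int i;

	assert(map != NULL);
	for (i = 0; i < DEMO_CLONE_BITS; i++) {
		if ((map[i / 8] & (1u << (i % 8))) == 0) {
			map[i / 8] |= (unsigned char)(1u << (i % 8));
			return i;
		}
	}
	return -1;
}

/* Release a previously claimed instance number. */
void
demo_clone_put(unsigned char *map, int i)
{
	assert(map != NULL && i >= 0 && i < DEMO_CLONE_BITS);
	map[i / 8] &= (unsigned char)~(1u << (i % 8));
}
```

Keeping the map behind a pointer costs 128 bytes only for nodes that clone, instead of growing every device node.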


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
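
The fragility of chained assignments across mixed types can be shown with a two-field sketch. `demo_vattr` and its fields are hypothetical, and `DEMO_VNOVAL` is assumed to be -1 like the kernel's "no value" marker. In C the value of an assignment expression is the value stored in its left operand, so the outer field receives the already-converted inner value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_VNOVAL	(-1)	/* assumed "no value" marker */

/* Hypothetical two-field stand-in for struct vattr. */
struct demo_vattr {
	uint16_t va_small;	/* narrow unsigned field */
	int64_t  va_big;	/* wide signed field */
};

/* Chained: va_big gets va_small's converted value, 65535, not -1. */
void
demo_null_chained(struct demo_vattr *vap)
{
	assert(vap != NULL);
	vap->va_big = vap->va_small = DEMO_VNOVAL;
}

/* Separate statements: each field is converted from -1 on its own. */
void
demo_null_separate(struct demo_vattr *vap)
{
	assert(vap != NULL);
	vap->va_small = DEMO_VNOVAL;
	vap->va_big = DEMO_VNOVAL;
}
```

Swapping the order of the two fields in the chained form would happen to give the intended result, which is the ordering dependence the commit message warns about.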


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in byte, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and can die.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@
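
The validation described above amounts to a range check on a count that comes from userland before it is used to index or copy the group array. A minimal sketch follows; `demo_xucred`, `DEMO_NGROUPS_MAX` and `demo_check_cred()` are hypothetical, and the limit of 16 is chosen only for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_NGROUPS_MAX	16	/* assumed limit for illustration */

/* Hypothetical stand-in for credentials passed in from userland. */
struct demo_xucred {
	unsigned int cr_uid;
	unsigned int cr_gid;
	short cr_ngroups;
	unsigned int cr_groups[DEMO_NGROUPS_MAX];
};

/* Reject a group count that is negative or larger than the array. */
int
demo_check_cred(const struct demo_xucred *xcr)
{
	assert(xcr != NULL);
	if (xcr->cr_ngroups < 0 || xcr->cr_ngroups > DEMO_NGROUPS_MAX)
		return EINVAL;
	return 0;
}
```

Both bounds matter because the count is signed: a negative value would pass an upper-bound-only check and then be misused as a length.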


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too
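
The format-string fix follows a common portable idiom: the width of `time_t` varies between platforms, so it is cast to `long long` and printed with `%lld`. A minimal userland sketch, with `demo_format_time()` as a hypothetical helper:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

/*
 * Format a time_t as decimal seconds. The cast makes the argument
 * match %lld regardless of how wide time_t is on this platform.
 */
int
demo_format_time(char *buf, size_t len, time_t t)
{
	assert(buf != NULL);
	return snprintf(buf, len, "%lld", (long long)t);
}
```

Passing a `time_t` straight to `%ld` or `%d` is undefined behaviour wherever the sizes differ, which is the class of bug this commit removes.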


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continued performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__)))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@
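
A sketch of a membership test that also considers the effective gid, as this commit describes. `demo_ucred` and `demo_groupmember()` are hypothetical stand-ins, not the kernel definitions, and the array size is arbitrary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ucred. */
struct demo_ucred {
	unsigned int cr_gid;		/* effective gid */
	short cr_ngroups;		/* number of supplementary groups */
	unsigned int cr_groups[16];	/* supplementary groups */
};

/* Return 1 if gid is the effective gid or a supplementary group. */
int
demo_groupmember(unsigned int gid, const struct demo_ucred *cred)
{
	int i;

	assert(cred != NULL);
	if (cred->cr_gid == gid)
		return 1;
	for (i = 0; i < cred->cr_ngroups; i++) {
		if (cred->cr_groups[i] == gid)
			return 1;
	}
	return 0;
}
```

Checking the effective gid first gives the same answer whether or not that gid also appears in the supplementary list.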


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisation issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8; code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stata
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow pulling out a usb stick with ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into its own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing lengths checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Addition v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from our (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches for McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a file's owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.292 25-Jul-2019 cheloha

vinvalbuf(9): tsleep(9) -> tsleep_nsec(9); ok millert@


# 1.291 19-Jul-2019 cheloha

vwaitforio(9): tsleep(9) -> tsleep_nsec(9); ok visa@


# 1.290 28-Jun-2019 visa

Skip VFS barrier lock during normal operation to reduce overhead.
This removes a system-wide serialization point, which might help
finding timing-related bugs.

OK deraadt@ anton@


# 1.289 09-Jun-2019 beck

Add a temporary workaround to make removal of giant files better

mlarkin@ noticed we would freeze while removing enormous files because
of the amount of work done to invalidate buffers on unlink. This adds
a temporary workaround to ensure we give up the lock and yield while
doing this.

The longer term answer will be to move these buffers to another list
and not do the work here.

ok deraadt@


# 1.288 19-Apr-2019 visa

Add a subsystem lock for vfs_lockf.c. This enables calling lf_advlock()
and lf_purgelocks() without the kernel lock.

OK anton@ mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.287 02-Apr-2019 visa

Restrict which filesystems are available for swap. This rules out
obvious misconfigurations that cannot work.

OK mpi@ tedu@


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa
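
The sticky-error idea in this commit can be illustrated with a small userland model. This is only a sketch: `toy_vnode`, `TOY_V_WRITE_ERROR`, `toy_write_done()` and `toy_fsync()` are hypothetical names invented here, not the kernel's identifiers, and the real code path is considerably more involved.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flag; stands in for the per-vnode "past error" marker. */
#define TOY_V_WRITE_ERROR 0x01

struct toy_vnode {
	int flags;       /* sticky state bits */
	int dirty_bufs;  /* buffers still waiting to be written */
};

/* A buffer write finished; `failed` simulates an I/O error. */
void
toy_write_done(struct toy_vnode *vp, bool failed)
{
	if (vp->dirty_bufs > 0)
		vp->dirty_bufs--;   /* the buffer is discarded either way */
	if (failed)
		vp->flags |= TOY_V_WRITE_ERROR;   /* remember the lost write */
}

/* Reports the remembered error even though no dirty buffers remain. */
int
toy_fsync(const struct toy_vnode *vp)
{
	if (vp->flags & TOY_V_WRITE_ERROR)
		return EIO;
	return 0;
}
```

Without the flag, `toy_fsync()` would see zero dirty buffers after the failed write and report success, which is the lost-error case the commit message describes.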


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules are no longer, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@
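
The race described above can be reproduced in a userland toy. The sketch below assumes a `sys/queue.h` that provides the basic TAILQ macros; `toy_mount`, `toy_sleep()` and `DEMO_FOREACH_SAFE` are hypothetical names, and the macro only mirrors the usual shape of a _SAFE iterator. The "unmount" here merely unlinks the entry rather than freeing it, so the demonstration stays well defined.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/queue.h>

/* Toy stand-in for struct mount; all names are illustrative only. */
struct toy_mount {
	bool unmounted;                 /* set once unlinked from the list */
	TAILQ_ENTRY(toy_mount) entry;
};
TAILQ_HEAD(toy_mntlist, toy_mount);

/* Same shape as a typical _SAFE iterator: the successor is read
 * before the loop body runs. */
#define DEMO_FOREACH_SAFE(var, head, field, tvar)			\
	for ((var) = TAILQ_FIRST(head);					\
	    (var) != NULL && ((tvar) = TAILQ_NEXT(var, field), 1);	\
	    (var) = (tvar))

/* Simulates another process unmounting `victim` while we sleep. */
static void
toy_sleep(struct toy_mntlist *list, struct toy_mount *victim)
{
	if (!victim->unmounted) {
		TAILQ_REMOVE(list, victim, entry);
		victim->unmounted = true;
	}
}

static void
toy_setup(struct toy_mntlist *list, struct toy_mount *m, int n)
{
	TAILQ_INIT(list);
	for (int i = 0; i < n; i++) {
		m[i].unmounted = false;
		TAILQ_INSERT_TAIL(list, &m[i], entry);
	}
}

/* Counts already-unmounted entries handed to the loop body. */
int
stale_visits_with_safe(void)
{
	struct toy_mntlist list;
	struct toy_mount m[3], *mp, *nmp;
	int stale = 0;

	toy_setup(&list, m, 3);
	DEMO_FOREACH_SAFE(mp, &list, entry, nmp) {
		if (mp->unmounted) {
			stale++;
			break;	/* never follow a removed entry's pointers */
		}
		toy_sleep(&list, &m[1]);	/* m[1] vanishes during visit 1 */
	}
	return stale;
}

int
stale_visits_plain(void)
{
	struct toy_mntlist list;
	struct toy_mount m[3], *mp;
	int stale = 0;

	toy_setup(&list, m, 3);
	TAILQ_FOREACH(mp, &list, entry) {
		if (mp->unmounted) {
			stale++;
			break;
		}
		toy_sleep(&list, &m[1]);
	}
	return stale;
}
```

The _SAFE variant hands the removed entry to the loop body because it cached the successor before the sleep, while the plain loop re-reads the link from the still-busy current entry and skips it. In the kernel the removed mount point would have been freed, which turns the stale visit into a use-after-free.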


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
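
A minimal sketch of the cast technique, assuming a lookup table indexed by an enum. The names `toy_vtype`, `toy_typename` and `toy_vtype_name()` are made up for illustration and are not taken from the kernel source.

```c
#include <stddef.h>

/* Hypothetical enum and table, standing in for the kernel's own. */
enum toy_vtype { TOY_VNON, TOY_VREG, TOY_VDIR, TOY_VBLK };

static const char *const toy_typename[] = { "VNON", "VREG", "VDIR", "VBLK" };

#define TOY_NITEMS(a) (sizeof(a) / sizeof((a)[0]))

/* Returns the printable name, or NULL when `type` is out of range. */
const char *
toy_vtype_name(enum toy_vtype type)
{
	/*
	 * Comparing as unsigned makes the range check explicit: a value
	 * that would be negative as a signed int becomes very large and
	 * is rejected by the same comparison as a too-large value.
	 */
	if ((unsigned int)type >= TOY_NITEMS(toy_typename))
		return NULL;
	return toy_typename[type];
}
```

The cast states the intent of the check without relying on whichever underlying type the compiler picks for the enum.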


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb
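
A separately allocated clone bitmap could look roughly like the sketch below. Only the limit of 1024 comes from the commit message. The structure and function names are hypothetical, and allocating the map lazily on first use is an assumption made here to show why ordinary device nodes pay no memory cost.

```c
#include <limits.h>
#include <stdlib.h>

#define TOY_NCLONES  1024   /* limit quoted in the commit message */
#define TOY_WORDBITS (sizeof(unsigned long) * CHAR_BIT)
#define TOY_NWORDS   (TOY_NCLONES / TOY_WORDBITS)

/* Hypothetical device node: only a pointer lives in every node. */
struct toy_devnode {
	unsigned long *clonemap;   /* NULL until the device is first cloned */
};

/* Returns a free clone index, or -1 if none is left or malloc failed. */
int
toy_clone_alloc(struct toy_devnode *dn)
{
	if (dn->clonemap == NULL) {
		dn->clonemap = calloc(TOY_NWORDS, sizeof(*dn->clonemap));
		if (dn->clonemap == NULL)
			return -1;
	}
	for (int i = 0; i < TOY_NCLONES; i++) {
		unsigned long bit = 1UL << (i % TOY_WORDBITS);

		if ((dn->clonemap[i / TOY_WORDBITS] & bit) == 0) {
			dn->clonemap[i / TOY_WORDBITS] |= bit;   /* claim it */
			return i;
		}
	}
	return -1;
}

/* Marks clone index `i` as free again; out-of-range values are ignored. */
void
toy_clone_free(struct toy_devnode *dn, int i)
{
	if (dn->clonemap != NULL && i >= 0 && i < TOY_NCLONES)
		dn->clonemap[i / TOY_WORDBITS] &= ~(1UL << (i % TOY_WORDBITS));
}
```

With an embedded 1024-bit map every node would grow by 128 bytes, whereas the pointer costs one word per node.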


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
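
The fragility of chained assignments can be seen with two fields of different width. This is a sketch only: `toy_vattr` and its two members are stand-ins invented here, not the real struct vattr layout.

```c
#include <stdint.h>

/* Hypothetical two-field attribute record. */
struct toy_vattr {
	uint64_t wide;     /* stands in for a 64-bit field */
	uint16_t narrow;   /* stands in for a 16-bit field */
};

/* Unsafe order: the value passes through the narrow field first. */
void
toy_null_chained_unsafe(struct toy_vattr *va)
{
	va->wide = va->narrow = -1;   /* wide becomes 0xffff, not all ones */
}

/* Safe order: the widest field is assigned first, then truncated. */
void
toy_null_chained_safe(struct toy_vattr *va)
{
	va->narrow = va->wide = -1;
}

/* Separate statements: each field converts -1 on its own. */
void
toy_null_separate(struct toy_vattr *va)
{
	va->wide = -1;
	va->narrow = -1;
}
```

The value of an assignment expression is the value stored in its left operand, so in the unsafe order the 64-bit field receives the already truncated 16-bit result. Separate statements remove the dependence on ordering altogether.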


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in byte, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and dies.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continuing performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__)))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell, never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the "buf_acquire() does the bremfree()" change, made since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''; thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisation issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8; the code is much older.


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard-coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow pulling out a usb stick with an ffs filesystem while it is mounted
and a file is being written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers forever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into an own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing lengths checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from ours (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches for McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a file's owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.291 19-Jul-2019 cheloha

vwaitforio(9): tsleep(9) -> tsleep_nsec(9); ok visa@


# 1.290 28-Jun-2019 visa

Skip VFS barrier lock during normal operation to reduce overhead.
This removes a system-wide serialization point, which might help
finding timing-related bugs.

OK deraadt@ anton@


# 1.289 09-Jun-2019 beck

Add a temporary workaround to make removal of giant files better

mlarkin@ noticed we would freeze while removing enormous files because
of the amount of work done to invalidate buffers on unlink. This adds
a temporary workaround to ensure we give up the lock and yield while
doing this.

The longer term answer will be to move these buffers to another list
and not do the work here.

ok deraadt@


# 1.288 19-Apr-2019 visa

Add a subsystem lock for vfs_lockf.c. This enables calling lf_advlock()
and lf_purgelocks() without the kernel lock.

OK anton@ mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.287 02-Apr-2019 visa

Restrict which filesystems are available for swap. This rules out
obvious misconfigurations that cannot work.

OK mpi@ tedu@


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa
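
The sticky-error idea described in 1.286 can be sketched in user-space C as
follows. This is only an illustration under stated assumptions: struct
toy_vnode, toy_write_done() and toy_fsync() are hypothetical names, not the
kernel's, and the choice of EIO and of never clearing the flag is a guess made
for the sketch rather than something the commit message states.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a vnode; not the kernel's struct vnode. */
struct toy_vnode {
	bool write_error_seen;	/* sticky: set once any write has failed */
};

/* Record the result of a finished write; status is 0 or an errno value. */
void
toy_write_done(struct toy_vnode *vp, int status)
{
	if (status != 0)
		vp->write_error_seen = true;
}

/* Report EIO if any earlier write failed, otherwise report success. */
int
toy_fsync(const struct toy_vnode *vp)
{
	return vp->write_error_seen ? EIO : 0;
}
```

The point of the flag is that the failed buffer itself is gone by the time
fsync runs, so the error has to be remembered somewhere that outlives it.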


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentaly considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@
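
A rough user-space analogue of the 1.282 change, using C11 atomics instead of
the kernel's own atomic primitives. struct toy_vfsconf, toy_ref() and
toy_unref() are hypothetical names chosen for this sketch; only the idea of an
unsigned int reference count updated atomically comes from the commit message.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for struct vfsconf; only the counter is modelled. */
struct toy_vfsconf {
	atomic_uint refcount;	/* unsigned, updated with atomic operations */
};

/* Take a reference and return the new count. */
unsigned int
toy_ref(struct toy_vfsconf *vfc)
{
	/* atomic_fetch_add returns the value held before the addition */
	return atomic_fetch_add(&vfc->refcount, 1) + 1;
}

/* Drop a reference and return the new count. */
unsigned int
toy_unref(struct toy_vfsconf *vfc)
{
	/* atomic_fetch_sub returns the value held before the subtraction */
	return atomic_fetch_sub(&vfc->refcount, 1) - 1;
}
```

With atomic updates, concurrent callers do not need a separate lock just to
keep the counter consistent.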


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules are no longer, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@
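
The hazard described in 1.257 can be sketched in user space with
<sys/queue.h>. This is an illustration under stated assumptions, not the
kernel code: demo_mount, demo_sleep_and_lose_next() and demo_walk() are
hypothetical names, and the "sleep" is simulated by a helper that frees
the successor of the entry currently being visited:

```c
#include <stdlib.h>
#include <sys/queue.h>

struct demo_mount {
	int id;
	TAILQ_ENTRY(demo_mount) link;
};
TAILQ_HEAD(demo_mntlist, demo_mount);

/*
 * Stand-in for work that sleeps: while we "sleep", the successor of the
 * busy entry is unmounted and freed by someone else.
 */
void
demo_sleep_and_lose_next(struct demo_mntlist *head, struct demo_mount *busy)
{
	struct demo_mount *victim = TAILQ_NEXT(busy, link);

	if (victim != NULL) {
		TAILQ_REMOVE(head, victim, link);
		free(victim);
	}
}

/*
 * Walk the list with the plain FOREACH macro. The successor is read from
 * the current (busy, hence still valid) entry only after the sleep.
 * Returns the number of entries visited.
 */
int
demo_walk(int nmounts)
{
	struct demo_mntlist head;
	struct demo_mount *mp;
	int i, visited = 0;

	TAILQ_INIT(&head);
	for (i = 0; i < nmounts; i++) {
		if ((mp = malloc(sizeof(*mp))) == NULL)
			abort();
		mp->id = i;
		TAILQ_INSERT_TAIL(&head, mp, link);
	}

	TAILQ_FOREACH(mp, &head, link) {
		demo_sleep_and_lose_next(&head, mp);
		visited++;
	}

	/* Free whatever survived the walk. */
	while ((mp = TAILQ_FIRST(&head)) != NULL) {
		TAILQ_REMOVE(&head, mp, link);
		free(mp);
	}
	return visited;
}
```

A _SAFE variant would have saved the pointer to the successor before the
sleep and then stepped onto freed memory, which is why the sketch does
not run that variant.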


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
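
The log does not show the check that 1.254 changed. The following is a
sketch of the general pattern only, with a hypothetical enum and name
table (demo_vtype, demo_names):

```c
/* Hypothetical enum standing in for a kernel type tag. */
enum demo_vtype { DEMO_VNON, DEMO_VREG, DEMO_VDIR, DEMO_NTYPES };

static const char *const demo_names[DEMO_NTYPES] = { "VNON", "VREG", "VDIR" };

const char *
demo_vtype_name(enum demo_vtype t)
{
	/*
	 * The cast makes the comparison explicitly unsigned, so a
	 * corrupted or negative value fails the check instead of
	 * indexing outside the table.
	 */
	if ((unsigned int)t >= DEMO_NTYPES)
		return "unknown";
	return demo_names[t];
}
```

A single unsigned comparison covers both "too large" and "negative",
and it does not depend on which underlying integer type the compiler
picks for the enum.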


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@
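
A sketch of the kind of guard 1.249 describes. The structure layout and
the error code are assumptions: demo_vfsops is a hypothetical one-member
operations table, and EOPNOTSUPP is chosen here only for illustration,
since the log does not say which error the kernel returns:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical operations table; a filesystem may leave the slot NULL. */
struct demo_vfsops {
	int (*vfs_sysctl)(int name);
};

int
demo_call_sysctl(const struct demo_vfsops *ops, int name)
{
	/* Refuse instead of jumping through a NULL function pointer. */
	if (ops->vfs_sysctl == NULL)
		return EOPNOTSUPP;	/* assumed error code */
	return ops->vfs_sysctl(name);
}

/* Example handler for a filesystem that does implement the hook. */
int
demo_fs_sysctl(int name)
{
	return name == 1 ? 0 : EINVAL;
}
```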


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb
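
A rough sketch of a separately allocated 1024-bit clone bitmap, as
described in 1.241. Everything here is hypothetical (demo_dev,
demo_clone_get, demo_clone_put), and allocating the bitmap lazily on
first use is an assumption of this sketch, not something the log states:

```c
#include <limits.h>
#include <stdlib.h>

#define DEMO_NCLONES	1024
#define DEMO_WORDBITS	(sizeof(unsigned long) * CHAR_BIT)

struct demo_dev {
	unsigned long *clonebits;	/* NULL until the first clone */
};

/* Hand out the lowest free clone number, or -1 if none is left. */
int
demo_clone_get(struct demo_dev *dev)
{
	size_t i;

	if (dev->clonebits == NULL) {
		dev->clonebits = calloc(DEMO_NCLONES / DEMO_WORDBITS,
		    sizeof(unsigned long));
		if (dev->clonebits == NULL)
			return -1;
	}
	for (i = 0; i < DEMO_NCLONES; i++) {
		unsigned long mask = 1UL << (i % DEMO_WORDBITS);

		if ((dev->clonebits[i / DEMO_WORDBITS] & mask) == 0) {
			dev->clonebits[i / DEMO_WORDBITS] |= mask;
			return (int)i;
		}
	}
	return -1;
}

/* Return clone number n to the pool. */
void
demo_clone_put(struct demo_dev *dev, int n)
{
	size_t bit = (size_t)n;

	dev->clonebits[bit / DEMO_WORDBITS] &= ~(1UL << (bit % DEMO_WORDBITS));
}
```

Keeping only a pointer in the device structure means nodes that are
never cloned pay for one pointer rather than 128 bytes of bitmap.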


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
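
Why chained assignments are fragile can be shown with two fields of
different width and signedness. The struct below is a hypothetical
stand-in, not struct vattr:

```c
#include <stdint.h>

/* Hypothetical attribute record mixing field widths and signedness. */
struct demo_attr {
	long size;
	uint16_t mode;
};

/* Chained: the value stored in size is the already-narrowed mode value. */
void
demo_null_chained(struct demo_attr *a)
{
	a->size = a->mode = -1;	/* size becomes 65535, not -1 */
}

/* One statement per field: each field gets -1 converted to its own type. */
void
demo_null_separate(struct demo_attr *a)
{
	a->size = -1;
	a->mode = -1;
}
```

The value of `a->mode = -1` has the type of `a->mode`, so the outer
assignment sees 65535. Reordering the fields in the chain changes the
result, which is the fragility the commit message refers to.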


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in bytes, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and dies.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@
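
The range check from 1.229 amounts to the following sketch.
DEMO_NGROUPS_MAX is a hypothetical constant standing in for NGROUPS_MAX,
and demo_check_ngroups() is not a kernel function:

```c
#include <errno.h>

#define DEMO_NGROUPS_MAX 16	/* hypothetical; the kernel uses NGROUPS_MAX */

/* Validate a caller-supplied group count before it is used as a length. */
int
demo_check_ngroups(int ngroups)
{
	if (ngroups < 0 || ngroups > DEMO_NGROUPS_MAX)
		return EINVAL;
	return 0;
}
```

Rejecting the value up front keeps a hostile or buggy count from ever
being used to size a copy into the fixed-length group array.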


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free
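
A small user-space illustration of the idiom, using calloc(3) in place
of the kernel allocator; demo_netcred is a hypothetical structure:

```c
#include <stdlib.h>

struct demo_netcred {
	int exflags;
	unsigned int gids[16];
};

struct demo_netcred *
demo_netcred_alloc(void)
{
	struct demo_netcred *np;

	/* sizeof(*np) follows np's type automatically if it ever changes. */
	np = calloc(1, sizeof(*np));
	return np;
}
```

If the pointer's type is later changed, the allocation size changes
with it, whereas a spelled-out sizeof(struct ...) has to be found and
updated by hand.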


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continued performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__)))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@
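
A sketch of a membership test that also considers the effective gid, as
1.199 describes. The credential layout below is hypothetical and only
mirrors the idea, not the kernel's struct ucred:

```c
#define DEMO_NGROUPS 16

struct demo_ucred {
	unsigned int cr_gid;			/* effective gid */
	int cr_ngroups;				/* entries used in cr_groups */
	unsigned int cr_groups[DEMO_NGROUPS];	/* supplementary groups */
};

/* Return 1 if gid is the effective gid or a supplementary group. */
int
demo_groupmember(unsigned int gid, const struct demo_ucred *cred)
{
	int i;

	if (cred->cr_gid == gid)
		return 1;
	for (i = 0; i < cred->cr_ngroups; i++)
		if (cred->cr_groups[i] == gid)
			return 1;
	return 0;
}
```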


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used to let checkdirs() avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@
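
The rule from 1.167 can be written as a small predicate. This is a
sketch of the stated semantics only: demo_root_x_ok() is a hypothetical
helper, and treating directories as always searchable by the superuser
is inferred from the "on non-directories" wording above:

```c
#define DEMO_ANY_EXEC 0111	/* owner, group and other execute bits */

/*
 * Would an X_OK check succeed for the superuser?
 * is_dir: nonzero if the file is a directory.
 * mode:   permission bits of the file.
 */
int
demo_root_x_ok(int is_dir, unsigned int mode)
{
	if (is_dir)
		return 1;
	return (mode & DEMO_ANY_EXEC) != 0;
}
```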


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisation issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8; the code is much older.


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard-coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow pulling out a usb stick with an ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock noone
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, noone uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into an own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing lengths checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from our (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches from McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a files owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.290 28-Jun-2019 visa

Skip VFS barrier lock during normal operation to reduce overhead.
This removes a system-wide serialization point, which might help
finding timing-related bugs.

OK deraadt@ anton@


# 1.289 09-Jun-2019 beck

Add a temporary workaround to make removal of giant files better

mlarkin@ noticed we would freeze while removing enormous files because
of the amount of work done to invalidate buffers on unlink. This adds
a temporary workaround to ensure we give up the lock and yield while
doing this.

The longer term answer will be to move these buffers to another list
and not do the work here.

ok deraadt@


# 1.288 19-Apr-2019 visa

Add a subsystem lock for vfs_lockf.c. This enables calling lf_advlock()
and lf_purgelocks() without the kernel lock.

OK anton@ mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.287 02-Apr-2019 visa

Restrict which filesystems are available for swap. This rules out
obvious misconfigurations that cannot work.

OK mpi@ tedu@


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa
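
The mechanism described in this entry can be sketched in a few lines of C. This is an illustration only, not the committed kernel code: `struct fake_vnode`, `FV_WRITE_ERROR`, `fake_write_done()` and `fake_fsync()` are hypothetical names, and the choice of `EIO` as the reported error is an assumption. The log entry does not say whether or when the flag is cleared, so the sketch leaves it set.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag: a past write on this vnode has failed. */
#define FV_WRITE_ERROR	0x01

/* Hypothetical, heavily reduced stand-in for struct vnode. */
struct fake_vnode {
	unsigned int flags;
	int dirty_bufs;
};

/*
 * Called when one buffer write completes. On failure the buffer is
 * thrown away, but the failure is remembered on the vnode.
 */
void
fake_write_done(struct fake_vnode *vp, int error)
{
	assert(vp->dirty_bufs > 0);
	vp->dirty_bufs--;
	if (error != 0)
		vp->flags |= FV_WRITE_ERROR;
}

/*
 * Without the flag, a later fsync would find nothing left to write
 * and report success. With it, the earlier failure is reported.
 */
int
fake_fsync(const struct fake_vnode *vp)
{
	if (vp->flags & FV_WRITE_ERROR)
		return EIO;	/* assumed error code for this sketch */
	return 0;
}
```

The point of the design is that the error outlives the discarded buffer: the vnode, not the buffer, carries the record of the failure.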


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules are no longer, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@
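
The hazard described in this entry can be shown with a small user-space sketch. It is not the kernel code: `struct fake_mount`, its `alive` field and all function names are hypothetical, the walk is forward rather than reverse, and "unmounted by another process while we sleep" is modelled by the loop body unlinking the following entry and marking it dead instead of freeing it. The cached-next loop is written out by hand so that the sketch does not depend on a `TAILQ_FOREACH_SAFE` macro being present in `<sys/queue.h>`.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for struct mount; alive == 0 models "freed". */
struct fake_mount {
	TAILQ_ENTRY(fake_mount) entry;
	int alive;
};
TAILQ_HEAD(fake_mntlist, fake_mount);

/* Build a list out of n caller-provided nodes. */
void
fill_list(struct fake_mntlist *list, struct fake_mount *nodes, size_t n)
{
	size_t i;

	assert(n > 0);
	TAILQ_INIT(list);
	for (i = 0; i < n; i++) {
		nodes[i].alive = 1;
		TAILQ_INSERT_TAIL(list, &nodes[i], entry);
	}
}

/* Models a sleep during which the entry after mp gets unmounted. */
static void
body_that_sleeps(struct fake_mntlist *list, struct fake_mount *mp)
{
	struct fake_mount *victim = TAILQ_NEXT(mp, entry);

	if (victim != NULL) {
		TAILQ_REMOVE(list, victim, entry);
		victim->alive = 0;	/* a kernel would free it here */
	}
}

/* Cache the next pointer before the body runs, as _SAFE macros do. */
int
walk_cached_next(struct fake_mntlist *list)
{
	struct fake_mount *mp, *nmp;
	int stale = 0;

	for (mp = TAILQ_FIRST(list); mp != NULL; mp = nmp) {
		if (!mp->alive) {	/* cached pointer went stale */
			stale++;
			break;
		}
		nmp = TAILQ_NEXT(mp, entry);
		body_that_sleeps(list, mp);
	}
	return stale;
}

/* Re-read the next pointer after the body, as plain FOREACH does. */
int
walk_reread_next(struct fake_mntlist *list)
{
	struct fake_mount *mp;
	int stale = 0;

	TAILQ_FOREACH(mp, list, entry) {
		if (!mp->alive)
			stale++;
		body_that_sleeps(list, mp);
	}
	return stale;
}
```

In this model the cached-next walk lands on an entry that is no longer on the list, while the plain walk never does, because the current entry stays valid and its next pointer is read only after the body returns. The _SAFE variants protect against the loop body removing the current element, not against someone else removing the next one.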


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.
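
A minimal sketch of the cast idiom described above, not the committed code. The enum, the name table and both function names are hypothetical stand-ins; the only point is that a single unsigned comparison covers both the "negative" and the "too large" case without a separate `>= 0` test that the compiler may consider always true.

```c
#include <stdbool.h>

/* Hypothetical enum standing in for a kernel enum such as a vnode type. */
enum demo_type { DEMO_NONE, DEMO_REG, DEMO_DIR, DEMO_COUNT };

static const char *const demo_names[DEMO_COUNT] = { "none", "reg", "dir" };

/*
 * Bounds check with the value cast to unsigned int: a negative value
 * becomes a very large unsigned value and fails the same comparison
 * that rejects values past the end of the table.
 */
bool
demo_type_valid(enum demo_type t)
{
	return (unsigned int)t < (unsigned int)DEMO_COUNT;
}

/* Index the table only after the bounds check has passed. */
const char *
demo_type_name(enum demo_type t)
{
	return demo_type_valid(t) ? demo_names[t] : "invalid";
}
```

With this shape, `demo_type_name((enum demo_type)-1)` falls through to the "invalid" string instead of indexing outside the table.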

ok jca@, tedu@


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.
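
The fragility of chained assignments can be illustrated with a small userland sketch. This is not the vattr_null() code; the marker `DEMO_NOVAL` and both functions are hypothetical, and only two types are mixed instead of six. In the chained form the value stored in the wide variable is the already-converted value of the narrow one.

```c
#include <limits.h>

/* Hypothetical "no value" marker, in the spirit of a -1 sentinel. */
#define DEMO_NOVAL	(-1)

/*
 * Chained form with the types in an unsafe order: the narrow unsigned
 * variable is assigned first, and its converted value is what reaches
 * the wide signed variable.
 */
long long
demo_chained(void)
{
	unsigned char narrow;
	long long wide;
	int noval = DEMO_NOVAL;

	wide = narrow = noval;
	return wide;		/* UCHAR_MAX, not -1 */
}

/* Separate statements: each variable is converted from the sentinel itself. */
long long
demo_separate(void)
{
	unsigned char narrow;
	long long wide;
	int noval = DEMO_NOVAL;

	narrow = noval;
	wide = noval;
	(void)narrow;
	return wide;		/* -1 */
}
```

Writing one assignment per statement removes the dependence on the order of the operands, which is the design choice the commit message argues for.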

ok deraadt@ miod@


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in byte, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and can die.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX
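
The range check described above might look roughly like the following sketch. The limit `DEMO_NGROUPS_MAX` and the function name are assumptions made for illustration; the real code checks the credentials supplied with the export request against the system's NGROUPS_MAX.

```c
#include <errno.h>

/* Hypothetical limit standing in for the system's NGROUPS_MAX. */
#define DEMO_NGROUPS_MAX	16

/*
 * Validate a caller-supplied group count before it is used as a loop
 * bound or copy length. Returns 0 when acceptable, EINVAL otherwise.
 */
int
demo_check_ngroups(int ngroups)
{
	if (ngroups < 0 || ngroups > DEMO_NGROUPS_MAX)
		return EINVAL;
	return 0;
}
```

Rejecting the value up front means later code can trust the count when it walks the group array.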

Fixes panic seen by henning@


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free
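
A userland sketch of the idiom, using the standard calloc() and free() rather than the kernel allocator; the struct and function names are hypothetical. The size expression is tied to the pointer being assigned, so it stays correct if the pointer's type is later changed.

```c
#include <stdlib.h>

/* Hypothetical structure used only for this illustration. */
struct demo_node {
	int id;
	struct demo_node *next;
};

/*
 * Allocate, use and release one node. sizeof(*n) follows the type of
 * n automatically; sizeof(struct demo_node) would have to be edited
 * by hand if n ever pointed at a different type.
 */
int
demo_roundtrip(int id)
{
	struct demo_node *n;
	int seen;

	n = calloc(1, sizeof(*n));
	if (n == NULL)
		return -1;
	n->id = id;
	seen = n->id;
	free(n);
	return seen;
}
```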


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continued performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__)))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.
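
A rough sketch of the check described above, assuming a simplified interface: the real groupmember() takes a credential structure, whereas this hypothetical function takes the effective gid and the supplementary group list as separate arguments.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if gid matches the effective gid or any supplementary
 * group. Checking the effective gid first covers the case where it is
 * not also present in the supplementary list.
 */
bool
demo_groupmember(unsigned int gid, unsigned int egid,
    const unsigned int *groups, size_t ngroups)
{
	size_t i;

	if (gid == egid)
		return true;
	for (i = 0; i < ngroups; i++) {
		if (groups[i] == gid)
			return true;
	}
	return false;
}
```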

ok beck@


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@
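
The X_OK rule for the superuser can be sketched as follows. This is an illustration of the stated semantics only, not the kernel's access-check code; the function name and the `DEMO_EXEC_BITS` mask are hypothetical, and the caller is assumed to already know whether the file is a directory.

```c
#include <stdbool.h>

/* Execute bits for owner, group and other (octal 0111). */
#define DEMO_EXEC_BITS	0111u

/*
 * Would an X_OK check succeed for the superuser? Directories always
 * pass; any other file needs at least one execute bit set in its mode.
 */
bool
demo_superuser_x_ok(bool is_dir, unsigned int mode)
{
	if (is_dir)
		return true;
	return (mode & DEMO_EXEC_BITS) != 0;
}
```

Under this rule a regular file with mode 0644 is reported as not executable even to root, while mode 0100 is enough to pass.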


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8 code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard-coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow to pull out a usb stick with ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@
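
The mismatch described above can be sketched in userland C. This is only an
illustration, not the KERN_VNODE layout: struct rec_sketch and the helper
names are hypothetical. It shows that a compiler-laid-out structure can be
larger than its fields written back to back, which is why producer and
consumer need to share one definition from a common header.

```c
#include <stddef.h>

/* Hypothetical record; NOT the structure used by the KERN_VNODE sysctl. */
struct rec_sketch {
	char tag;	/* 1 byte, normally followed by padding */
	long value;	/* must start on a boundary suitable for long */
};

/* Size if the fields were copied out back to back ("packed"). */
size_t
rec_packed_size(void)
{
	return sizeof(char) + sizeof(long);
}

/* Size the compiler actually gives the structure, padding included. */
size_t
rec_struct_size(void)
{
	return sizeof(struct rec_sketch);
}

/* Offset at which a reader using the struct expects to find 'value'. */
size_t
rec_value_offset(void)
{
	return offsetof(struct rec_sketch, value);
}
```

A reader that overlays struct rec_sketch on a packed byte stream looks for
value at rec_value_offset(), not at offset 1, so the two sides disagree as
soon as padding is present.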


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into an own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing lengths checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from our (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches from McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a file's owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.289 09-Jun-2019 beck

Add a temporary workaround to make removal of giant files better

mlarkin@ noticed we would freeze while removing enormous files because
of the amount of work done to invalidate buffers on unlink. This adds
a temporary workaround to ensure we give up the lock and yield while
doing this.

The longer term answer will be to move these buffers to another list
and not do the work here.

ok deraadt@


# 1.288 19-Apr-2019 visa

Add a subsystem lock for vfs_lockf.c. This enables calling lf_advlock()
and lf_purgelocks() without the kernel lock.

OK anton@ mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.287 02-Apr-2019 visa

Restrict which filesystems are available for swap. This rules out
obvious misconfigurations that cannot work.

OK mpi@ tedu@


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@
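
A minimal userland sketch of an atomically updated unsigned reference count.
It uses C11 <stdatomic.h> as a stand-in; the kernel has its own atomic
primitives, and struct fsconf_sketch and both helpers are hypothetical names,
not the vfsconf code.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for a structure carrying a reference count. */
struct fsconf_sketch {
	atomic_uint refcount;	/* unsigned, updated only with atomic ops */
};

/* Take a reference. */
void
fsconf_ref(struct fsconf_sketch *c)
{
	atomic_fetch_add(&c->refcount, 1);
}

/* Drop a reference and return the count that remains afterwards. */
unsigned int
fsconf_unref(struct fsconf_sketch *c)
{
	/* atomic_fetch_sub returns the value held before the subtraction. */
	return atomic_fetch_sub(&c->refcount, 1) - 1;
}
```

Because each update is a single atomic read-modify-write, two CPUs taking or
dropping references at the same time cannot lose an update, without needing
a lock around the counter.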


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules are no longer, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@
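
The hazard can be sketched with a toy singly linked list in userland C. All
names here are hypothetical and there is no real sleeping or locking; visit()
merely simulates another process unlinking the successor while the current
element is being handled. The loop reads the next pointer only after the body
has finished, which is what the plain (non-_SAFE) iteration does.

```c
#include <stddef.h>

struct node {
	int id;
	int visited;
	struct node *next;
};

/*
 * Hypothetical stand-in for work that may sleep: while node 1 is being
 * handled, its successor is unlinked "behind our back".
 */
static void
visit(struct node *n)
{
	n->visited = 1;
	if (n->id == 1 && n->next != NULL)
		n->next = n->next->next;
}

/*
 * Walk the list, fetching n->next only after visit() has returned.
 * A _SAFE-style loop would have cached the old successor before calling
 * visit() and would then continue with a node that is no longer on the
 * list (and, in the kernel, possibly already freed).
 */
int
walk(struct node *head)
{
	struct node *n;
	int count = 0;

	for (n = head; n != NULL; n = n->next) {
		visit(n);
		count++;
	}
	return count;
}
```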


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing kernel memory to userland.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
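
A small sketch of the idiom, under assumptions: enum vtype_sketch, the table
and type_name() are made up for illustration and are not the code touched by
this commit. Casting to unsigned int lets one comparison reject both values
past the end of the table and negative values, which wrap to large unsigned
numbers.

```c
#include <stddef.h>

/* Hypothetical enum and lookup table. */
enum vtype_sketch { T_NONE, T_REG, T_DIR };

static const char *const type_names[] = { "none", "reg", "dir" };

const char *
type_name(enum vtype_sketch t)
{
	/*
	 * One unsigned comparison covers both ends of the range: an
	 * out-of-range or negative value becomes a large unsigned number
	 * and fails the check.
	 */
	if ((unsigned int)t >= sizeof(type_names) / sizeof(type_names[0]))
		return "unknown";
	return type_names[t];
}
```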


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb
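
A userland sketch of a 1024-entry clone bitmap that lives outside the device
structure. Only the limit of 1024 comes from the message above; struct
dev_sketch, clone_alloc(), clone_free() and the choice to allocate the map
lazily on first use are assumptions made for illustration.

```c
#include <limits.h>
#include <stdlib.h>

#define NCLONES		1024	/* limit named in the commit message */
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define NWORDS		((NCLONES + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Hypothetical device node: it carries only a pointer, not the bitmap. */
struct dev_sketch {
	unsigned long *clone_map;	/* NULL until a clone is requested */
};

/* Return the lowest free clone number, or -1 if none is available. */
int
clone_alloc(struct dev_sketch *d)
{
	size_t i;

	if (d->clone_map == NULL) {
		d->clone_map = calloc(NWORDS, sizeof(*d->clone_map));
		if (d->clone_map == NULL)
			return -1;
	}
	for (i = 0; i < NCLONES; i++) {
		unsigned long bit = 1UL << (i % BITS_PER_WORD);

		if ((d->clone_map[i / BITS_PER_WORD] & bit) == 0) {
			d->clone_map[i / BITS_PER_WORD] |= bit;
			return (int)i;
		}
	}
	return -1;
}

/* Mark clone number n as free again; out-of-range values are ignored. */
void
clone_free(struct dev_sketch *d, int n)
{
	if (d->clone_map == NULL || n < 0 || n >= NCLONES)
		return;
	d->clone_map[n / BITS_PER_WORD] &= ~(1UL << (n % BITS_PER_WORD));
}
```

Nodes that never clone pay for one pointer instead of 128 bytes of bitmap,
which is the "avoid bloating all device nodes" point of the change.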


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
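
Why chained assignments are fragile can be shown with a two-field sketch.
struct attrs, NOVAL and both functions are hypothetical stand-ins, not the
vattr_null() code: in the chained form the value is first converted to the
narrow field's type, and that already truncated value is what the wide field
receives.

```c
#include <stdint.h>

/* Hypothetical stand-in for a "no value" marker such as -1. */
#define NOVAL	(-1)

struct attrs {
	uint64_t size;	/* wide unsigned field */
	uint16_t nlink;	/* narrow unsigned field */
};

/* Chained form: size receives the value of nlink after conversion. */
void
attrs_null_chained(struct attrs *a)
{
	a->size = a->nlink = NOVAL;	/* size ends up 0xffff, not all ones */
}

/* One statement per field: each field converts NOVAL independently. */
void
attrs_null_separate(struct attrs *a)
{
	a->nlink = NOVAL;
	a->size = NOVAL;
}
```

With the fields chained in the opposite order the result would happen to be
correct, which is exactly the ordering dependence the message describes.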


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in bytes, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and can die.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@
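
A minimal sketch of the range check described above, assuming a userland stand-in: demo_check_ngroups() and DEMO_NGROUPS_MAX are hypothetical names, and the real export code operates on the supplied credential structure rather than a bare integer.

```c
#include <errno.h>

#define DEMO_NGROUPS_MAX 16	/* hypothetical stand-in for NGROUPS_MAX */

/*
 * Reject a group count that is negative or larger than the fixed-size
 * group array could hold, before anything indexes or copies with it.
 */
int
demo_check_ngroups(int ngroups)
{
	if (ngroups < 0 || ngroups > DEMO_NGROUPS_MAX)
		return (EINVAL);
	return (0);
}
```

Validating the count up front means later copies of the group list never run with an out-of-range length.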


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free
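
A minimal userland sketch of the sizeof(*ptr) idiom, using the standard C malloc() rather than the kernel allocator; struct demo_node and demo_node_alloc() are hypothetical. The size expression follows the pointer's type, so it stays correct if the declaration of the pointer changes.

```c
#include <stdlib.h>

/* Hypothetical list node, for illustration only. */
struct demo_node {
	int			 id;
	struct demo_node	*next;
};

struct demo_node *
demo_node_alloc(int id)
{
	struct demo_node *n;

	/* sizeof(*n) tracks the type of n; no struct name to keep in sync. */
	n = malloc(sizeof(*n));
	if (n == NULL)
		return (NULL);
	n->id = id;
	n->next = NULL;
	return (n);
}
```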


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continuing performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@
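
A minimal sketch of a membership test that also consults the effective gid, under stated assumptions: struct demo_cred and demo_groupmember() are hypothetical and do not mirror the kernel's struct ucred layout.

```c
#include <sys/types.h>

#define DEMO_MAXGROUPS 16	/* hypothetical bound for this sketch */

/* Hypothetical credential: effective gid plus supplementary groups. */
struct demo_cred {
	gid_t	egid;
	int	ngroups;
	gid_t	groups[DEMO_MAXGROUPS];
};

/* Return 1 if gid is the effective gid or a supplementary group. */
int
demo_groupmember(gid_t gid, const struct demo_cred *cr)
{
	int i;

	if (cr->egid == gid)
		return (1);
	for (i = 0; i < cr->ngroups; i++)
		if (cr->groups[i] == gid)
			return (1);
	return (0);
}
```

Checking the effective gid first keeps the answer the same whether or not that gid also appears in the supplementary list.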


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell, never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@
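
A minimal sketch of the rule described above as it applies to the superuser, under stated assumptions: demo_root_may_exec() is a hypothetical helper, not the kernel's access check, and it takes the directory flag and permission bits as plain arguments.

```c
/* Any of the owner, group, or other execute bits (octal 0111). */
#define DEMO_EXEC_BITS 0111

/*
 * Decide whether an X_OK query by the superuser should succeed.
 * Directories are always searchable by root; other files need at
 * least one execute bit set.
 */
int
demo_root_may_exec(int is_dir, unsigned int perm)
{
	if (is_dir)
		return (1);
	return ((perm & DEMO_EXEC_BITS) != 0);
}
```

Under this rule a mode 0644 regular file is reported as not executable even for root, while a mode 0755 file or any directory is.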


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8 code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard-coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow pulling out a usb stick with ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into an own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing lengths checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from ours (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches for McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a file's owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.288 19-Apr-2019 visa

Add a subsystem lock for vfs_lockf.c. This enables calling lf_advlock()
and lf_purgelocks() without the kernel lock.

OK anton@ mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.287 02-Apr-2019 visa

Restrict which filesystems are available for swap. This rules out
obvious misconfigurations that cannot work.

OK mpi@ tedu@


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa
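
The behaviour described in revision 1.286 can be sketched in user space as follows. This is only an illustration of the idea, not the kernel code: the struct, the flag name VSKETCH_BIOERROR and both functions are hypothetical, and whether the real flag is ever cleared is not stated in the commit message, so the sketch never clears it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag: a past write on this vnode has failed. */
#define VSKETCH_BIOERROR	0x01

/* Hypothetical, heavily reduced stand-in for struct vnode. */
struct vnode_sketch {
	unsigned int flags;
	int dirty_bufs;		/* buffers still waiting to be written */
};

/*
 * Called when a write completes. A failed buffer is thrown away either
 * way, so the only lasting trace of the failure is the flag.
 */
void
write_done_sketch(struct vnode_sketch *vp, int error)
{
	assert(vp != NULL && vp->dirty_bufs > 0);
	vp->dirty_bufs--;
	if (error != 0)
		vp->flags |= VSKETCH_BIOERROR;
}

/* fsync reports the remembered failure instead of claiming success. */
int
fsync_sketch(const struct vnode_sketch *vp)
{
	assert(vp != NULL);
	if (vp->flags & VSKETCH_BIOERROR)
		return EIO;
	return 0;
}
```

Without the flag, a vnode whose failed buffer had already been discarded would look clean and fsync_sketch() would return 0.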


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules are no longer, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@
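
The hazard described in revision 1.272 can be illustrated with a small user-space sketch. Everything here is hypothetical (a hand-rolled singly linked list instead of the kernel's TAILQ, and a flag instead of a real free), but the shape is the same: a walk that caches the next pointer before a sleeping loop body can end up on an entry that was removed during the sleep, while a walk that re-reads the pointer afterwards cannot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a mount point on the mount list. */
struct mount_sketch {
	int id;
	int unmounted;		/* set once the entry has been unlinked */
	struct mount_sketch *next;
};

/* Link three entries 1 -> 2 -> 3 and clear their flags. */
void
build_list(struct mount_sketch m[3])
{
	int i;

	for (i = 0; i < 3; i++) {
		m[i].id = i + 1;
		m[i].unmounted = 0;
		m[i].next = (i < 2) ? &m[i + 1] : NULL;
	}
}

/*
 * Stand-in for a loop body that sleeps: while entry 1 is processed,
 * "another process" unmounts the entry after it. A real kernel would
 * free the victim; here it is only flagged so the sketch stays defined.
 */
void
sleep_and_lose_next(struct mount_sketch *mp)
{
	struct mount_sketch *victim = mp->next;

	if (mp->id == 1 && victim != NULL) {
		mp->next = victim->next;
		victim->unmounted = 1;
	}
}

/* _SAFE-style walk: the next pointer is cached before the body runs. */
int
walk_cached_next(struct mount_sketch *head)
{
	struct mount_sketch *mp, *nmp;
	int stale = 0;

	assert(head != NULL);
	for (mp = head; mp != NULL; mp = nmp) {
		nmp = mp->next;
		if (mp->unmounted) {
			stale++;	/* visited an entry that is gone */
			continue;
		}
		sleep_and_lose_next(mp);
	}
	return stale;
}

/* Plain walk: the next pointer is re-read after the body returns. */
int
walk_reread_next(struct mount_sketch *head)
{
	struct mount_sketch *mp;
	int stale = 0;

	assert(head != NULL);
	for (mp = head; mp != NULL; mp = mp->next) {
		if (mp->unmounted) {
			stale++;
			continue;
		}
		sleep_and_lose_next(mp);
	}
	return stale;
}
```

On a freshly built list, walk_cached_next() reports one stale visit and walk_reread_next() reports none. The plain walk is only acceptable because, as the commit notes, the loop body does not itself remove the current entry.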


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
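
The fragility of chained assignments mentioned in revision 1.236 can be shown with a minimal sketch. The struct and function names are hypothetical and only two field widths are used; the point is that the value passed leftwards in a chain is the already converted value of the inner assignment, so a narrower unsigned field in the middle truncates what the wider field receives.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical two-field stand-in for a struct with mixed-width members. */
struct attr_sketch {
	uint64_t wide;
	uint16_t narrow;
};

/*
 * Unsafe order: -1 is first converted to uint16_t (0xFFFF), and that
 * converted value is what gets assigned to the 64-bit field.
 */
void
null_chained_sketch(struct attr_sketch *a)
{
	assert(a != NULL);
	a->wide = a->narrow = -1;
}

/* Separate statements: each field converts -1 on its own. */
void
null_separate_sketch(struct attr_sketch *a)
{
	assert(a != NULL);
	a->wide = -1;
	a->narrow = -1;
}
```

After null_chained_sketch() the wide field holds 0xFFFF rather than all ones; after null_separate_sketch() both fields hold their own all-ones value.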


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in bytes, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and dies.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@
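
A minimal sketch of the range check described in revision 1.229, assuming a reduced credential struct. Only the field name cr_ngroups and the EINVAL result come from the commit message; the struct, the function name and the limit SKETCH_NGROUPS_MAX are hypothetical stand-ins.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical limit standing in for NGROUPS_MAX. */
#define SKETCH_NGROUPS_MAX	16

/* Hypothetical, reduced view of the credentials supplied by userland. */
struct export_cred_sketch {
	int cr_ngroups;
};

/*
 * Reject a group count that is negative or too large before it is
 * used as a loop bound or copy length.
 */
int
check_export_cred_sketch(const struct export_cred_sketch *cr)
{
	assert(cr != NULL);
	if (cr->cr_ngroups < 0 || cr->cr_ngroups > SKETCH_NGROUPS_MAX)
		return EINVAL;
	return 0;
}
```

Both bounds matter because the count is a signed value coming from userland: a negative count is as unusable as an oversized one.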


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continuing performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@
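
As a rough illustration of the groupmember() change above, the check can be
sketched in userland C as follows. This is a minimal sketch, not the kernel
code: struct cred_sketch and groupmember_sketch() are hypothetical names, and
the fixed-size group array is an assumption made to keep the example
self-contained.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, simplified stand-in for a credential structure. */
struct cred_sketch {
	gid_t	egid;		/* effective group id */
	size_t	ngroups;	/* number of valid entries in groups[] */
	gid_t	groups[16];	/* supplementary group list */
};

/*
 * Return 1 if gid matches the effective gid or any supplementary
 * group, 0 otherwise. Checking egid first keeps the result consistent
 * when the effective gid is not also listed as a supplementary group.
 */
int
groupmember_sketch(gid_t gid, const struct cred_sketch *cred)
{
	size_t i;

	if (cred->egid == gid)
		return 1;
	for (i = 0; i < cred->ngroups; i++) {
		if (cred->groups[i] == gid)
			return 1;
	}
	return 0;
}
```

With only the supplementary list consulted, a process whose effective gid is
absent from that list would fail the membership test for its own group.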


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell, never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.
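
The "simple struct full of function pointers" approach described above can be
sketched roughly as follows. This is a minimal userland sketch under stated
assumptions: struct vops_sketch, struct vnode_sketch, the myfs_* functions and
vop_open_sketch() are all hypothetical names, and only two operations are
shown.

```c
#include <errno.h>
#include <stddef.h>

struct vnode_sketch;

/* Hypothetical operations table: one function pointer per operation. */
struct vops_sketch {
	int	(*vop_open)(struct vnode_sketch *);
	int	(*vop_close)(struct vnode_sketch *);
};

/* Hypothetical vnode: it only carries a pointer to its operations. */
struct vnode_sketch {
	const struct vops_sketch *v_op;
	int	v_opencount;
};

/* A made-up filesystem supplies its own implementations... */
static int
myfs_open(struct vnode_sketch *vp)
{
	vp->v_opencount++;
	return 0;
}

static int
myfs_close(struct vnode_sketch *vp)
{
	vp->v_opencount--;
	return 0;
}

/* ...and sets the pointers directly in a table of its own. */
const struct vops_sketch myfs_vops = {
	.vop_open = myfs_open,
	.vop_close = myfs_close,
};

/* Dispatch wrapper: call the operation, or fail if it is not set. */
int
vop_open_sketch(struct vnode_sketch *vp)
{
	if (vp->v_op->vop_open == NULL)
		return EOPNOTSUPP;
	return vp->v_op->vop_open(vp);
}
```

Dispatch is then a single indirect call through the vnode's table, with no
separate operation-vector setup step.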


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in filesystems ported from freebsd. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.
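
The conversion above folds a separate zeroing step into the allocation call.
A userland analogue, assuming a hypothetical pool_get_sketch() wrapper around
malloc(3) and a made-up PR_ZERO_SKETCH flag (neither is the kernel pool API),
might look like this:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag: ask the allocator to return zeroed memory. */
#define PR_ZERO_SKETCH	0x01

/*
 * Allocate size bytes. If PR_ZERO_SKETCH is set in flags, zero the
 * memory before returning it, so callers no longer need a separate
 * zeroing call after every allocation.
 */
void *
pool_get_sketch(size_t size, int flags)
{
	void *p;

	p = malloc(size);
	if (p != NULL && (flags & PR_ZERO_SKETCH))
		memset(p, 0, size);
	return p;
}
```

Each call site shrinks from an allocation followed by a zeroing call to a
single call with an extra flag, which is where the size saving comes from.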


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@
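
The X_OK rule for the superuser described above can be sketched as a small
predicate. This is an illustrative sketch only: root_may_execute_sketch() is a
hypothetical helper, not the kernel's access check, and it models nothing
beyond the rule stated in the commit message.

```c
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Return 1 if the superuser would be granted execute (X_OK) access
 * under the rule above, 0 otherwise.
 *
 * is_directory: non-zero if the file is a directory.
 * mode:         the file's permission bits.
 */
int
root_may_execute_sketch(int is_directory, mode_t mode)
{
	/* Directories are not subject to the execute-bit requirement. */
	if (is_directory)
		return 1;
	/* Non-directories need at least one execute bit set. */
	return (mode & (S_IXUSR | S_IXGRP | S_IXOTH)) != 0;
}
```

Under this rule a regular file with mode 0644 fails the check even for root,
while a file with any execute bit set passes.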


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8 code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard-coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow pulling out a usb stick with an ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers forever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@
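
The mismatch described above arises because a hand-packed byte stream and a C
structure generally differ in size and field offsets once the compiler inserts
padding. A minimal sketch of the "one shared definition" fix follows; struct
e_record_sketch and the helper functions are hypothetical names chosen for
illustration and are not the structure used by the kernel or pstat(8).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record, imagined as living in one shared header. */
struct e_record_sketch {
	char	tag;	/* 1 byte, typically followed by padding */
	long	value;	/* aligned according to the platform ABI */
};

/* Size a hand-packed encoding would use: the fields with no padding. */
size_t
packed_size_sketch(void)
{
	return sizeof(char) + sizeof(long);
}

/* Producer: export the whole structure, padding included. */
size_t
export_record_sketch(unsigned char *buf, size_t buflen,
    const struct e_record_sketch *rec)
{
	if (buflen < sizeof(*rec))
		return 0;
	memcpy(buf, rec, sizeof(*rec));
	return sizeof(*rec);
}

/* Consumer: accept only input that has the shared structure's size. */
int
import_record_sketch(const unsigned char *buf, size_t len,
    struct e_record_sketch *rec)
{
	if (len != sizeof(*rec))
		return -1;
	memcpy(rec, buf, sizeof(*rec));
	return 0;
}
```

Because both sides copy sizeof(struct e_record_sketch) bytes, any padding the
ABI adds is accounted for identically by producer and consumer.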


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into its own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing length checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Addition v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from our (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches for McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a file's owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.286 17-Feb-2019 tedu

if a write fails, we mark the buffer invalid and throw it away. this can
lead to lost errors, where a later fsync will return success. to fix this,
set a flag on the vnode indicating a past error has occurred, and return
an error for future fsync calls.
ok bluhm deraadt visa


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules are no longer, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@
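
The hazard described in this entry can be reproduced in user space. The
following is a minimal sketch, not kernel code: struct demo_mount, the two
DEMO_FOREACH macros and the demo_* functions are hypothetical stand-ins for
the mount list and the queue macros, and unlinking a node stands in for an
unmount that happens while the loop body sleeps.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a mount point on a singly linked list. */
struct demo_mount {
	struct demo_mount *next;
	int visited;
};

/* Plain loop: the next pointer is read after the body has run. */
#define DEMO_FOREACH(var, head) \
	for ((var) = (head); (var) != NULL; (var) = (var)->next)

/* _SAFE-style loop: the next pointer is cached before the body runs. */
#define DEMO_FOREACH_SAFE(var, head, tmp) \
	for ((var) = (head); \
	    (var) != NULL && ((tmp) = (var)->next, 1); \
	    (var) = (tmp))

/* Unlink the node after mp, as if it were unmounted while we slept. */
static void
demo_unlink_next(struct demo_mount *mp)
{
	assert(mp != NULL);
	if (mp->next != NULL)
		mp->next = mp->next->next;
}

/* Build the list a -> b -> c with cleared visit counters. */
static void
demo_build(struct demo_mount m[3])
{
	m[0].next = &m[1];
	m[1].next = &m[2];
	m[2].next = NULL;
	m[0].visited = 0;
	m[1].visited = 0;
	m[2].visited = 0;
}

/* Plain walk: returns how often the unlinked node b was visited. */
int
demo_walk_plain(void)
{
	struct demo_mount m[3], *mp;

	demo_build(m);
	DEMO_FOREACH(mp, &m[0]) {
		mp->visited++;
		if (mp == &m[0])
			demo_unlink_next(mp);	/* b goes away during the body */
	}
	return m[1].visited;
}

/* _SAFE-style walk: the cached pointer still leads to the unlinked b. */
int
demo_walk_safe(void)
{
	struct demo_mount m[3], *mp, *nmp;

	demo_build(m);
	DEMO_FOREACH_SAFE(mp, &m[0], nmp) {
		mp->visited++;
		if (mp == &m[0])
			demo_unlink_next(mp);
	}
	return m[1].visited;
}
```

In this sketch the unlinked node stays valid memory, so the stale visit is
merely wrong. In the kernel the node could already have been freed, which is
why the cached pointer is the dangerous one.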


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
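
A user-space sketch of this bounds-check pattern follows. The enum, the name
table and demo_vtype_name() are hypothetical and are not taken from
vfs_subr.c. The point shown is that comparing through unsigned int lets a
single test reject both values past the end of the table and negative values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical enum and name table. */
enum demo_vtype { DEMO_VNON, DEMO_VREG, DEMO_VDIR };

static const char *const demo_typename[] = { "VNON", "VREG", "VDIR" };

#define DEMO_NITEMS(a)	(sizeof(a) / sizeof((a)[0]))

/* Return the name for type, or "unknown" if it is out of range. */
const char *
demo_vtype_name(enum demo_vtype type)
{
	/* The table must have one entry per enumerator. */
	assert(DEMO_NITEMS(demo_typename) == (size_t)DEMO_VDIR + 1);

	/* A negative value turns into a very large unsigned one here. */
	if ((unsigned int)type >= DEMO_NITEMS(demo_typename))
		return "unknown";
	return demo_typename[type];
}
```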


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
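
The fragility of chained assignments can be seen in a small user-space
sketch. struct demo_attr, DEMO_NOVAL and both functions are hypothetical and
only loosely modelled on vattr_null(). Unlike the kernel statement, which had
its types in a safe order, this sketch deliberately uses an unsafe order: the
wide signed field receives the value that was already truncated for the
narrow unsigned one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute record with fields of different widths. */
struct demo_attr {
	long long size;		/* wide, signed */
	uint16_t mode;		/* narrow, unsigned */
};

#define DEMO_NOVAL	(-1)	/* assumed "no value" marker */

/* Chained: size is assigned the value stored in mode, not -1. */
void
demo_null_chained(struct demo_attr *ap)
{
	assert(ap != NULL);
	ap->size = ap->mode = DEMO_NOVAL;
}

/* Separate statements: each field is converted from -1 on its own. */
void
demo_null_separate(struct demo_attr *ap)
{
	assert(ap != NULL);
	ap->mode = DEMO_NOVAL;
	ap->size = DEMO_NOVAL;
}
```

With the chained form size ends up as 65535, while the separate statements
leave it at -1. Swapping the two lvalues in the chain would hide the problem,
which is what makes the construct depend on ordering.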


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in bytes, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and can die.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@
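
The check can be sketched in user space as follows. struct demo_cred,
DEMO_NGROUPS_MAX and demo_check_cred() are hypothetical, and the limit of 16
is an assumption made for the sketch rather than the kernel's value. The idea
is only that a caller-supplied count is validated before anything uses it to
index or copy a fixed-size group array.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_NGROUPS_MAX	16	/* assumed limit for the sketch */

/* Hypothetical credential record supplied by an untrusted caller. */
struct demo_cred {
	int cr_ngroups;
	int cr_groups[DEMO_NGROUPS_MAX];
};

/* Return EINVAL for a group count that cannot describe cr_groups[]. */
int
demo_check_cred(const struct demo_cred *cr)
{
	assert(cr != NULL);
	if (cr->cr_ngroups < 0 || cr->cr_ngroups > DEMO_NGROUPS_MAX)
		return EINVAL;
	return 0;
}
```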


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continuing performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__)))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisation issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8; code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard-coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow pulling out a usb stick with an ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks of intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into an own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing lengths checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from ours (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches from McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a file's owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.285 21-Jan-2019 anton

Introduce a dedicated entry point data structure for file locks. This new data
structure allows for better tracking of pending lock operations which is
essential in order to prevent a use-after-free once the underlying vnode is
gone.

Inspired by the lockf implementation in FreeBSD.

ok visa@

Reported-by: syzbot+d5540a236382f50f1dac@syzkaller.appspotmail.com


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules are no longer, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@
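
The race can be sketched in userland with a hand-rolled singly linked
list instead of the <sys/queue.h> macros. Everything here (struct mnt,
unmount(), the walk functions) is hypothetical and only models the
pattern: a _SAFE-style loop caches the next pointer before the body
runs, so it still visits an entry that was removed while the body
"slept". In the kernel that entry may already have been freed; here it
is only flagged, so the sketch stays well defined.

```c
#include <stddef.h>

/* Hypothetical stand-in for a mount point on a singly linked list. */
struct mnt {
	int		 id;
	int		 live;		/* cleared when "unmounted" */
	struct mnt	*next;
};

/* Unlink victim from the list; models another process unmounting it. */
static void
unmount(struct mnt **head, struct mnt *victim)
{
	struct mnt **pp;

	for (pp = head; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == victim) {
			*pp = victim->next;
			victim->live = 0;	/* next pointer is left as it was */
			return;
		}
	}
}

/*
 * Models a loop body that sleeps: while the first entry is being
 * visited, the entry after it is unmounted behind the walker's back.
 */
static void
body(struct mnt **head, struct mnt *mp)
{
	if (mp->id == 0 && mp->next != NULL)
		unmount(head, mp->next);
}

/* _SAFE-style walk: next is cached before the body runs. */
static int
walk_cached_next(struct mnt **head)
{
	struct mnt *mp, *nmp;
	int stale = 0;

	for (mp = *head; mp != NULL; mp = nmp) {
		nmp = mp->next;
		if (!mp->live)
			stale++;	/* visited an entry that is gone */
		body(head, mp);
	}
	return stale;
}

/* Plain walk: next is read only after the body has finished. */
static int
walk_reload_next(struct mnt **head)
{
	struct mnt *mp;
	int stale = 0;

	for (mp = *head; mp != NULL; mp = mp->next) {
		if (!mp->live)
			stale++;
		body(head, mp);
	}
	return stale;
}
```

The plain walk is only correct because the current entry stays on the
list for the whole iteration, which is what vfs_busy() provides in the
commit above.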


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
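
A small sketch of the technique. The enum and table are hypothetical;
the expression actually changed by this commit is not reproduced here.

```c
#include <stddef.h>

/* Hypothetical enum; it stands in for the kernel enum being checked. */
enum kind { KIND_NONE, KIND_FILE, KIND_DIR };

static const char *const kind_names[] = { "none", "file", "dir" };

#define NKINDS	(sizeof(kind_names) / sizeof(kind_names[0]))

/*
 * Return the name for k, or NULL if k is out of range.
 *
 * A compiler that picks an unsigned underlying type for the enum may
 * warn that a "k >= 0" test is always true. Casting to unsigned folds
 * both bounds into one comparison: a negative value becomes a very
 * large unsigned one and fails the test.
 */
static const char *
kind_name(enum kind k)
{
	if ((unsigned int)k >= NKINDS)
		return NULL;
	return kind_names[k];
}
```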


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
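
Why the order matters can be shown with two fields. This is a sketch
with a hypothetical struct, not the vattr fields themselves: the value
of an assignment expression is the value stored in its left operand,
after conversion to that operand's type, and that converted value is
what the next assignment to the left receives.

```c
#include <stdint.h>

#define NOVAL	(-1)	/* "no value" marker, in the spirit of VNOVAL */

/* Hypothetical attribute record with fields of different widths. */
struct attrs {
	int64_t		size;	/* wide, signed */
	uint16_t	mode;	/* narrow, unsigned */
};

/*
 * Fragile: the value of "ap->mode = NOVAL" is the converted value
 * stored in mode (65535), and that is what gets assigned to size.
 */
static void
attrs_null_chained(struct attrs *ap)
{
	ap->size = ap->mode = NOVAL;
}

/* Robust: each field is converted from NOVAL independently. */
static void
attrs_null_separate(struct attrs *ap)
{
	ap->size = NOVAL;
	ap->mode = NOVAL;
}
```

With the chained form, size ends up as 65535 instead of -1. Swapping
the two fields in the chain happens to give the intended result, which
is the "safe order" the message refers to.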


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in byte, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and can die.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@
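
The check amounts to validating a caller-supplied count before it is
used as an array bound. A minimal sketch, assuming a hypothetical
credential struct and a made-up limit (EX_NGROUPS_MAX) in place of the
kernel's types and NGROUPS_MAX:

```c
#include <errno.h>

#define EX_NGROUPS_MAX	16	/* assumed limit for this sketch only */

/* Hypothetical credential record as received from an untrusted caller. */
struct ex_cred {
	unsigned int	uid;
	int		ngroups;			/* caller-supplied count */
	unsigned int	groups[EX_NGROUPS_MAX];
};

/*
 * Return 0 if the group count can safely be used to index groups[],
 * or EINVAL if it is negative or larger than the array.
 */
static int
ex_cred_check(const struct ex_cred *cr)
{
	if (cr->ngroups < 0 || cr->ngroups > EX_NGROUPS_MAX)
		return EINVAL;
	return 0;
}
```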


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continuing performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@
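
A sketch of the resulting check, assuming a hypothetical credential
struct (ex_ucred) and array size; it is not the kernel's struct ucred
or its groupmember() signature.

```c
#define EX_MAXGROUPS	16	/* assumed array size for this sketch */

/* Hypothetical credential record. */
struct ex_ucred {
	unsigned int	egid;			/* effective group id */
	int		ngroups;		/* valid entries in groups[] */
	unsigned int	groups[EX_MAXGROUPS];	/* supplementary groups */
};

/* Return 1 if gid is the effective gid or a supplementary group, else 0. */
static int
ex_groupmember(unsigned int gid, const struct ex_ucred *cr)
{
	int i;

	/* The effective gid counts even if it is not listed in groups[]. */
	if (cr->egid == gid)
		return 1;
	for (i = 0; i < cr->ngroups; i++) {
		if (cr->groups[i] == gid)
			return 1;
	}
	return 0;
}
```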


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell, never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

Back out the "buf_acquire() does the bremfree()" change, made because
all callers were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisation issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8 code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard-coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow to pull out a usb stick with ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint was being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock noone
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, noone uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into an own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing lengths checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from our (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches from McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a files owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.284 23-Dec-2018 natano

Rectify some issues with the noperm mount flag; the root vnode was not
protected properly and files without any x bit set were accidentally considered
executable when checked with access(2).

Issues found and reported by deraadt, halex, reyk, tb
ok deraadt


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@
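
A minimal userland sketch of the idea behind this change, using C11
<stdatomic.h> rather than the kernel's own atomic primitives. The struct and
function names (fsconf_demo, fsconf_ref, fsconf_unref) are hypothetical and
are not taken from vfs_subr.c.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for a filesystem configuration entry. */
struct fsconf_demo {
	atomic_uint refcount;	/* unsigned, only ever updated atomically */
};

/* Take a reference. */
void
fsconf_ref(struct fsconf_demo *c)
{
	atomic_fetch_add(&c->refcount, 1);
}

/* Drop a reference and return the count that remains. */
unsigned int
fsconf_unref(struct fsconf_demo *c)
{
	/* atomic_fetch_sub returns the value held before the subtraction. */
	return atomic_fetch_sub(&c->refcount, 1) - 1;
}
```

An unsigned counter combined with atomic read-modify-write operations means
concurrent mount and unmount paths cannot lose an update, which a plain
`refcount++` could.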


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules no longer exist, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting files systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@
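
The hazard described here can be sketched in userland with a hand-rolled
singly linked list. This is an illustration only: struct demo_mount,
demo_sleep() and both walk functions are hypothetical names, the list is not
a sys/queue.h TAILQ, and the "sleep" merely unlinks the following entry
instead of freeing it, so the demonstration stays well defined.

```c
#include <stddef.h>

/* Hypothetical mount list entry. */
struct demo_mount {
	int id;
	struct demo_mount *next;
};

/* Stands in for another process unmounting the entry after mp while we sleep. */
static void
demo_sleep(struct demo_mount *mp)
{
	if (mp->next != NULL)
		mp->next = mp->next->next;
}

/* Plain FOREACH style: the next pointer is read after the loop body. */
int
demo_walk_plain(struct demo_mount *head, int sleep_id)
{
	struct demo_mount *mp;
	int visited = 0;

	for (mp = head; mp != NULL; mp = mp->next) {
		if (mp->id == sleep_id)
			demo_sleep(mp);
		visited++;
	}
	return visited;
}

/*
 * _SAFE style: the next pointer is saved before the loop body.
 * Returns 1 if the saved pointer went stale during the body, else 0.
 */
int
demo_walk_saved(struct demo_mount *head, int sleep_id)
{
	struct demo_mount *mp, *nmp;

	for (mp = head; mp != NULL; mp = nmp) {
		nmp = mp->next;		/* saved before the body runs */
		if (mp->id == sleep_id)
			demo_sleep(mp);
		if (nmp != mp->next)
			return 1;	/* nmp is no longer on the list */
	}
	return 0;
}
```

The saved pointer only protects against the loop body removing the current
entry. It gives no protection when something else removes the next entry
while the body sleeps, which is the case the commit message is concerned with.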


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
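
A small sketch of the bounds-check idiom named above. The commit message does
not show the affected code, so the enum, the table and demo_type_name() are
hypothetical; only the cast-to-unsigned comparison is the point.

```c
#include <stddef.h>

/* Hypothetical enum with a trailing count member. */
enum demo_type { DEMO_NON, DEMO_REG, DEMO_DIR, DEMO_NTYPES };

static const char *const demo_names[DEMO_NTYPES] = { "non", "reg", "dir" };

/* Return the name for t, or NULL if t is outside the table. */
const char *
demo_type_name(enum demo_type t)
{
	/*
	 * After the cast, a single unsigned comparison rejects both
	 * "negative" and too-large values, whatever integer type the
	 * compiler picked for the enum.
	 */
	if ((unsigned int)t >= (unsigned int)DEMO_NTYPES)
		return NULL;
	return demo_names[t];
}
```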


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in bytes, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and can die.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continued performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8 code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stata
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace an hard coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow to pull out an usb stick with ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks of intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into an own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing length checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from ours (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches for McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a file's owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.283 07-Dec-2018 mpi

free(9) sizes for netcred.

ok visa@


Revision tags: OPENBSD_6_4_BASE
# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@
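
A minimal userland sketch of the idea, assuming C11 <stdatomic.h> as a
stand-in for the kernel's atomic primitives. The names vfsconf_demo,
vfc_demo_ref and vfc_demo_unref are hypothetical and do not appear in
the tree; the point is only that an unsigned counter updated with atomic
read-modify-write operations needs no extra lock.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for a filesystem config entry with a refcount. */
struct vfsconf_demo {
	atomic_uint refcount;
};

/* Take a reference; the increment is a single atomic operation. */
static void
vfc_demo_ref(struct vfsconf_demo *vfc)
{
	atomic_fetch_add(&vfc->refcount, 1);
}

/* Drop a reference and return the count that remains afterwards. */
static unsigned int
vfc_demo_unref(struct vfsconf_demo *vfc)
{
	/* atomic_fetch_sub returns the value held before the subtraction. */
	return atomic_fetch_sub(&vfc->refcount, 1) - 1;
}
```

A caller would initialise the counter with atomic_init(&vfc->refcount, 0)
before the first vfc_demo_ref().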


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules no longer exist, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAVE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@
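
The hazard described above can be sketched in userland with <sys/queue.h>.
This is an illustration under stated assumptions, not the kernel code:
struct mnt, sleep_and_unmount() and both walk functions are hypothetical
names, and the "other process" is simulated by a helper that removes an
element while the loop body runs. Elements are only flagged, never freed,
so the demonstration itself stays well defined.

```c
#include <sys/queue.h>
#include <stddef.h>

struct mnt {
	int id;
	int unmounted;			/* set once removed from the list */
	TAILQ_ENTRY(mnt) link;
};
TAILQ_HEAD(mntlist, mnt);

/* Stand-in for a sleeping call during which someone unmounts victim. */
static void
sleep_and_unmount(struct mntlist *l, struct mnt *victim)
{
	if (victim != NULL && !victim->unmounted) {
		TAILQ_REMOVE(l, victim, link);
		victim->unmounted = 1;
	}
}

/* _SAFE-style walk: the next pointer is cached before the body sleeps.
 * Returns 1 if the loop ends up visiting an already removed element. */
static int
walk_cached_next(struct mntlist *l, struct mnt *victim)
{
	struct mnt *mp, *nmp;

	for (mp = TAILQ_FIRST(l); mp != NULL; mp = nmp) {
		if (mp->unmounted)
			return 1;	/* stale element reached */
		nmp = TAILQ_NEXT(mp, link);
		sleep_and_unmount(l, victim);
	}
	return 0;
}

/* Plain walk: the next pointer is read only after the body returns. */
static int
walk_fresh_next(struct mntlist *l, struct mnt *victim)
{
	struct mnt *mp;

	for (mp = TAILQ_FIRST(l); mp != NULL; mp = TAILQ_NEXT(mp, link)) {
		if (mp->unmounted)
			return 1;
		sleep_and_unmount(l, victim);
	}
	return 0;
}
```

With a list a, b, c and b as the victim, the cached walk steps onto b
after it has left the list, while the plain walk goes from a straight
to c. In the kernel the stale element could already have been freed.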


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
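
A small sketch of the pattern, assuming a hypothetical enum vtype_demo
and name table that are not taken from the tree. When the compiler picks
an unsigned underlying type for the enum, a separate "value >= 0" test
is trivially true and draws a warning; one unsigned comparison against
the upper bound rejects both negative and too-large values.

```c
/* Hypothetical enum and lookup table, for illustration only. */
enum vtype_demo { VD_NON, VD_REG, VD_DIR, VD_MAX };

static const char *const vtype_demo_names[VD_MAX] = {
	"non", "reg", "dir"
};

/* A negative value converts to a very large unsigned one, so a single
 * unsigned comparison covers both ends of the valid range. */
static int
vtype_demo_valid(enum vtype_demo t)
{
	return (unsigned int)t < VD_MAX;
}

/* Index the table only after the bounds check has passed. */
static const char *
vtype_demo_name(enum vtype_demo t)
{
	return vtype_demo_valid(t) ? vtype_demo_names[t] : "unknown";
}
```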


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
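
Why chained assignments are fragile can be shown with a two-field
sketch. struct attrs_demo and both functions are hypothetical and much
smaller than the real struct vattr; the assumption is only that -1 is
used as the "no value" marker. The value of an assignment expression is
the value stored in its left operand, so a narrow unsigned field in the
middle of a chain changes what the wider field to its left receives.

```c
#include <stdint.h>

/* Hypothetical two-field attribute record. */
struct attrs_demo {
	int64_t size;		/* wide, signed */
	uint16_t mode;		/* narrow, unsigned */
};

/* Chained form: -1 is first stored into mode as 65535, and it is that
 * 65535 which is then assigned to size. */
static void
attrs_demo_null_chained(struct attrs_demo *a)
{
	a->size = a->mode = -1;
}

/* Separate statements: each field gets -1 converted to its own type. */
static void
attrs_demo_null_separate(struct attrs_demo *a)
{
	a->size = -1;
	a->mode = (uint16_t)-1;
}
```

Swapping the order of the two fields in the chained statement would
happen to give the intended result, which is exactly the dependence on
ordering that the commit message calls fragile.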


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in bytes, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and can die.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free
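
A userland sketch of the idiom, using calloc(3) in place of the kernel
allocator. struct netcred_demo and netcred_demo_alloc() are hypothetical
names. Writing sizeof(*np) ties the allocation size to the declared type
of the pointer, so the size stays right if that type is later changed.

```c
#include <stdlib.h>

/* Hypothetical structure, for illustration only. */
struct netcred_demo {
	int flags;
	unsigned int ngroups;
};

/* Allocate one zeroed object; the size is derived from the pointer. */
static struct netcred_demo *
netcred_demo_alloc(void)
{
	struct netcred_demo *np;

	np = calloc(1, sizeof(*np));
	return np;		/* NULL on allocation failure */
}
```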


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath code modifies
the wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@
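
A minimal userland sketch of why the _SAFE iteration macros matter when
entries are removed during a walk. It is not the kernel code: struct
demo_mount, demo_mntlist and demo_prune_even() are hypothetical names, and
the fallback macro is only there because some libcs ship a <sys/queue.h>
without the _SAFE variants.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Some <sys/queue.h> versions lack the _SAFE variants; supply a fallback. */
#ifndef TAILQ_FOREACH_SAFE
#define TAILQ_FOREACH_SAFE(var, head, field, tvar)			\
	for ((var) = TAILQ_FIRST(head);					\
	    (var) != NULL && ((tvar) = TAILQ_NEXT(var, field), 1);	\
	    (var) = (tvar))
#endif

/* Hypothetical stand-in for struct mount (illustration only). */
struct demo_mount {
	int id;
	TAILQ_ENTRY(demo_mount) link;
};
TAILQ_HEAD(demo_mntlist, demo_mount);

/*
 * Remove and free every entry with an even id and return the number of
 * entries left. The next pointer is saved in nmp before the body runs,
 * so freeing mp inside the loop does not break the traversal.
 */
int
demo_prune_even(struct demo_mntlist *list)
{
	struct demo_mount *mp, *nmp;
	int left = 0;

	assert(list != NULL);
	TAILQ_FOREACH_SAFE(mp, list, link, nmp) {
		if (mp->id % 2 == 0) {
			TAILQ_REMOVE(list, mp, link);
			free(mp);
		} else
			left++;
	}
	return left;
}
```

With plain TAILQ_FOREACH the loop would read the link field of an entry
that has already been freed, which is the class of bug the paranoia above
guards against.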


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continuing performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@
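
A sketch of the check described above, under stated assumptions: struct
demo_cred is a hypothetical, cut-down credential structure and
demo_groupmember() is not the kernel's groupmember(). It only illustrates
testing the effective gid before the supplementary group list.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

#define DEMO_NGROUPS	16

/* Hypothetical, cut-down credential structure (illustration only). */
struct demo_cred {
	gid_t	cr_gid;			/* effective group id */
	size_t	cr_ngroups;		/* number of supplementary groups */
	gid_t	cr_groups[DEMO_NGROUPS];	/* supplementary groups */
};

/* Return 1 if gid is the effective gid or a supplementary group, else 0. */
int
demo_groupmember(gid_t gid, const struct demo_cred *cred)
{
	size_t i;

	assert(cred != NULL && cred->cr_ngroups <= DEMO_NGROUPS);

	/* The effective gid counts even when it is not in cr_groups[]. */
	if (cred->cr_gid == gid)
		return 1;
	for (i = 0; i < cred->cr_ngroups; i++)
		if (cred->cr_groups[i] == gid)
			return 1;
	return 0;
}
```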


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.
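
A minimal sketch of the "simple struct full of function pointers" idea,
assuming hypothetical names throughout (demo_vops, demo_vnode,
DEMO_VOP_OPEN and so on are not the kernel's definitions). Returning
EOPNOTSUPP for an unset slot is likewise an assumption made for the
example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct demo_vnode;

/* Hypothetical operations table: one function pointer per operation. */
struct demo_vops {
	int (*vop_open)(struct demo_vnode *);
	int (*vop_close)(struct demo_vnode *);
};

struct demo_vnode {
	const struct demo_vops *v_op;	/* set by the owning filesystem */
	int v_opencount;
};

static int
demo_fs_open(struct demo_vnode *vp)
{
	vp->v_opencount++;
	return 0;
}

/* A filesystem fills in the table directly; unset slots stay NULL. */
const struct demo_vops demo_fs_vops = {
	.vop_open = demo_fs_open,
};

/* Dispatch wrappers: call through the table, fail if the slot is empty. */
int
DEMO_VOP_OPEN(struct demo_vnode *vp)
{
	assert(vp != NULL && vp->v_op != NULL);
	if (vp->v_op->vop_open == NULL)
		return EOPNOTSUPP;
	return vp->v_op->vop_open(vp);
}

int
DEMO_VOP_CLOSE(struct demo_vnode *vp)
{
	assert(vp != NULL && vp->v_op != NULL);
	if (vp->v_op->vop_close == NULL)
		return EOPNOTSUPP;
	return vp->v_op->vop_close(vp);
}
```

The trade-off named above shows up here too: with no generic fallback
vector, each filesystem's table has to list every operation it wants to
support.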


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in filesystems ported from freebsd. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8 code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard-coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow pulling out a usb stick with ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into an own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing lengths checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from ours (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches from McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a file's owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.282 29-Sep-2018 visa

Use atomic operations to update vfc_refcount. Change the field's type
to unsigned int.

OK deraadt@


# 1.281 26-Sep-2018 visa

Move the allocating and freeing of mount points into
dedicated functions.

OK deraadt@ mpi@


# 1.280 22-Sep-2018 fcambus

Harmonize spacing after ellipses in displayed messages.

We were using spacing after ellipses in an inconsistent way in the
installer. Standardize on using "... " everywhere and take into account
the cursor position while we are waiting for the task to complete: the
cursor is now always positioned after the last dot, and the space is
added when displaying completion confirmation.

While there, also take cursor position into account in vfs_shutdown(),
and remove the extra leading space before ticks in dhclient.

OK deraadt@


# 1.279 17-Sep-2018 visa

Simplify VFS initialization.

Because loadable kernel modules no longer exist, there is no need to
register or unregister filesystem implementations at runtime. Remove
vfs_register() and vfs_unregister(), and make vfsinit() call vfs_init
routines directly. Replace the linked list of vfsconf structs with
the vfsconflist[] array.

OK mpi@ bluhm@


# 1.278 16-Sep-2018 visa

Move vfsconf lookup code into dedicated functions.

OK bluhm@


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAVE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@
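
The hazard described above can be sketched in userland C. This is a
minimal illustration under stated assumptions, not the kernel code:
struct node, remove_after() and count_stale_visits() are hypothetical
names, and a dead flag stands in for the free that an unmount would
perform, so the demo stays well defined.

```c
#include <stddef.h>

/* Hypothetical stand-in for a mount point on a singly linked list. */
struct node {
	struct node *next;
	int dead;	/* set instead of calling free() */
};

/* Unlink the element following n and mark it dead. */
static void
remove_after(struct node *n)
{
	struct node *victim = n->next;

	if (victim != NULL) {
		n->next = victim->next;
		victim->dead = 1;
	}
}

/*
 * Walk the list.  While the first element is being processed, another
 * actor removes its successor (simulating an unmount during a sleep).
 * With save_next set, the successor captured before the body is used
 * (the _SAFE pattern); otherwise the next pointer is re-read after the
 * body.  Returns how many dead elements were visited.
 */
int
count_stale_visits(struct node *head, int save_next)
{
	struct node *n, *nn;
	int stale = 0;

	for (n = head; n != NULL; n = nn) {
		nn = n->next;		/* captured before the body */
		if (n->dead)
			stale++;
		if (n == head)
			remove_after(n);
		if (!save_next)
			nn = n->next;	/* re-read after the body */
	}
	return stale;
}
```

On a three-element list, the walk that keeps the early-captured pointer
steps onto the removed element, while the walk that re-reads the pointer
does not. The _SAFE form only protects against the loop body itself
removing the current element.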


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
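
A minimal sketch of this kind of bounds check, assuming a hypothetical
enum and name table (vtype_demo, vd_names and vd_name() are invented for
illustration and are not taken from vfs_subr.c):

```c
#include <stddef.h>

/* Hypothetical enum and lookup table. */
enum vtype_demo { VD_NON, VD_REG, VD_DIR };

static const char *const vd_names[] = { "VNON", "VREG", "VDIR" };

/*
 * Return the name for t, or NULL if t is out of range.  Casting to
 * unsigned int makes a single comparison cover both negative and
 * too-large values, whatever underlying type the compiler picked
 * for the enum.
 */
const char *
vd_name(enum vtype_demo t)
{
	if ((unsigned int)t >= sizeof(vd_names) / sizeof(vd_names[0]))
		return NULL;
	return vd_names[t];
}
```

The cast states the intended comparison type explicitly instead of
leaving it to the implementation-defined type of the enum.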


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
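
Why chained assignments are fragile can be shown with a small sketch.
struct attrs and both functions are hypothetical and much smaller than
the real struct vattr; the sketch assumes long is wider than unsigned
short, as on common platforms.

```c
/* Hypothetical two-field stand-in for struct vattr. */
struct attrs {
	long		a_size;
	unsigned short	a_mode;
};

/*
 * Chained form: -1 is first converted to unsigned short for a_mode,
 * and that converted value (not -1) is what gets stored in a_size.
 */
void
attrs_null_chained(struct attrs *ap)
{
	ap->a_size = ap->a_mode = -1;
}

/* Separate statements: each field receives its own conversion of -1. */
void
attrs_null_separate(struct attrs *ap)
{
	ap->a_mode = -1;
	ap->a_size = -1;
}
```

With the chained form a_size ends up holding the largest unsigned short
value rather than -1. Reordering the chain so the widest signed type is
assigned first would hide the problem, which is the ordering dependence
the message refers to.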


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in byte, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and dies.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@
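
The range check described here amounts to something like the following
sketch. check_ngroups() is a hypothetical helper, not the function in
vfs_subr.c, and the fallback value for NGROUPS_MAX is an assumption used
only when the system headers do not define it.

```c
#include <errno.h>
#include <limits.h>

#ifndef NGROUPS_MAX
#define NGROUPS_MAX 16	/* assumed fallback for this sketch only */
#endif

/*
 * Validate a user-supplied group count before it is used as a loop
 * bound or copy length.  Returns 0 if acceptable, EINVAL otherwise.
 */
int
check_ngroups(int ngroups)
{
	if (ngroups < 0 || ngroups > NGROUPS_MAX)
		return EINVAL;
	return 0;
}
```

Rejecting the value up front keeps later code from indexing a
fixed-size group array with an untrusted count.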


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continued performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__)))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@
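
A rough sketch of the membership test this message describes, assuming a
simplified credential structure. struct cred_demo and groupmember_demo()
are hypothetical; the field names merely echo those of struct ucred.

```c
#include <stddef.h>

/* Hypothetical, simplified credential. */
struct cred_demo {
	unsigned int	cr_gid;		/* effective gid */
	size_t		cr_ngroups;	/* entries used in cr_groups */
	unsigned int	cr_groups[16];	/* supplementary groups */
};

/* Return 1 if gid is the effective gid or a supplementary group. */
int
groupmember_demo(unsigned int gid, const struct cred_demo *cred)
{
	size_t i;

	if (cred->cr_gid == gid)
		return 1;
	for (i = 0; i < cred->cr_ngroups; i++)
		if (cred->cr_groups[i] == gid)
			return 1;
	return 0;
}
```

Checking the effective gid first gives the same answer whether or not
that gid also appears in the supplementary list.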


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro
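
A rough userland sketch of the idea above, not the kernel code: the
release routine reports whether it may have slept, and the walk restarts
only in that case. The names vnode_model, proc_model, vrele_model and
checkdirs_model, and the slow_last field, are hypothetical stand-ins.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's vnode and proc structures. */
struct vnode_model {
	int usecount;	/* active references */
	int slow_last;	/* nonzero: dropping the last reference may sleep */
};

struct proc_model {
	struct vnode_model *cwd;
};

/*
 * Drop one reference.  Returns 0 when the call is guaranteed not to
 * have slept, nonzero when it may have slept (modeled by slow_last).
 */
int
vrele_model(struct vnode_model *vp)
{
	vp->usecount--;
	if (vp->usecount > 0)
		return 0;
	return vp->slow_last ? 1 : 0;
}

/*
 * Point every process whose cwd is olddp at newdp.  The walk restarts
 * only when a release may have slept, because the process list could
 * have changed in the meantime.  Returns the number of restarts.
 */
int
checkdirs_model(struct proc_model *procs, size_t n,
    struct vnode_model *olddp, struct vnode_model *newdp)
{
	int restarts = 0;
	size_t i;

again:
	for (i = 0; i < n; i++) {
		if (procs[i].cwd != olddp)
			continue;
		newdp->usecount++;
		procs[i].cwd = newdp;
		if (vrele_model(olddp) != 0) {
			restarts++;
			goto again;
		}
	}
	return restarts;
}
```

With three processes sharing olddp, only the final release can sleep in
this model, so the walk restarts once instead of three times.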


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.
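
The shape of that conversion can be sketched in userland as follows.
This is only an analogy: item_get and the PRM_* flags are hypothetical
names for this example, with malloc(3) standing in for the pool
allocator.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag names for this sketch. */
#define PRM_WAITOK	0x01
#define PRM_ZERO	0x02

/*
 * Allocate an item; when PRM_ZERO is set the allocator clears it, so
 * callers no longer need a separate bzero()/memset() after the call.
 */
void *
item_get(size_t size, int flags)
{
	void *p = malloc(size);

	if (p != NULL && (flags & PRM_ZERO))
		memset(p, 0, size);
	return p;
}
```

A caller that used to write "p = item_get(sz, PRM_WAITOK); memset(p, 0,
sz);" becomes a single "p = item_get(sz, PRM_WAITOK | PRM_ZERO);", which
is where the saved bytes come from.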


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@
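
The superuser rule described above can be modeled roughly like this.
It is a simplified sketch, not the kernel's check: root_access_model is
a hypothetical name, and things like read-only file systems are ignored.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <errno.h>
#include <unistd.h>

/*
 * Model of the access decision for the superuser only.
 * Returns 0 on success or EACCES on failure.
 */
int
root_access_model(mode_t file_mode, int is_dir, int amode)
{
	/* Read and write requests always succeed for the superuser. */
	if ((amode & X_OK) == 0)
		return 0;
	/* Directories may always be searched. */
	if (is_dir)
		return 0;
	/* Non-directories need at least one execute bit set. */
	if (file_mode & (S_IXUSR | S_IXGRP | S_IXOTH))
		return 0;
	return EACCES;
}
```

For example, X_OK on a 0644 regular file fails in this model, while the
same request on a 0755 file or on any directory succeeds.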


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisation issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8; the code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow pulling out a usb stick with ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks of intense debugging and we need -current
to be stable, back out everything to the state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into an own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing length checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from ours (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches from McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a file's owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.277 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


# 1.276 02-Jul-2018 bluhm

Use more list macros for v_dirtyblkhd.
OK mpi@


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@
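
The hazard described above can be reproduced in userland. This is only a sketch, not the kernel code: struct mnt, unmount_once() and the walk_*() helpers are hypothetical names, and "unmounting" merely unlinks an entry and marks it dead instead of freeing it, so the stale access stays observable without undefined behaviour. It assumes a <sys/queue.h> that does not poison removed entries.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Some libcs ship <sys/queue.h> without the _SAFE variants; this
 * fallback follows the usual BSD definition. */
#ifndef TAILQ_FOREACH_SAFE
#define TAILQ_FOREACH_SAFE(var, head, field, tvar)			\
	for ((var) = TAILQ_FIRST(head);					\
	    (var) != NULL && ((tvar) = TAILQ_NEXT(var, field), 1);	\
	    (var) = (tvar))
#endif

struct mnt {				/* hypothetical stand-in for a mount point */
	int dead;			/* set once the entry has been "unmounted" */
	TAILQ_ENTRY(mnt) link;
};
TAILQ_HEAD(mntlist, mnt);

/* Put three live entries on an empty list. */
static void
build3(struct mntlist *head, struct mnt nodes[3])
{
	int i;

	TAILQ_INIT(head);
	for (i = 0; i < 3; i++) {
		nodes[i].dead = 0;
		TAILQ_INSERT_TAIL(head, &nodes[i], link);
	}
}

/* Simulates another process unmounting `victim' while the loop body
 * sleeps. A real kernel would free the entry here. */
static void
unmount_once(struct mntlist *head, struct mnt *victim)
{
	if (!victim->dead) {
		TAILQ_REMOVE(head, victim, link);
		victim->dead = 1;
	}
}

/* Plain macro: the next pointer is read after the body has run.
 * Returns the number of dead entries the loop visited. */
static int
walk_plain(struct mntlist *head, struct mnt *victim)
{
	struct mnt *mp;
	int stale = 0;

	TAILQ_FOREACH(mp, head, link) {
		stale += mp->dead;
		unmount_once(head, victim);
	}
	return stale;
}

/* _SAFE macro: the next pointer is cached before the body runs, so
 * the loop walks into the entry that was removed in the meantime. */
static int
walk_safe(struct mntlist *head, struct mnt *victim)
{
	struct mnt *mp, *nmp;
	int stale = 0;

	TAILQ_FOREACH_SAFE(mp, head, link, nmp) {
		stale += mp->dead;
		unmount_once(head, victim);
	}
	return stale;
}
```

With a three-entry list whose second entry is the victim, walk_plain() never sees the dead entry while walk_safe() visits it once. The _SAFE variant only protects against the loop body removing the current element, not against someone else removing the next one.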


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
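
A minimal sketch of the pattern, assuming a table indexed by an enum; the enum, the table and type_name() are made up for illustration and are not the code touched by this commit. Casting to unsigned int folds the "negative" and "too large" cases into one comparison and makes the intent explicit to the compiler.

```c
#include <stddef.h>

#ifndef nitems
#define nitems(a)	(sizeof((a)) / sizeof((a)[0]))
#endif

/* Hypothetical enum and name table, for illustration only. */
enum sketch_type { ST_NONE, ST_REG, ST_DIR };

static const char *const sketch_names[] = { "none", "reg", "dir" };

/* Return the name for `t', or "unknown" if it is outside the table. */
static const char *
type_name(enum sketch_type t)
{
	/* One unsigned comparison rejects both negative and oversized values. */
	if ((unsigned int)t >= nitems(sketch_names))
		return "unknown";
	return sketch_names[t];
}
```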


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
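
The fragility of chained assignments can be shown with two fields of different width and signedness. The struct and function names below are illustrative and are not the fields of struct vattr. In C the value of an assignment expression is the value stored in the left operand, after conversion to its type, so a narrow unsigned field in the middle of a chain truncates what the wider fields to its left receive.

```c
#include <limits.h>

/* Hypothetical attribute struct mixing a narrow unsigned field and a
 * wide signed one. */
struct sketch_attr {
	unsigned char small;
	long wide;
};

/* Unsafe order: -1 is first converted to unsigned char (UCHAR_MAX),
 * and that converted value is what `wide' receives. */
static void
null_chained_unsafe(struct sketch_attr *a)
{
	a->wide = a->small = -1;
}

/* Safe order: `wide' gets -1, then `small' gets the truncated value. */
static void
null_chained_safe(struct sketch_attr *a)
{
	a->small = a->wide = -1;
}

/* Separate statements do not depend on the order at all. */
static void
null_separate(struct sketch_attr *a)
{
	a->small = -1;
	a->wide = -1;
}
```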


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in bytes, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and dies.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@
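
The range check amounts to the following sketch. check_ngroups() is a hypothetical helper, not the function changed by this commit, and the fallback value for NGROUPS_MAX is an assumption used only when <limits.h> does not provide one.

```c
#include <errno.h>
#include <limits.h>

#ifndef NGROUPS_MAX
#define NGROUPS_MAX	16	/* assumed fallback, for illustration only */
#endif

/* Validate a group count supplied by userland before it is used to
 * size or index the group array. Returns 0 or EINVAL. */
static int
check_ngroups(long ngroups)
{
	if (ngroups < 0 || ngroups > NGROUPS_MAX)
		return EINVAL;
	return 0;
}
```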


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too
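
The usual form of this fix, as a userland sketch: the width of time_t differs between platforms, so the value is cast to long long and printed with %lld. format_ctime() is a hypothetical helper and not code from this file.

```c
#include <stdio.h>
#include <time.h>

/* Format a time_t portably. Returns what snprintf() returns, i.e. the
 * number of characters the full output needs. */
static int
format_ctime(char *buf, size_t len, time_t t)
{
	return snprintf(buf, len, "%lld", (long long)t);
}
```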


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continued performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__)))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@
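
A rough sketch of the resulting check, assuming a credential structure that holds an effective gid plus a list of supplementary groups. struct sketch_cred and groupmember_sketch() are illustrative names and do not reproduce the kernel's struct ucred.

```c
#include <sys/types.h>

#define SKETCH_NGROUPS	16	/* arbitrary size for the illustration */

/* Hypothetical credential: effective gid plus supplementary groups. */
struct sketch_cred {
	gid_t	cr_gid;
	int	cr_ngroups;
	gid_t	cr_groups[SKETCH_NGROUPS];
};

/* Return 1 if `gid' is the effective gid or a supplementary group. */
static int
groupmember_sketch(gid_t gid, const struct sketch_cred *cred)
{
	int i;

	/* The effective gid counts even when it is not also listed
	 * among the supplementary groups. */
	if (cred->cr_gid == gid)
		return 1;
	for (i = 0; i < cred->cr_ngroups; i++) {
		if (cred->cr_groups[i] == gid)
			return 1;
	}
	return 0;
}
```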


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in filesystems ported from freebsd. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@
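
In outline, the rule for the superuser looks like the sketch below. root_exec_check() is a hypothetical helper rather than the kernel's access check, and it models only the X_OK case for uid 0 described above.

```c
#include <errno.h>
#include <sys/stat.h>

/* X_OK decision for the superuser. `is_dir' is nonzero for
 * directories; `mode' holds the permission bits of the file.
 * Returns 0 on success or EACCES. */
static int
root_exec_check(int is_dir, unsigned int mode)
{
	/* Directories stay searchable by root regardless of mode. */
	if (is_dir)
		return 0;
	/* Other files need at least one execute bit set. */
	if (mode & (S_IXUSR | S_IXGRP | S_IXOTH))
		return 0;
	return EACCES;
}
```

Under this rule a mode 0644 regular file fails X_OK even for root, while mode 0700 or 0001 succeeds.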


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8 code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow to pull out a usb stick with ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@
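
To illustrate the mismatch described above, here is a minimal sketch, assuming a hypothetical two-field record (demo_rec is not the structure used by the kernel or by pstat(8)). It compares the size a hand-packed encoder would emit with the size and field offset the compiler uses under the regular structure packing rules.

```c
#include <stddef.h>

/* Hypothetical record, for illustration only. */
struct demo_rec {
	char	r_tag;		/* 1 byte, usually followed by padding */
	long	r_value;	/* aligned to the platform's long alignment */
};

/* Bytes emitted by a producer that copies the fields back to back. */
size_t
demo_packed_size(void)
{
	return sizeof(char) + sizeof(long);
}

/* Bytes expected by a consumer that reads a struct demo_rec. */
size_t
demo_struct_size(void)
{
	return sizeof(struct demo_rec);
}

/* Where the consumer expects r_value to start. */
size_t
demo_value_offset(void)
{
	return offsetof(struct demo_rec, r_value);
}
```

Whenever demo_struct_size() exceeds demo_packed_size(), a reader that assumes the struct layout will misread a packed stream. Defining the structure once in a shared header, as the commit does, avoids the problem.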


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into an own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing lengths checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from ours (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches from McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a files owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.275 06-Jun-2018 bluhm

The function dounmount() traverses the mnt_list in forward direction
to call vfs_busy() for all nested mount points. vfs_stall() called
vfs_busy() in reverse order for all mount points. Change the
direction of the latter to resolve the lock order conflict.
OK visa@


# 1.274 04-Jun-2018 guenther

Add VB_DUPOK to suppress witness(4) warning of concurrent mount locks.
Use that in three places:
- vfs_stall()
- sys_mount()
- dounmount()'s MNT_FORCE-does-recursive-unmounts case

ok deraadt@ visa@


# 1.273 27-May-2018 visa

Drop unnecessary `p' parameter from vget(9).

OK mpi@


# 1.272 08-May-2018 bluhm

When looping over mount points, the FOREACH_SAFE macro is not safe.
The loop variable mp is protected by vfs_busy() so that it cannot
be unmounted. But the next mount point nmp could be unmounted while
VFS_SYNC() sleeps. As the loop in vfs_stall() does not destroy the
mount point, TAILQ_FOREACH_REVERSE without _SAFE is the correct
macro to use.
OK deraadt@ visa@


# 1.271 08-May-2018 mpi

Move the vfs stall "barrier" logic to a function. FREF() will soon
change and this has nothing to do with it.

ok visa@, bluhm@


# 1.270 07-May-2018 bluhm

Print the vp pointer in the vinvalbuf() panic strings.
OK mpi@


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@
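
The hazard described above can be sketched in userland. This is a minimal illustration, assuming a hypothetical singly linked list (demo_mount and the demo_* helpers are invented here, not kernel code). The "concurrent unmount during a sleep" is simulated by unlinking the next element from inside the loop body, and unlinked elements are only marked dead rather than freed so the demonstration stays well-defined.

```c
#include <stddef.h>

/* Hypothetical list element standing in for a mount point. */
struct demo_mount {
	struct demo_mount	*m_next;
	int			 m_dead;	/* set once unlinked */
};

/* Unlink the element after mp, as another process might while we sleep. */
static void
demo_unmount_next(struct demo_mount *mp)
{
	struct demo_mount *victim = mp->m_next;

	if (victim != NULL) {
		mp->m_next = victim->m_next;
		victim->m_dead = 1;
	}
}

/*
 * "_SAFE" style walk: the next pointer is saved before the body runs.
 * Returns how many already-unlinked elements were visited.
 */
int
demo_walk_saved_next(struct demo_mount *head)
{
	struct demo_mount *mp, *nmp;
	int stale = 0;

	for (mp = head; mp != NULL; mp = nmp) {
		nmp = mp->m_next;		/* saved too early */
		if (mp->m_dead)
			stale++;
		if (mp == head)
			demo_unmount_next(mp);	/* simulated sleep */
	}
	return stale;
}

/* Plain walk: the next pointer is read only after the body has run. */
int
demo_walk_plain(struct demo_mount *head)
{
	struct demo_mount *mp;
	int stale = 0;

	for (mp = head; mp != NULL; mp = mp->m_next) {
		if (mp->m_dead)
			stale++;
		if (mp == head)
			demo_unmount_next(mp);	/* simulated sleep */
	}
	return stale;
}
```

With a three-element list, the saved-next walk visits the unlinked element while the plain walk does not. In the kernel the stale element could already have been freed, which is why the saved pointer is the unsafe one whenever the body can sleep.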


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
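
A minimal sketch of the bounds-check pattern named in this commit, assuming a hypothetical enum and name table (demo_type and demo_names are illustrative, not the identifiers in vfs_subr.c). Casting to an unsigned type lets one comparison reject both negative and too-large values.

```c
#include <stddef.h>

/* Hypothetical enum and lookup table, for illustration only. */
enum demo_type { DEMO_NON, DEMO_REG, DEMO_DIR };

static const char *const demo_names[] = { "non", "reg", "dir" };

/*
 * Return the name for t, or "unknown" if t is outside the table.
 * The cast makes the comparison unsigned, so a negative value wraps
 * to a large one and fails the same test as an oversized value.
 */
const char *
demo_type_name(enum demo_type t)
{
	if ((unsigned int)t >= sizeof(demo_names) / sizeof(demo_names[0]))
		return "unknown";
	return demo_names[t];
}
```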


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@
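
A minimal sketch of the guard this commit describes, assuming a hypothetical operations table with a single optional hook (demo_vfsops and the handler are invented for illustration, and EOPNOTSUPP is an assumed choice of error, not confirmed by the commit message).

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical operations table with one optional hook. */
struct demo_vfsops {
	int	(*vfs_sysctl)(int name);
};

/* Sample handler: accepts only name 1. */
int
demo_handler(int name)
{
	return name == 1 ? 0 : EINVAL;
}

const struct demo_vfsops demo_ops_with = { demo_handler };
const struct demo_vfsops demo_ops_without = { NULL };

/* Dispatch to the filesystem's hook, refusing cleanly if it has none. */
int
demo_fs_sysctl(const struct demo_vfsops *ops, int name)
{
	if (ops->vfs_sysctl == NULL)
		return EOPNOTSUPP;	/* assumed error code */
	return ops->vfs_sysctl(name);
}
```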


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb
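
The idea of a separately allocated clone bitmap can be sketched as follows. This is a userland illustration under stated assumptions: the structure, the helper names and the allocate-on-first-use policy are hypothetical, and only the 1024 limit comes from the commit message.

```c
#include <limits.h>
#include <stdlib.h>

#define DEMO_NCLONES	1024	/* limit named in the commit message */
#define DEMO_WORDBITS	(sizeof(unsigned long) * CHAR_BIT)
#define DEMO_NWORDS	((DEMO_NCLONES + DEMO_WORDBITS - 1) / DEMO_WORDBITS)

/* Hypothetical device node: it carries only a pointer to the bitmap. */
struct demo_devnode {
	unsigned long	*dn_clonemap;	/* NULL until the first clone */
};

/* Claim the lowest free clone instance; returns -1 if none is left. */
int
demo_clone_alloc(struct demo_devnode *dn)
{
	size_t i;

	if (dn->dn_clonemap == NULL) {
		/* Assumed policy: allocate the bitmap on first use. */
		dn->dn_clonemap = calloc(DEMO_NWORDS, sizeof(unsigned long));
		if (dn->dn_clonemap == NULL)
			return -1;
	}
	for (i = 0; i < DEMO_NCLONES; i++) {
		unsigned long bit = 1UL << (i % DEMO_WORDBITS);

		if ((dn->dn_clonemap[i / DEMO_WORDBITS] & bit) == 0) {
			dn->dn_clonemap[i / DEMO_WORDBITS] |= bit;
			return (int)i;
		}
	}
	return -1;
}

/* Release a previously claimed instance number. */
void
demo_clone_free(struct demo_devnode *dn, int n)
{
	if (dn->dn_clonemap == NULL || n < 0 || n >= DEMO_NCLONES)
		return;
	dn->dn_clonemap[n / DEMO_WORDBITS] &= ~(1UL << (n % DEMO_WORDBITS));
}
```

Nodes that never clone pay only for one pointer, which is the "avoid bloating all device nodes" point made in the message.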


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
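
The fragility of chained assignments can be shown with a minimal sketch, assuming a hypothetical two-field structure (demo_attr is not struct vattr). In a chain, the right-hand value is first converted to the type of the inner lvalue, and that converted value is what the outer lvalue receives.

```c
/* Hypothetical attribute structure mixing a narrow and a wide field. */
struct demo_attr {
	unsigned short	da_mode;	/* narrow, unsigned */
	long		da_size;	/* wide, signed */
};

/*
 * Chained form: -1 is converted to unsigned short first, so da_size
 * receives that converted (large positive) value, not -1.
 */
void
demo_null_chained(struct demo_attr *a)
{
	a->da_size = a->da_mode = -1;
}

/* Separate statements: each field is converted from -1 on its own. */
void
demo_null_separate(struct demo_attr *a)
{
	a->da_mode = -1;
	a->da_size = -1;
}
```

Reversing the order of the chain would happen to give the intended result here, which is exactly the order dependence the commit message calls fragile.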


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in bytes, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and dies.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continued performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8 code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow to pull out an usb stick with ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return whether it actually succeeded in freeing some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks of intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into its own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Oops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing length checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from ours (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches for McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a files owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.269 02-May-2018 visa

Remove proc from the parameters of vn_lock(). The parameter is
unnecessary because curproc always does the locking.

OK mpi@


# 1.268 28-Apr-2018 visa

Clean up the parameters of VOP_LOCK() and VOP_UNLOCK(). It is always
curproc that does the locking or unlocking, so the proc parameter
is pointless and can be dropped.

OK mpi@, deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
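
The cast described above can be sketched in isolation. The enum, the name
table and the function below are hypothetical stand-ins, not the code touched
by this revision; the sketch only assumes that an enum value is used to index
a fixed-size table. Casting to an unsigned type turns the range test into a
single comparison that also rejects values that would be negative as a signed
int.

```c
#include <stddef.h>

/* Hypothetical enum standing in for a kernel type tag. */
enum demo_type { DT_NONE, DT_REG, DT_DIR };

/* Name table indexed by the enum value. */
static const char *const demo_names[] = { "none", "reg", "dir" };

#define DEMO_NITEMS(a)	(sizeof(a) / sizeof((a)[0]))

/*
 * Return a printable name for t, or "unknown" when t is outside the table.
 * The comparison is done on an unsigned value, so one test covers both
 * "too large" and "would be negative as a signed int".
 */
const char *
demo_type_name(enum demo_type t)
{
	if ((unsigned int)t >= DEMO_NITEMS(demo_names))
		return "unknown";
	return demo_names[(unsigned int)t];
}
```

A caller holding a corrupted or future enum value then gets "unknown" back
instead of an out-of-bounds read.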


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb
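
A separately allocated bitmap of the kind described above might look like the
following userland sketch. All names here (demo_dev, demo_clone_alloc,
demo_clone_free, DEMO_CLONE_MAX) are hypothetical and calloc() stands in for
the kernel allocator; only the 1024-unit limit and the idea of keeping the
bitmap out of the per-device structure come from the commit message.

```c
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_CLONE_MAX	1024u	/* limit named in the commit message */
#define DEMO_WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define DEMO_WORDS	((DEMO_CLONE_MAX + DEMO_WORD_BITS - 1) / DEMO_WORD_BITS)

/* Hypothetical device node: it only carries a pointer to the bitmap. */
struct demo_dev {
	unsigned long *clone_map;	/* allocated on first use */
};

/* Claim the lowest free clone unit; return it, or -1 if none is left. */
int
demo_clone_alloc(struct demo_dev *d)
{
	size_t i;

	if (d->clone_map == NULL) {
		d->clone_map = calloc(DEMO_WORDS, sizeof(*d->clone_map));
		if (d->clone_map == NULL)
			return -1;
	}
	for (i = 0; i < DEMO_CLONE_MAX; i++) {
		unsigned long bit = 1UL << (i % DEMO_WORD_BITS);

		if ((d->clone_map[i / DEMO_WORD_BITS] & bit) == 0) {
			d->clone_map[i / DEMO_WORD_BITS] |= bit;
			return (int)i;
		}
	}
	return -1;
}

/* Release a unit previously returned by demo_clone_alloc(). */
void
demo_clone_free(struct demo_dev *d, int unit)
{
	size_t u = (unsigned int)unit;

	if (d->clone_map == NULL || u >= DEMO_CLONE_MAX)
		return;
	d->clone_map[u / DEMO_WORD_BITS] &= ~(1UL << (u % DEMO_WORD_BITS));
}
```

Devices that never clone pay only for one pointer, which is the point of
allocating the bitmap separately.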


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
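
The fragility of chained assignments can be shown with a small sketch. The
structure and functions below are hypothetical and much simpler than struct
vattr; they only assume one narrow unsigned field and one wide signed field.
A chained assignment is evaluated right to left, and each step yields the
value as stored in the left operand, so the narrow field silently changes
what the wide field receives.

```c
#include <stdint.h>

/* Hypothetical attribute record with fields of different widths. */
struct demo_attr {
	uint16_t small;		/* narrow, unsigned */
	int64_t	 wide;		/* wide, signed */
};

/*
 * Chained form: "small = noval" is evaluated first and its result has the
 * type and value of small, so wide receives the truncated value.
 */
void
demo_null_chained(struct demo_attr *a, int noval)
{
	a->wide = a->small = noval;
}

/* Separate statements: each field is converted from noval directly. */
void
demo_null_separate(struct demo_attr *a, int noval)
{
	a->small = noval;
	a->wide = noval;
}
```

With noval set to -1, the chained form leaves 65535 in both fields, while the
separate form leaves -1 in the wide field. Reordering the chain changes the
result, which is why the order of the types matters.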


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in bytes, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and dies.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@
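
A range check of this kind can be sketched as follows. The structure, the
limit and the function name are hypothetical (DEMO_NGROUPS_MAX is an assumed
value, not the system's NGROUPS_MAX); the sketch only illustrates rejecting a
user-supplied group count before it is used to size a copy.

```c
#include <errno.h>

#define DEMO_NGROUPS_MAX	16	/* assumed limit for illustration */

/* Hypothetical credential record as it might arrive from userland. */
struct demo_xucred {
	unsigned int	cr_uid;
	unsigned int	cr_gid;
	short		cr_ngroups;	/* signed, so it can be negative */
	unsigned int	cr_groups[DEMO_NGROUPS_MAX];
};

/* Return EINVAL for a group count outside [0, DEMO_NGROUPS_MAX], else 0. */
int
demo_check_cred(const struct demo_xucred *xc)
{
	if (xc->cr_ngroups < 0 || xc->cr_ngroups > DEMO_NGROUPS_MAX)
		return EINVAL;
	return 0;
}
```

Validating the count up front means later loops over cr_groups never index
outside the array.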


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continuing performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__)))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the "buf_acquire() does the bremfree()" change, since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro
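
A minimal sketch of why a "did not sleep" return value helps a list walk, assuming toy types throughout. toy_vnode, toy_proc, toy_vrele and toy_checkdirs are hypothetical stand-ins, and the sleep is only modelled: a return of 1 means the release may have slept, so the walk can no longer trust its position and must start over.

```c
#include <assert.h>

/* Toy vnode with just a use count. */
struct toy_vnode {
	unsigned int usecount;
};

/* Toy process entry: the vnode it references and whether it was handled. */
struct toy_proc {
	struct toy_vnode *cwd;
	int done;
};

/*
 * Drop one reference. Returns 0 when the call certainly did not sleep,
 * 1 when the last reference went away (where a real kernel might sleep
 * while making the vnode inactive).
 */
int
toy_vrele(struct toy_vnode *vp)
{
	assert(vp->usecount > 0);
	if (--vp->usecount > 0)
		return 0;
	return 1;
}

/*
 * Release every entry's vnode. The walk restarts only when toy_vrele()
 * reports a possible sleep. Returns the number of restarts.
 */
int
toy_checkdirs(struct toy_proc *procs, int nprocs)
{
	int i, restarts = 0;

again:
	for (i = 0; i < nprocs; i++) {
		if (procs[i].done)
			continue;
		procs[i].done = 1;
		if (toy_vrele(procs[i].cwd)) {
			restarts++;
			goto again;
		}
	}
	return restarts;
}
```

Without the return value the walker would have to assume every release slept and restart after each one.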


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.
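
A userland sketch of the conversion pattern in 1.172, assuming a toy allocator. toy_pool_get and TOY_PR_ZERO are hypothetical stand-ins for pool_get() and PR_ZERO and do not reproduce the kernel API; the point is only that zeroing moves from every call site into the allocator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PR_ZERO 0x01	/* hypothetical flag, stands in for PR_ZERO */

struct toy_item {
	int id;
	char name[8];
};

/* Toy allocator: zeroes the memory itself when asked to. */
void *
toy_pool_get(size_t size, int flags)
{
	void *p = malloc(size);

	if (p != NULL && (flags & TOY_PR_ZERO))
		memset(p, 0, size);
	return p;
}

/* Old pattern: allocate, then zero by hand at the call site. */
struct toy_item *
toy_item_alloc_old(void)
{
	struct toy_item *it = toy_pool_get(sizeof(*it), 0);

	if (it != NULL)
		memset(it, 0, sizeof(*it));
	return it;
}

/* New pattern: one call, zeroing requested through the flag. */
struct toy_item *
toy_item_alloc_new(void)
{
	return toy_pool_get(sizeof(struct toy_item), TOY_PR_ZERO);
}
```

Both helpers return identically zeroed objects; the second simply has less code per call site.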


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8; code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow pulling out a usb stick with ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks of intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into an own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing lengths checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from ours (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches for McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a file's owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


Revision tags: OPENBSD_6_3_BASE
# 1.267 07-Mar-2018 bluhm

Remounting file systems read-only does not work reliably. There
are corner cases where ffs may leak blocks. So better revert and
unmount all file systems at reboot. The "init died" panic will be
fixed in a different way.
OK deraadt@


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
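
The idiom described above can be sketched in userland C. This is only an
illustration under assumptions: `enum kind`, `kind_names` and `kind_name()`
are hypothetical names, not the code touched by this revision. Casting the
enum to an unsigned type lets one comparison reject both negative and
too-large values, without a separate `>= 0` test for the compiler to flag.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical enum standing in for a kernel enumeration. */
enum kind { KIND_NONE, KIND_REG, KIND_DIR, KIND_MAX };

static const char *const kind_names[] = { "none", "reg", "dir" };

/*
 * Return a printable name for k, or "unknown" when k is out of range.
 * The cast to unsigned int makes a negative value wrap to a large one,
 * so a single upper-bound comparison covers both ends of the range.
 */
const char *
kind_name(enum kind k)
{
	if ((unsigned int)k >= KIND_MAX)
		return "unknown";
	return kind_names[k];
}
```

A caller passing a corrupted value such as `(enum kind)-1` gets "unknown"
instead of an out-of-bounds array read.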


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb
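
A rough userland sketch of a separately allocated 1024-bit clone bitmap
follows. It is not the kernel implementation: `struct clone_map`,
`clone_map_init()`, `clone_map_alloc()` and `clone_map_release()` are
hypothetical names, and `calloc()` stands in for the kernel allocator. Only
the limit of 1024 and the separate allocation come from the commit message.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

#define CLONE_MAX	1024	/* limit named in the commit message */
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define CLONE_WORDS	(CLONE_MAX / BITS_PER_WORD)

/* The owner holds only a pointer; the bits live in their own allocation. */
struct clone_map {
	unsigned long *bits;
};

/* Allocate a zeroed bitmap. Returns 0 on success, -1 on failure. */
int
clone_map_init(struct clone_map *m)
{
	m->bits = calloc(CLONE_WORDS, sizeof(*m->bits));
	return m->bits != NULL ? 0 : -1;
}

/* Claim the lowest free instance number, or return -1 if all are in use. */
int
clone_map_alloc(struct clone_map *m)
{
	size_t i;

	for (i = 0; i < CLONE_MAX; i++) {
		unsigned long mask = 1UL << (i % BITS_PER_WORD);

		if ((m->bits[i / BITS_PER_WORD] & mask) == 0) {
			m->bits[i / BITS_PER_WORD] |= mask;
			return (int)i;
		}
	}
	return -1;
}

/* Mark instance number i as free again; out-of-range values are ignored. */
void
clone_map_release(struct clone_map *m, int i)
{
	if (i < 0 || i >= CLONE_MAX)
		return;
	m->bits[(size_t)i / BITS_PER_WORD] &=
	    ~(1UL << ((size_t)i % BITS_PER_WORD));
}
```

Keeping only a pointer in the owning structure is what avoids growing every
device node by 128 bytes when most of them never clone.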


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
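
The fragility of chained assignments can be shown with a small userland
sketch. `struct attrs` and both functions are hypothetical and much simpler
than vattr_null(). In C the value of an assignment expression has the type
of its left operand, so a value passed through a narrow unsigned member is
truncated before it reaches the wider one.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical structure with members of different width and signedness. */
struct attrs {
	int64_t	wide;
	uint8_t	narrow;
};

/*
 * Chained form: (a->narrow = -1) has type uint8_t and value 255,
 * so a->wide receives 255 rather than -1.
 */
void
attrs_null_chained(struct attrs *a)
{
	a->wide = a->narrow = -1;
}

/* Separate statements: each member is converted from -1 on its own. */
void
attrs_null_separate(struct attrs *a)
{
	a->wide = -1;
	a->narrow = -1;
}
```

With the operands written in the other order (`a->narrow = a->wide = -1`)
both members would end up as intended, which is why the result depends on
the ordering of the types involved.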


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in byte, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and dies.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@
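
A minimal sketch of the range check described above, assuming a simplified
credential structure. `struct sketch_xucred` and `check_export_cred()` are
hypothetical names for illustration, and the fallback value for NGROUPS_MAX
is likewise an assumption for systems whose <limits.h> does not define it.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

#ifndef NGROUPS_MAX
#define NGROUPS_MAX 16	/* assumed fallback, for illustration only */
#endif

/* Hypothetical stand-in for the credentials passed in from userland. */
struct sketch_xucred {
	int	cr_ngroups;
};

/*
 * Reject group counts that would index outside the groups array.
 * Returns 0 when the count is usable, EINVAL otherwise.
 */
int
check_export_cred(const struct sketch_xucred *cr)
{
	if (cr->cr_ngroups < 0 || cr->cr_ngroups > NGROUPS_MAX)
		return EINVAL;
	return 0;
}
```

Validating the count before any copy or loop uses it is what turns a
potential panic into an ordinary error return.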


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free
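
The preference can be illustrated in userland C. `struct rec` and
`rec_alloc()` are hypothetical, and `calloc()` is used here in place of the
kernel malloc(9) interface, whose arguments differ.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record type. */
struct rec {
	int	id;
	char	name[32];
};

/*
 * Size the allocation from the pointer being assigned. If the type of r
 * changes later, sizeof(*r) follows it automatically, whereas a spelled-out
 * sizeof(struct rec) would have to be updated by hand.
 */
struct rec *
rec_alloc(void)
{
	struct rec *r;

	r = calloc(1, sizeof(*r));
	return r;
}
```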


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continuing performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__)))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@
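
A simplified sketch of a membership test that also considers the effective
gid. `struct sketch_cred` and `cred_in_group()` are hypothetical names, and
the layout is an assumption rather than the kernel's struct ucred.

```c
#include <assert.h>
#include <sys/types.h>

#define SKETCH_NGROUPS 16	/* assumed array size for this sketch */

/* Hypothetical credential: effective gid plus supplementary groups. */
struct sketch_cred {
	gid_t	cr_gid;
	int	cr_ngroups;
	gid_t	cr_groups[SKETCH_NGROUPS];
};

/*
 * Return 1 if gid is the effective gid or one of the supplementary
 * groups, 0 otherwise. Checking cr_gid first keeps the answer correct
 * when the effective gid is not repeated in cr_groups.
 */
int
cred_in_group(const struct sketch_cred *cr, gid_t gid)
{
	int i;

	if (cr->cr_gid == gid)
		return 1;
	for (i = 0; i < cr->cr_ngroups; i++)
		if (cr->cr_groups[i] == gid)
			return 1;
	return 0;
}
```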


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8 code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow pulling out a usb stick with ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@
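
The mismatch described in this entry can be reproduced in userland. The sketch below is illustrative only: struct rec and both encoder names are hypothetical and are not the kernel's or pstat(8)'s actual definitions. It shows that copying fields back to back produces a different byte count than copying the C structure, whose layout includes alignment padding.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record; most ABIs pad after `tag` so `value` is aligned. */
struct rec {
	uint8_t		tag;
	uint64_t	value;
};

/* Packed encoding: fields are copied back to back, with no padding. */
size_t
encode_packed(unsigned char *buf, const struct rec *r)
{
	size_t off = 0;

	memcpy(buf + off, &r->tag, sizeof(r->tag));
	off += sizeof(r->tag);
	memcpy(buf + off, &r->value, sizeof(r->value));
	off += sizeof(r->value);
	return off;
}

/* Structure encoding: the whole object is copied, padding included. */
size_t
encode_struct(unsigned char *buf, const struct rec *r)
{
	memcpy(buf, r, sizeof(*r));
	return sizeof(*r);
}
```

A reader that steps through the packed buffer using sizeof(struct rec) as its stride would pick up fields at the wrong offsets, which is the kind of breakage a single shared header definition avoids.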


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into an own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing length checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expect the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from our (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches from McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a files owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.266 10-Feb-2018 deraadt

Synchronize filesystems to disk when suspending. Each mountpoint's vnodes
are pushed to disk. Dangling vnodes (unlinked files still in use) and
vnodes undergoing change by long-running syscalls are identified -- and
such filesystems are marked dirty on-disk while we are suspended (in case
power is lost, a fsck will be required). Filesystems without dangling or
busy vnodes are marked clean, resulting in faster boots following
"battery died" circumstances.
Tested by numerous developers, thanks for the feedback.


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
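
A minimal sketch of the bounds-check idiom this entry describes, written for userland. The enum and the name table are hypothetical stand-ins, not the kernel's vtype definitions. Casting the enum to an unsigned type turns any negative value into a very large one, so a single comparison against the table size rejects both ends of the range.

```c
#include <stddef.h>

/* Hypothetical enum and name table, for illustration only. */
enum vtype_demo { VD_NON, VD_REG, VD_DIR, VD_BAD };

static const char *const vtype_names[] = { "VNON", "VREG", "VDIR", "VBAD" };

const char *
vtype_name(enum vtype_demo t)
{
	/* One unsigned comparison covers "too large" and "negative". */
	if ((unsigned int)t >= sizeof(vtype_names) / sizeof(vtype_names[0]))
		return "unknown";
	return vtype_names[t];
}
```

The explicit cast also states the intent to the compiler, which is presumably why it quiets a diagnostic that would otherwise reason about the enum's declared range.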


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb
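
A rough userland sketch of a separately allocated clone bitmap, under stated assumptions: only the 1024 limit comes from the message above, while clone_map_alloc(), clone_get() and clone_put() are hypothetical names and calloc() stands in for the kernel allocator. The owner keeps just a pointer, so nodes that never clone pay no space for the bitmap.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

#define CLONE_MAX	1024	/* limit named in the commit message */
#define WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define CLONE_WORDS	(CLONE_MAX / WORD_BITS)

/* Allocate the bitmap on its own so the owning node stores only a pointer. */
unsigned long *
clone_map_alloc(void)
{
	return calloc(CLONE_WORDS, sizeof(unsigned long));
}

/* Claim the lowest free instance number, or return -1 when all are taken. */
int
clone_get(unsigned long *map)
{
	size_t i;

	for (i = 0; i < CLONE_MAX; i++) {
		unsigned long bit = 1UL << (i % WORD_BITS);

		if ((map[i / WORD_BITS] & bit) == 0) {
			map[i / WORD_BITS] |= bit;
			return (int)i;
		}
	}
	return -1;
}

/* Release a previously claimed instance number. */
void
clone_put(unsigned long *map, int unit)
{
	assert(unit >= 0 && unit < CLONE_MAX);
	map[unit / WORD_BITS] &= ~(1UL << (unit % WORD_BITS));
}
```

Sharing one such map between aliases of the same device, as the following revision does, keeps two aliases from handing out the same instance number.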


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
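
The fragility of chained assignments can be shown with a small userland example. struct attrs and both function names are hypothetical and only mimic the shape of the problem; they are not the real struct vattr. In a chain, the value stored on the left is the already-converted value of the assignment on the right, so a narrow unsigned field in the middle changes what a wider signed field receives.

```c
#include <stdint.h>

/* Hypothetical attribute record mixing a wide signed and a narrow unsigned field. */
struct attrs {
	int64_t		size;
	uint16_t	mode;
};

/* Chained: -1 is first converted to uint16_t (65535), then widened into size. */
void
attrs_null_chained(struct attrs *a)
{
	a->size = a->mode = -1;
}

/* Separate statements: each field gets its own conversion of -1. */
void
attrs_null_separate(struct attrs *a)
{
	a->size = -1;
	a->mode = -1;
}
```

With the chained form size ends up as 65535 rather than -1, while the separate form gives each field the intended "no value" pattern regardless of the order in which the fields are written.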


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in bytes, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and can die.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert
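
A small sketch of what such a conversion involves, assuming a hypothetical helper named copy_bytes(). The point to watch is the argument order: bcopy() takes the source first, memcpy() takes the destination first. Note also that bcopy() tolerates overlapping regions while memcpy() does not, so a conversion like this is only safe where the regions are known not to overlap.

```c
#include <assert.h>
#include <string.h>

/*
 * Old style: bcopy(src, dst, len);
 * New style: memcpy(dst, src, len);	(destination comes first)
 */
void
copy_bytes(void *dst, const void *src, size_t len)
{
	assert(dst != NULL && src != NULL);
	memcpy(dst, src, len);	/* regions must not overlap */
}
```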


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free
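
The idiom can be illustrated with a short userland sketch. The struct item type and item_alloc() are hypothetical, and calloc(3) stands in for the kernel allocator, whose interface differs.

```c
#include <stdlib.h>

/* Hypothetical structure, used only to illustrate the idiom. */
struct item {
	int id;
	char name[32];
};

/* Allocate one zeroed item; the size follows the pointer's own type. */
struct item *
item_alloc(void)
{
	struct item *ip;

	ip = calloc(1, sizeof(*ip));	/* rather than sizeof(struct item) */
	return ip;
}
```

With sizeof(*ip), the allocation stays correct if the type of ip is later changed, because the size is derived from the pointer rather than repeated by hand.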


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@
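
As a rough illustration of the resulting loop style, the sketch below walks a TAILQ and stops on NULL rather than on an *_END macro. The struct node type and nodelist_sum() are hypothetical; only the TAILQ macros from <sys/queue.h> are real.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical list element. */
struct node {
	int value;
	TAILQ_ENTRY(node) entries;
};
TAILQ_HEAD(nodelist, node);

/* Sum all values, ending the walk on NULL instead of TAILQ_END(). */
int
nodelist_sum(struct nodelist *head)
{
	struct node *np;
	int sum = 0;

	assert(head != NULL);
	for (np = TAILQ_FIRST(head); np != NULL; np = TAILQ_NEXT(np, entries))
		sum += np->value;
	return sum;
}
```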


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too
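
The fix pattern is the usual one for printing a time_t portably: cast to long long and use %lld. A minimal sketch, with format_time() as a hypothetical helper name and the userland snprintf(3) standing in for the kernel printf:

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Format a time_t without assuming its width: cast and print as %lld. */
int
format_time(char *buf, size_t len, time_t t)
{
	assert(buf != NULL);
	return snprintf(buf, len, "%lld", (long long)t);
}
```

The cast keeps the format string and the argument in agreement whether time_t is 32 or 64 bits wide on a given platform.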


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continuing performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__((__format__(__kprintf__))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@
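
A simplified sketch of the check this describes, under stated assumptions: struct sketch_cred, SKETCH_NGROUPS and sketch_groupmember() are hypothetical names, and the real struct ucred has a different layout.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

#define SKETCH_NGROUPS 16	/* illustrative limit, not NGROUPS_MAX */

/* Hypothetical, simplified credential structure. */
struct sketch_cred {
	gid_t	cr_gid;				/* effective gid */
	int	cr_ngroups;			/* entries used below */
	gid_t	cr_groups[SKETCH_NGROUPS];	/* supplementary groups */
};

/* Return 1 if gid is the effective gid or a supplementary group. */
int
sketch_groupmember(gid_t gid, const struct sketch_cred *cred)
{
	int i;

	assert(cred != NULL);
	if (cred->cr_gid == gid)
		return 1;
	for (i = 0; i < cred->cr_ngroups; i++)
		if (cred->cr_groups[i] == gid)
			return 1;
	return 0;
}
```

Checking cr_gid first gives the same answer whether or not the effective gid also appears in the supplementary list.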


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@
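
For readers unfamiliar with the macro, nitems() yields the element count of an array. On OpenBSD it is provided by <sys/param.h>; the fallback definition below is the usual form and is included only so the sketch builds elsewhere. The sample_names array and sample_count() are hypothetical.

```c
#include <stddef.h>

#ifndef nitems
#define nitems(_a)	(sizeof((_a)) / sizeof((_a)[0]))
#endif

/* Hypothetical table, used only to show the macro at work. */
static const char *const sample_names[] = { "alpha", "beta", "gamma" };

/* Number of entries in the table, computed at compile time. */
size_t
sample_count(void)
{
	return nitems(sample_names);
}
```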


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.
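
The "simple struct full of function pointers" approach can be sketched as follows. This is an illustration of the general pattern, not the actual OpenBSD definitions: struct sk_vops, struct sk_vnode, the demo_* functions and sk_vop_open() are all hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct sk_vnode;

/* One table of operations per filesystem, filled in directly. */
struct sk_vops {
	int (*vop_open)(struct sk_vnode *);
	int (*vop_close)(struct sk_vnode *);
};

/* Each vnode points at the table of the filesystem that owns it. */
struct sk_vnode {
	const struct sk_vops *v_op;
	int v_opencount;
};

static int
demo_open(struct sk_vnode *vp)
{
	vp->v_opencount++;
	return 0;
}

static int
demo_close(struct sk_vnode *vp)
{
	vp->v_opencount--;
	return 0;
}

/* A hypothetical filesystem sets its operations by name. */
const struct sk_vops demo_vops = {
	.vop_open = demo_open,
	.vop_close = demo_close,
};

/* Wrapper: dispatch through the table, fail if the operation is unset. */
int
sk_vop_open(struct sk_vnode *vp)
{
	assert(vp != NULL && vp->v_op != NULL);
	if (vp->v_op->vop_open == NULL)
		return EOPNOTSUPP;
	return vp->v_op->vop_open(vp);
}
```

Because every filesystem fills in its own table, operations that would otherwise be inherited (the spec/fifo case mentioned above) have to be listed explicitly in each table.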


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes to correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

backout the buf_acquire() does the bremfree() since all callers
were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''); thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisations issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8 code is much older


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow to pull out an usb stick with ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into an own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing length checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from ours (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches for McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a file's owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.265 14-Dec-2017 deraadt

Don't bother using DETACH_FORCE for the softraid luns at reboot
time; the aggressive mountpoint destruction seems to hit insane
use-after-frees when we are already far on the way down.


# 1.264 14-Dec-2017 deraadt

Give vflush_vnode() a hint about vnodes we don't need to account as "busy".
Change mountpoint to RDONLY a little later. Seems to improve the
rw->ro transition a bit.


# 1.263 11-Dec-2017 bluhm

Format the vnode lists of ddb show mount properly in columns.
OK krw@


# 1.262 11-Dec-2017 deraadt

In uvm Chuck decided backing store would not be allocated proactively
for blocks re-fetchable from the filesystem. However at reboot time,
filesystems are unmounted, and since processes lack backing store they
are killed. Since the scheduler is still running, in some cases init is
killed... which drops us to ddb [noted by bluhm]. Solution is to convert
filesystems to read-only [proposed by kettenis]. The tale follows:
sys_reboot() should pass proc * to MD boot() to vfs_shutdown() which
completes current IO with vfs_busy VB_WRITE|VB_WAIT, then calls VFS_MOUNT()
with MNT_UPDATE | MNT_RDONLY, soon teaching us that *fs_mount() calls a
copyin() late... so store the sizes in vfsconflist[] and move the copyin()
to sys_mount()... and notice nfs_mount copyin() is size-variant, so kill
legacy struct nfs_args3. Next we learn ffs_mount()'s MNT_UPDATE code is
sharp and rusty especially wrt softdep, so fix some bugs and add
~MNT_SOFTDEP to the downgrade. Some vnodes need a little more help,
so tie them to &dead_vnops.

ffs_mount calling DIOCCACHESYNC is causing a bit of grief still but
this issue is separate and will be dealt with in time.
couple hundred reboots by bluhm and myself, advice from guenther and
others at the hut


# 1.261 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.260 31-Jul-2017 florian

Give back some space to the ramdisk by compiling net/radix.c only
if we compile pf, ipsec, pipex or nfsserver.
Suggested by mpi some time ago.
Tweak & OK bluhm
deraadt assumes it's fair


# 1.259 20-Apr-2017 visa

Tweak lock inits to make the system runnable with witness(4)
on amd64 and i386.


# 1.258 04-Apr-2017 deraadt

struct vfsconf is tightly packed, but let's M_ZERO it in case that ever
changes to avoid exposing userland memory.


Revision tags: OPENBSD_6_1_BASE
# 1.257 15-Jan-2017 bluhm

When traversing the mount list, the current mount point is locked
with vfs_busy(). If the FOREACH_SAFE macro is used, the next pointer
is not locked and could be freed by another process. Unless
necessary, do not use _SAFE as it is unsafe. In vfs_unmountall()
the current pointer is actually freed. Add a comment that this
race has to be fixed later.
OK krw@


# 1.256 10-Jan-2017 bluhm

Replace manual for() loops with FOREACH() macro.
OK millert@


# 1.255 10-Jan-2017 bluhm

Remove the unused olddp parameter from function dounmount().
OK mpi@ millert@


# 1.254 28-Sep-2016 kettenis

Cast enum to u_int when doing a bounds check to avoid a clang warning that
the comparison is always true.

ok jca@, tedu@
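
The pattern described in 1.254 can be sketched as follows. This is an
illustrative userland example, not the code from vfs_subr.c; the enum,
table and function names are hypothetical. Casting to an unsigned type
folds the lower-bound test into one comparison, since a negative value
converts to a very large unsigned one.

```c
#include <stddef.h>

/* Hypothetical enum standing in for a kernel enum used as a table index. */
enum shape { SHAPE_DOT, SHAPE_LINE, SHAPE_BOX };

static const char *const shape_names[] = { "dot", "line", "box" };

/* Number of elements in a fixed-size array. */
#define NELEMS(a) (sizeof(a) / sizeof((a)[0]))

/*
 * Return the name for s, or NULL if s is outside the table.
 * The cast to unsigned int makes one comparison reject every
 * out-of-range value, including negative ones.
 */
const char *
shape_name(enum shape s)
{
	if ((unsigned int)s >= NELEMS(shape_names))
		return NULL;
	return shape_names[s];
}
```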


# 1.253 16-Sep-2016 dlg

move the namecache_rb_tree from RB macros to RBT functions.

i had to shuffle the includes a bit. all the knowledge of the RB
tree is now inside vfs_cache.c, and all accesses are via cache_*
functions.


# 1.252 16-Sep-2016 dlg

move buf_rb_bufs from RB macros to RBT functions

i had to shuffle the order of some header bits cos RBT_PROTOTYPE
needs to see what RBT_HEAD produces.


# 1.251 15-Sep-2016 dlg

all pools have their ipl set via pool_setipl, so fold it into pool_init.

the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.

most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.

the manpage and subr_pool.c bits i did myself.

ok tedu@ jmatthew@

@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);


# 1.250 25-Aug-2016 dlg

pool_setipl

ok kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.249 22-Jul-2016 kettenis

Prevent NULL-pointer call for filesystems that don't provide vfs_sysctl
in their vfsops.

Issue reported by Tim Newsham.

ok claudio@, natano@


# 1.248 19-Jun-2016 natano

Remove the lockmgr() API. It is only used by filesystems, where it is a
trivial change to use rrw locks instead. All it needs is LK_* defines
for the RW_* flags.

tested by naddy and sthen on package building infrastructure
input and ok jmc mpi tedu


# 1.247 26-May-2016 natano

The doforce variable isn't modified anywhere. Also, the only filesystem
left using it is fuse. It has been removed from all other filesystems.

ok millert deraadt


# 1.246 26-Apr-2016 natano

copy_statfs_info() is not only used by ufs, but by other filesystems too,
so make sure that all members of mp->mnt_stat.mount_info are copied.
ok stefan


# 1.245 26-Apr-2016 beck

fix off by one in vfs_vnode_print - found by miod
ok deraadt@, krw@


# 1.244 07-Apr-2016 natano

Share clone bitmap between aliased vnodes. This prevents duplicate clone
instance numbers being handed out for the same minor device.
ok mikeb


# 1.243 05-Apr-2016 natano

Increase size of the clone bitmap (revised diff after revert). I have
tested this with fuse _and_ drm on amd64 and macppc. Also tested with
cloning bpf (not in the tree) on macppc.

ok mikeb
"looks correct to me" millert

The original commit message is as follows:

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.242 01-Apr-2016 mikeb

Revert the clone bitmap enlargement change


# 1.241 31-Mar-2016 natano

Increase size of the clone bitmap. A limit of only 64 device clones
turned out to be too low for the upcoming work on cloning bpf. The new
limit is 1024 device clones. As part of the size increase, the bitmap
has been changed to be allocated separately to avoid bloating all device
nodes, as suggested by guenther, millert and deraadt.

ok millert mikeb


# 1.240 19-Mar-2016 natano

Remove the unused flags argument from VOP_UNLOCK().

torture tested on amd64, i386 and macppc
ok beck mpi stefan
"the change looks right" deraadt


# 1.239 14-Mar-2016 krw

Change a bunch of (<blah> *)0 to NULL.

ok beck@ deraadt@


Revision tags: OPENBSD_5_9_BASE
# 1.238 05-Dec-2015 tedu

branches: 1.238.2;
remove stale lint annotations


# 1.237 16-Nov-2015 deraadt

In getdevvp() set the VISTTY flag on a vnode to indicate the underlying
device is a D_TTY device. (Like spec_open, but this sets the flag to
satisfy pre-VOP_OPEN situations)
ok millert semarie tedu guenther


# 1.236 13-Oct-2015 guenther

Initialize va_filerev in vattr_null() to avoid leaking stack garbage;
problem pointed out by Martin Natano (natano (at) natano.net)

Also, stop chaining assignments (foo = bar = baz) in vattr_null().
The exact meaning of those depends on the order of the sizes-and-
signednesses of the lvalues, making them fragile: a statement here
mixed *six* types, but managed to get them in a safe order. Delete
a 20+ year old XXX comment that was almost certainly bemoaning a bug
from when they were in an unsafe order.

ok deraadt@ miod@
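
Why chained assignments are fragile can be shown with a small sketch.
The struct, field names and NOVAL marker below are hypothetical and only
loosely modelled on vattr_null(); they are not the kernel's definitions.
In a chain, each left-hand side receives the value already converted to
the type of the field to its right, so the order of the types matters.

```c
#include <stdint.h>

#define NOVAL (-1)	/* hypothetical "no value" marker */

struct attrs {
	int64_t  size;	/* wide signed field */
	uint16_t mode;	/* narrow unsigned field */
};

/*
 * Chained form: mode is assigned first and becomes 65535. size then
 * receives that converted value (65535), not -1.
 */
void
attrs_null_chained(struct attrs *a)
{
	a->size = a->mode = NOVAL;
}

/* Separate statements: each field is converted from NOVAL on its own. */
void
attrs_null_separate(struct attrs *a)
{
	a->mode = NOVAL;
	a->size = NOVAL;
}
```

With separate statements, size ends up as -1 regardless of how the other
fields are declared.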


# 1.235 08-Oct-2015 mpi

Use the radix API directly and get rid of the function pointers. There
is no point in keeping an unused level of abstraction.

ok mikeb@, claudio@


# 1.234 07-Oct-2015 mpi

rn_inithead() offset argument is now specified in bytes, missed in previous.


# 1.233 04-Sep-2015 mpi

Make every subsystem using a radix tree call rn_init() and pass the
length of the key as argument.

This way every consumer of the radix tree has a chance to explicitly
initialize the shared data structures and no longer rely on another
subsystem to do the initialization.

As a bonus ``dom_maxrtkey'' is no longer used and can die.

ART kernels should now be fully usable because pf(4) and IPSEC properly
initialized the radix tree.

ok chris@, reyk@


Revision tags: OPENBSD_5_8_BASE
# 1.232 16-Jul-2015 claudio

branches: 1.232.4;
Fix rn_match and therefore the exported lookup functions in radix.c
to never return the internal RNF_ROOT nodes. This removes the checks
in the callee to verify that not an RNF_ROOT node was returned.
OK mpi@


# 1.231 12-May-2015 mikeb

Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.230 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.229 02-Mar-2015 guenther

Return EINVAL if the creds supplied for NFS export have a cr_ngroups less
than zero or greater than NGROUPS_MAX

Fixes panic seen by henning@
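
A minimal sketch of the kind of range check 1.229 describes, assuming a
hypothetical limit constant and function name. The real code validates
the credentials passed in for an NFS export and uses the system's
NGROUPS_MAX; this example only shows the shape of the test.

```c
#include <errno.h>

/* Hypothetical stand-in for the system's NGROUPS_MAX. */
#define SKETCH_NGROUPS_MAX 16

/*
 * Return 0 if the group count is usable, EINVAL otherwise.
 * A negative or oversized count must be rejected before it is used
 * as a loop bound or copy length.
 */
int
check_ngroups(int ngroups)
{
	if (ngroups < 0 || ngroups > SKETCH_NGROUPS_MAX)
		return EINVAL;
	return 0;
}
```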


# 1.228 09-Jan-2015 tedu

rename desiredvnodes to initialvnodes. less of a lie. ok beck deraadt


# 1.227 19-Dec-2014 tedu

start retiring the nointr allocator. specify PR_WAITOK as a flag as a
marker for which pools are not interrupt safe. ok dlg


# 1.226 17-Dec-2014 tedu

remove lock.h from uvm_extern.h. another holdover from the simpletonlock
era. fix uvm including c files to include lock.h or atomic.h as necessary.
ok deraadt


# 1.225 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.224 10-Dec-2014 tedu

convert bcopy to memcpy. ok millert


# 1.223 21-Nov-2014 tedu

simple lock is long dead


# 1.222 19-Nov-2014 tedu

delete the KERN_VNODE sysctl. it fails to provide any isolation from the
kernel struct vnode definition, and the only consumer (pstat) still needs
kvm to read much of the required information. no great loss to always use
kvm until there's a better replacement interface.
ok deraadt millert uebayasi


# 1.221 14-Nov-2014 tedu

prefer sizeof(*ptr) to sizeof(struct) for malloc and free
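
The idiom can be illustrated in userland C. This is a sketch with a
hypothetical struct and standard calloc(3)/free(3), not the kernel
allocator interface. Writing sizeof(*w) keeps the allocation size tied
to the pointer's declared type, so it stays correct if that type is
later changed.

```c
#include <stdlib.h>

/* Hypothetical structure used only for this example. */
struct widget {
	int  id;
	char name[32];
};

/* Allocate one zeroed widget; the size follows the type of w. */
struct widget *
widget_alloc(void)
{
	struct widget *w;

	w = calloc(1, sizeof(*w));
	return w;
}

/* Release a widget obtained from widget_alloc(). */
void
widget_free(struct widget *w)
{
	free(w);
}
```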


# 1.220 03-Nov-2014 deraadt

pass size argument to free()
ok doug tedu


# 1.219 13-Sep-2014 doug

Replace all queue *_END macro calls except CIRCLEQ_END with NULL.

CIRCLEQ_* is deprecated and not called in the tree. The other queue types
have *_END macros which were added for symmetry with CIRCLEQ_END. They are
defined as NULL. There's no reason to keep the other *_END macro calls.

ok millert@


Revision tags: OPENBSD_5_6_BASE
# 1.218 13-Jul-2014 tedu

pass the size to free in some of the obvious cases


# 1.217 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.216 10-Jul-2014 mpi

Stop using a shutdown hook for softraid(4) and explicitly shutdown
the disciplines right after vfs_shutdown().

This change is required in order to be able to set `cold' to 1 before
traversing the device (mainbus) tree for DVACT_POWERDOWN when halting
a machine. Yes, this is ugly because sr_shutdown() needs to sleep. But
at least it is obvious and hopefully somebody will be offended and fix
it.

In order to properly flush the cache of the disks under softraid0,
sr_shutdown() now propagates DVACT_POWERDOWN for this particular subtree
of devices which are not under mainbus. As a side effect sd(4) shutdown
hook should no longer be necessary.

Tested by stsp@ and Jean-Philippe Ouellet.

ok deraadt@, stsp@, jsing@


# 1.215 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.214 04-Jun-2014 claudio

While it may be smart to use the radix tree for exports it is not OK to
use the domain specific tree initialisation method for this since that one
is multipath enabled and assumes that the radix node is part of a struct
rtentry. This code uses a different struct and so the multipath modifies
wrong fields and breaks stuff in mysterious ways.
Since we only support AF_INET here anyway simplify the code and only have
one radix_node_head pointer instead of AF_MAX ones.
Fixes NFS server issues reported by rpe@, OK rpe@, guenther@, sthen@


# 1.213 10-Apr-2014 tedu

pull the bufcache freelist code out into separate functions to allow new
algorithms to be tested. in the process, drop support for unused B_AGE and
b_synctime options.
previous versions ok beck deraadt


# 1.212 24-Mar-2014 guenther

Split the API: struct ucred remains the kernel internal structure while
struct xucred becomes the structure for syscalls (mount(2) and nfssvc(2)).

ok deraadt@ beck@


Revision tags: OPENBSD_5_5_BASE
# 1.211 21-Jan-2014 tedu

bzero -> memset


# 1.210 01-Dec-2013 krw

Change 'mountlist' from CIRCLEQ to TAILQ. Be paranoid and
use TAILQ_*_SAFE more than might be needed.

Bulk ports build by sthen@ showed nobody sticking their fingers
so deep into the kernel.

Feedback and suggestions from millert@. ok jsing@


# 1.209 27-Nov-2013 jsing

Defer the v_type initialisation until after the vnode has been purged from
the namecache. Changing the v_type between cache_enter() and cache_purge()
results in bad things happening.

ok beck@


# 1.208 02-Oct-2013 sf

format string fix: b_flags is long


# 1.207 01-Oct-2013 sf

Format string fixes: Cast time_t to long long

and mnt_stat.f_ctime is long long, too


# 1.206 08-Aug-2013 syl

Uncomment kprintf format attributes for sys/kern

tested on vax (gcc3) ok miod@


# 1.205 30-Jul-2013 beck

The previous change was made while chasing nfs performance issues
on Theo's servers - however this was in the context of the buffer flipper
changes and this is now suspect in a continued performance issue with NFS
so back it out for now


Revision tags: OPENBSD_5_4_BASE
# 1.204 24-Jun-2013 beck

Manipulating buffers after sleeping is dangerous. Instead of attempting
to cheat and VOP_BWRITE a buffer, restart the vinvalbuf if we have to wait
for a busy buffer to complete
ok tedu@ guenther@


# 1.203 15-Apr-2013 jsing

Add an f_mntfromspec member to struct statfs, which specifies the name of
the special provided when the mount was requested. This may be the same as
the special that was actually used for the mount (e.g. in the case of a
device node) or it may be different (e.g. in the case of a DUID).

Whilst here, change f_ctime to a 64 bit type and remove the pointless
f_spare members.

Compatibility goo courtesy of guenther@

ok krw@ millert@


Revision tags: OPENBSD_5_3_BASE
# 1.202 17-Feb-2013 miod

Comment out recently added __attribute__((__format__(__kprintf__))) annotations
in MI code; gcc 2.95 does not accept such annotation for function pointer
declarations, only function prototypes.
To be uncommented once gcc 2.95 bites the dust.


# 1.201 09-Feb-2013 miod

Add explicit __attribute__ ((__format__(__kprintf__))) to the functions and
function pointer arguments which are {used as,} wrappers around the kernel
printf function.
No functional change.


# 1.200 17-Nov-2012 beck

Don't map a buffer (and potentially sleep) when invalidating it in vinvalbuf.
This fixes a problem where we could sleep for kva and then our pointers
would not be valid on the next pass through the loop. We do this
by adding buf_acquire_nomap() - which can be used to busy up the buffer
without changing its mapped or unmapped state. We do not need to have
the buffer mapped to invalidate it, so it is sufficient to acquire it
for that. In the case where we write the buffer, we do map the buffer, and
potentially sleep.


# 1.199 01-Oct-2012 guenther

Make groupmember() check the effective gid too, so that the checks are
consistent when the effective gid isn't also a supplementary group.

ok beck@
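
The behaviour described in 1.199 can be sketched like this. The
credential struct and function name are hypothetical simplifications,
not the kernel's struct ucred or groupmember(). The point is that the
effective gid is compared first, then the supplementary group list.

```c
#include <stddef.h>
#include <sys/types.h>

#define SKETCH_NGROUPS 16	/* hypothetical list capacity */

/* Simplified credential: effective gid plus supplementary groups. */
struct cred_sketch {
	gid_t  egid;
	size_t ngroups;
	gid_t  groups[SKETCH_NGROUPS];
};

/* Return 1 if gid is the effective gid or a supplementary group. */
int
groupmember_sketch(gid_t gid, const struct cred_sketch *cr)
{
	size_t i;

	if (cr->egid == gid)
		return 1;
	for (i = 0; i < cr->ngroups && i < SKETCH_NGROUPS; i++)
		if (cr->groups[i] == gid)
			return 1;
	return 0;
}
```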


# 1.198 19-Sep-2012 guenther

vhold() and vdrop() are prototyped in vnode.h, so don't repeat them here

ok beck@


Revision tags: OPENBSD_5_2_BASE
# 1.197 16-Jul-2012 deraadt

oops, need sys/acct.h too


# 1.196 16-Jul-2012 deraadt

Put acct_shutdown() proto in a better place


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.195 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.194 02-Jul-2011 thib

rename VFSDEBUG to VFLCKDEBUG;

prompted by tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.193 21-Dec-2010 thib

Bring back the "End the VOP experiment." diff, naddy's issues were
unrelated, and his alpha is much happier now.

OK deraadt@


# 1.192 06-Dec-2010 jasper

- drop NENTS(), which was yet another copy of nitems().
no binary change


ok deraadt@


# 1.191 10-Sep-2010 thib

Backout the VOP diff until the issues naddy was seeing on alpha (gcc3)
have been resolved.


# 1.190 06-Sep-2010 thib

End the VOP experiment. Instead of the ridiculously complicated operation
vector setup that has questionable features (that have, as far as I can
tell never been used in practice, at least not in OpenBSD), remove all
the gunk and favor a simple struct full of function pointers that get
set directly by each of the filesystems.

Removes gobs of ugly code and makes things simpler by a magnitude.

The only downside of this is that we lose the vnoperate feature so
the spec/fifo operations of the filesystems need to be kept in sync
with specfs and fifofs, this is no big deal as the API itself is pretty
static.

Many thanks to armani@ who pulled an earlier version of this diff to
current after c2k10 and Gabriel Kihlman on tech@ for testing.

Liked by many. "come on, find your balls" deraadt@.


# 1.189 12-Aug-2010 oga

Nuke extra (typoed) extern declaration and a spare newline from the last
commit.

"fix it -- free commit" beck@


# 1.188 11-Aug-2010 beck

Make the number of vnodes correspond to the number of buffers in
buffer cache - we grow them dynamically, but do not attempt to shrink
them if the buffer cache shrinks after growing.

Tested by very many for a long time.

ok oga@ todd@ phessler@ tedu@


Revision tags: OPENBSD_4_8_BASE
# 1.187 29-Jun-2010 tedu

makefstype was only used in ported from freebsd filesystems. fix them
and remove the function. ok thib


# 1.186 28-Jun-2010 claudio

Add the rtable id as an argument to rn_walktree(). Functions like
rt_if_remove_rtdelete() need to know the table id to be able to correctly
remove nodes.
Problem found by Andrea Parazzini and analyzed by Martin Pelikán.
OK henning@


# 1.185 06-May-2010 mpf

Fix favail format string.
From mickey.
OK thib, otto.


Revision tags: OPENBSD_4_7_BASE
# 1.184 17-Dec-2009 oga

if anyone vref()s a VNON vnode, panic. This should not happen.

Written while trying to debug the nfs_inactive panics. Turns out it
never got hit, but it's a useful check to have.

ok beck@


# 1.183 17-Aug-2009 jasper

add 'show all bufs' to show all the buffers in the system

ok beck@ thib@


# 1.182 13-Aug-2009 thib

add a show all vnodes command, use dlg's nice pool_walk() to accomplish
this.

ok beck@, dlg@


# 1.181 12-Aug-2009 beck

Namecache revamp.

This eliminates the large single namecache hash table, and implements
the name cache as a global lru of entries, and a redblack tree in each
vnode. It makes cache_purge actually purge the namecache entries associated
with a vnode when a vnode is recycled (very important for later on actually being
able to resize the vnode pool)

This commit does #if 0 out a bunch of procmap code that was
already broken before this change, but needs to be redone completely.

Tested by many, including in thib's nfs test setup.

ok oga@,art@,thib@,miod@


# 1.180 02-Aug-2009 beck

Dynamic buffer cache support - a re-commit of what was backed out
after c2k9

allows buffer cache to be extended and grow/shrink dynamically

tested by many, ok oga@, "why not just commit it" deraadt@


Revision tags: OPENBSD_4_6_BASE
# 1.179 25-Jun-2009 thib

back out the "buf_acquire() does the bremfree()" change, made since all
callers were doing bremfree() before calling buf_acquire().

This is causing us headache pinning down a bug that showed up
when deraadt@ took cvs to current, and will have to be done
anyway as a preparation for backouts.

OK deraadt@


# 1.178 15-Jun-2009 beck

Back out all the buffer cache changes I committed during c2k9. This reverts three
commits:

1) The sysctl allowing bufcachepercent to be changed at boot time.
2) The change moving the buffer cache hash chains to a red-black tree
3) The dynamic buffer cache (Which depended on the earlier two).

ok on the backout from marco and todd


# 1.177 06-Jun-2009 art

All callers of buf_acquire were doing bremfree before the call.
Just put it in the buf_acquire function.
oga@ ok


# 1.176 03-Jun-2009 beck

Change bufhash from the old grotty hash table to red-black trees hanging
off the vnode.
ok art@, oga@, miod@


Revision tags: OPENBSD_4_5_BASE
# 1.175 10-Nov-2008 pedro

Fix typo in comment, okay jmc@.


# 1.174 01-Nov-2008 deraadt

change vrele() to return an int. if it returns 0, it can guarantee that
it did not sleep. this is used by checkdirs() to avoid having
to restart the allproc walk every time through
idea from tedu, ok thib pedro


Revision tags: OPENBSD_4_4_BASE
# 1.173 05-Jul-2008 thib

re-introduce vdrop() to signal a lost interest in a vnode;

ok art@


# 1.172 14-Jun-2008 mk

A bunch of pool_get() + bzero() -> pool_get(..., .. | PR_ZERO)
conversions that should shave a few bytes off the kernel.

ok henning, krw, jsing, oga, miod, and thib (``even though i usually prefer
FOO|BAR''; thanks for looking.


# 1.171 13-Jun-2008 beck

back out stupid vnode change that was unintentionally included
with biomem and art has no idea how it got there.
ok art@ thib@


# 1.170 12-Jun-2008 deraadt

Bring biomem diff back into the tree after the nfs_bio.c fix went in.
ok thib beck art


# 1.169 11-Jun-2008 deraadt

back out biomem diff since it is not right yet. Doing very large
file copies to nfsv2 causes the system to eventually peg the console.
On the console ^T indicates that the load is increasing rapidly, ddb
indicates many calls to getbuf, there is some very slow nfs traffic
making none (or extremely slow) progress. Eventually some machines
seize up entirely.


# 1.168 10-Jun-2008 beck

Buffer cache revamp

1) remove multiple size queues, introduced as a stopgap.
2) decouple pages containing data from their mappings
3) only keep buffers mapped when they actually have to be mapped
(right now, this is when buffers are B_BUSY)
4) New functions to make a buffer busy, and release the busy flag
(buf_acquire and buf_release)
5) Move high/low water marks and statistics counters into a structure
6) Add a sysctl to retrieve buffer cache statistics

Tested in several variants and beat upon by bob and art for a year. run
accidentally on henning's nfs server for a few months...

ok deraadt@, krw@, art@ - who promises to be around to deal with any fallout


# 1.167 09-Jun-2008 millert

Update access(2) to have modern semantics with respect to X_OK and
the superuser. access(2) will now only indicate success for X_OK on
non-directories if there is at least one execute bit set on the file.
OK deraadt@ thib@ otto@


# 1.166 07-May-2008 thib

remove the vfc_mountroot member from vfsconf and
do appropriate cleanup;

OK deraadt@


# 1.165 07-May-2008 claudio

Implement routing priorities. Every route inserted has a priority assigned
and the one route with the lowest number wins. This will be used by the
routing daemons to resolve the synchronisation issue in case of conflicts.
The nasty bits of this are in the multipath code. If no priority is specified
the kernel will choose an appropriate priority.

Looked at by a few people at n2k8; the code is much older.


# 1.164 06-May-2008 thib

retire vfs_mountroot();

setroot() is now (and has been) responsible for setting
the mountroot function pointer "to the right thing", or
failing to do that, to ffs_mountroot;

based on a discussion/diff from deraadt@.
OK deraadt@


# 1.163 23-Mar-2008 miod

Wrong printf construct.


# 1.162 16-Mar-2008 otto

Widen some struct statfs fields to support large filesystem stats
and add some to be able to support statvfs(2). Do the compat dance
to provide backward compatibility. ok thib@ miod@


Revision tags: OPENBSD_4_3_BASE
# 1.161 13-Dec-2007 blambert

replace calls to ltsleep with tsleep

remove PNORELOCK flag, as PNORELOCK is used for msleep

ok art@ thib@


# 1.160 16-Nov-2007 deraadt

er, the newline is wrong. disappointing.


# 1.159 15-Nov-2007 deraadt

newline before syncing disks is way prettier


# 1.158 29-Oct-2007 chl

MALLOC/FREE -> malloc/free
replace a hard-coded value with M_WAITOK

ok krw@


# 1.157 15-Sep-2007 bluhm

Allow to pull out a usb stick with ffs filesystem while mounted
and a file is written onto the stick. Without these fixes the
machine panics or hangs.
The usb fix calls the callback when the stick is pulled out to free
the associated buffers. Otherwise we have busy buffers for ever
and the automatic unmount will panic.
The change in the scsi layer prevents passing down further dirty
buffers to usb after the stick has been deactivated.
In vfs the automatic unmount has moved from the function vgonel()
to vop_generic_revoke(). Both are called when the sd device's vnode
is removed. In vgonel() the VXLOCK is already held which can cause
a deadlock. So call dounmount() earlier.

ok krw@, I like this marco@, tested by ian@


# 1.156 07-Sep-2007 art

Use M_ZERO in a few more places to shave bytes from the kernel.

eyeballed and ok dlg@


Revision tags: OPENBSD_4_2_BASE
# 1.155 07-Aug-2007 beck

A few changes to deal with multi-user performance issues seen. this
brings us back roughly to 4.1 level performance, although this is still
far from optimal as we have seen in a number of cases. This change

1) puts a lower bound on buffer cache queues to prevent starvation
2) fixes the code which looks for a buffer to recycle
3) reduces the number of vnodes back to 4.1 levels to avoid complex
performance issues better addressed after 4.2

ok art@ deraadt@, tested by many


# 1.154 01-Jun-2007 beck

decouple the allocated number of vnodes from the "desiredvnodes" variable
which is used to size a zillion other things that increasing excessively
has been shown to cause problems - so that we may incrementally look at
increasing those other things without making the kernel unusable.

This diff effectively increases the number of vnodes back to the number
of buffers, as in the earlier dynamic buffer cache commits, without
increasing anything else (namecache, softdeps, etc. etc.)

ok pedro@ tedu@ art@ thib@


# 1.153 31-May-2007 tedu

remove some silly casts, no real change


# 1.152 31-May-2007 pedro

NFSv2 cannot cope with a big number of vnodes, so revert to NPROC-based
calculation until the problem is fixed, okay beck@ art@


# 1.151 30-May-2007 beck

back out vfs change - todd fries has seen afs issues, and I'm suspicious
this can cause other problems.


# 1.150 29-May-2007 beck

Step one of some vnode improvements - change getnewvnode to
actually allocate "desiredvnodes" - add a vdrop to un-hold a vnode held
with vhold, and change the name cache to make use of vhold/vdrop, while
keeping track of which vnodes are referred to by which cache entries to
correctly hold/drop vnodes when the cache uses them.
ok thib@, tedu@, art@


# 1.149 28-May-2007 thib

de-inline vref();

ok pedro@


# 1.148 26-May-2007 pedro

Dynamic buffer cache. Initial diff from mickey@, okay art@ beck@ toby@
deraadt@ dlg@.


# 1.147 26-May-2007 thib

Nuke a bunch of simplelocks and associated goo.

ok art@


# 1.146 17-May-2007 thib

Collapse struct v_selectinfo in struct vnode, remove the
simplelock and reuse the name for the selinfo member.
Clean-up accordingly.

ok tedu@,art@


# 1.145 09-May-2007 deraadt

kinfo_vgetfailed has not been used for > 8 years


# 1.144 13-Apr-2007 thib

Move the declaration of VN_KNOTE() into vnode.h instead of having
multiple defines all over;

ok tedu@


# 1.143 13-Apr-2007 bluhm

Remove comments talking about vnode interlock. No binary change.
ok thib


# 1.142 11-Apr-2007 thib

Remove the simplelock argument from vrecycle();

ok pedro@, sturm@


# 1.141 21-Mar-2007 thib

Remove the v_interlock simplelock from the vnode structure.
Zap all calls to simple_lock/unlock() on it (those calls are
#defined away though). Remove the LK_INTERLOCK from the calls
to vn_lock() and cleanup the filesystems which implement VOP_LOCK().
(by removing the v_interlock from their calls to lockmgr()).

ok pedro@, art@, tedu@


# 1.140 12-Mar-2007 mickey

better desiredvnodes not based on maxusers; pedro@ deraadt@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.139 20-Feb-2007 deraadt

for vfsconf sysctl, do not leak kernel sensors out to userland
ok art thib


# 1.138 17-Feb-2007 mickey

fix ddb buf printing for daddr_t growth to 64bit;
from juan hernandez gonzalez; tested by bluhm@


# 1.137 14-Feb-2007 jsg

Consistently spell FALLTHROUGH to appease lint.
ok kettenis@ cloder@ tom@ henning@


# 1.136 13-Feb-2007 mickey

fix ddb buf print


# 1.135 20-Nov-2006 tom

vprint() should be defined if DIAGNOSTIC || DEBUG. Noticed by (and
original diff from) Jake < antipsychic (at) hotmail.com >. Discussed
with Mickey and Miod.

ok miod@ pedro@


# 1.134 30-Oct-2006 thib

use vp->v_type to index into vtypes rather than vp->v_tag,
fixing odd output in the 'show vnode' ddb code.

ok mickey@


Revision tags: OPENBSD_4_0_BASE
# 1.133 11-Jul-2006 mickey

add mount/vnode/buf and softdep printing commands; tested on a few archs and will make pedro happy too (;


# 1.132 09-Jul-2006 pedro

Fix tab where space was meant


# 1.131 08-Jul-2006 thib

vinvalbuf() debugging aid, under VFSDEBUG.

ok pedro@


# 1.130 03-Jul-2006 mickey

also print vp in vprint (useful for debugging); pedro@ ok


# 1.129 25-Jun-2006 sturm

rename vfs_busy() flags VB_UMIGNORE/VB_UMWAIT to VB_NOWAIT/VB_WAIT

requested by and ok pedro


# 1.128 14-Jun-2006 sturm

move vfs_busy() to rwlocks and properly hide the locking api from vfs

ok tedu, pedro


# 1.127 02-Jun-2006 pedro

Add a clonable devices implementation. Hacked along with thib@, input
from krw@ and toby@, subliminal prodding from dlg@, okay deraadt@.


# 1.126 28-May-2006 pedro

Spacing in vfs_sysctl()


# 1.125 07-May-2006 sturm

forgot to remove this sentence from the comment
ok pedro


# 1.124 30-Apr-2006 sturm

remove the simplelock argument from vfs_busy() which is currently not
used and will never be used this way in VFS

requested by and ok pedro, ok krw, biorn


# 1.123 19-Apr-2006 pedro

Remove unused mount list simple_lock() goo


Revision tags: OPENBSD_3_9_BASE
# 1.122 09-Jan-2006 pedro

Put vprint() under DIAGNOSTIC, as to save space in generated ramdisks.
Inspiration from miod@, okay deraadt@. Tested on i386, macppc and amd64.


# 1.121 30-Nov-2005 pedro

No need for vfs_busy() and vfs_unbusy() to take a process pointer
anymore. Testing by jolan@, thanks.


# 1.120 24-Nov-2005 pedro

Remove kernfs, okay deraadt@.


# 1.119 19-Nov-2005 pedro

Remove unnecessary lockmgr() archaism that was costing too much in terms
of panics and bugfixes. Access curproc directly, do not expect a process
pointer as an argument. Should fix many "process context required" bugs.
Incentive and okay millert@, okay marc@. Various testing, thanks.


# 1.118 18-Nov-2005 pedro

Work around yet another race on non-locking file systems: when calling
VOP_INACTIVE() in vrele() and vput(), we may sleep. Since there's no
locking of any kind, someone can vget() the vnode and vrele() it while
we sleep, beating us in getting the vnode on the free list.


# 1.117 08-Nov-2005 pedro

Missed one use of 'register'


# 1.116 07-Nov-2005 pedro

Use ANSI function declarations and deregister, no binary change


# 1.115 19-Oct-2005 pedro

Remove v_vnlock from struct vnode, okay krw@ tedu@


Revision tags: OPENBSD_3_8_BASE
# 1.114 26-May-2005 pedro

branches: 1.114.2;
RIP stackable filesystems, ok marius@ tedu@, discussed with deraadt@


# 1.113 24-May-2005 pedro

when a device vnode associated with a mount point disappears, mark the
filesystem as doomed and unmount it


# 1.112 22-May-2005 pedro

put VLOCKSWORK stuff under a single option, VFSDEBUG


# 1.111 01-May-2005 pedro

check for VBIOONFREELIST and VBIOONSYNCLIST in vprint(), okay marius@


# 1.110 24-Mar-2005 tedu

always good to check for invalid values. ok marius pedro


Revision tags: OPENBSD_3_7_BASE
# 1.109 10-Jan-2005 pedro

branches: 1.109.2;
change vget() to only put a vnode back on the free lists if it actually
was there. should fix a (rare) corner case introduced by my last commit.
ok tedu@, testing by joris, moritz@, danh@, otto@ and krw@. many thanks.


# 1.108 31-Dec-2004 pedro

sprinkle some more list macros in here


# 1.107 31-Dec-2004 pedro

when releasing a vnode, make it inactive before sticking it to one of
the free lists. should fix some races on filesystems that don't have
locks, such as nfs. also, it allows for a more straightforward way of
releasing vnodes (nodes that are going to be recycled don't have to be
moved to the head of the list). tested by many, thanks.

ok tedu@ deraadt@


# 1.106 28-Dec-2004 deraadt

clean dirty accident by miod


# 1.105 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


# 1.104 09-Dec-2004 pedro

minor spacing/styling nits


Revision tags: OPENBSD_3_6_BASE
# 1.103 04-Aug-2004 art

Uninline vputonfreelist.


# 1.102 04-Aug-2004 pedro

better comments


# 1.101 02-Aug-2004 pedro

- check for LK_NOWAIT on vget()
- use ltsleep() instead of the unlock + sleep combo

ok art@, inspiration from free/net


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.100 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


# 1.99 27-May-2004 tedu

shutdown accounting before shutting down vfs. should prevent some panics.
ok david@ millert@ (iirc)


# 1.98 25-Apr-2004 itojun

radix tree with multipath support. from kame. deraadt ok
user visible changes:
- you can add multiple routes with same key (route add A B then route add A C)
- you have to specify gateway address if there are multiple entries on the table
(route delete A B, instead of route delete A)
kernel change:
- radix_node_head has an extra entry
- rnh_deladdr takes extra argument

TODO:
- actually take advantage of multipath (rtalloc -> rtalloc_mpath)


Revision tags: OPENBSD_3_5_BASE
# 1.97 09-Jan-2004 tedu

back out vnode parents. weird breakage found in ports tree


# 1.96 06-Jan-2004 tedu

keep track of a vnode's parent dir. ufs only, and unused atm, but
the fun stuff is coming. testing by brad.


Revision tags: OPENBSD_3_4_BASE
# 1.95 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.94 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: UBC_SYNC_A
# 1.93 13-May-2003 naddy

Back out previous change that causes "vnode table full" for large-scale
file operations.


# 1.92 13-May-2003 tedu

do reclaim LAYER vnodes, no good reason not to


# 1.91 06-May-2003 tedu

attempt to put a process's cwd back in place after a forced umount.
won't always work, but it's the best we can do for now. this covers
at least some of the failure cases the previous commit to vfs_lookup.c
checks for.
ok weingart@


# 1.90 01-May-2003 tedu

several related changes:
vfs_subr.c:
add a missing simple_lock_init for vnode interlock
try to avoid reclaiming locked or layered vnodes
initialize vnlock pointer to NULL
remove old code to free vnlock, never used
lockinit the new vnode lock
vfs_syscalls.c:
support for VLAYER flag
vnode_if.sh:
support for splitting VDESC flags
vnode_if.src:
split VDESC flags
WILLPUT is the combination of WILLRELE and WILLUNLOCK
most uses for WILLRELE become WILLPUT
vnode.h:
add v_lock to struct vnode
add VLAYER flag
update for new VDESC flags


# 1.89 06-Apr-2003 ho

strcat/strcpy/sprintf cleanup. krw@, anil@ ok. art@ tested sparc64.


Revision tags: OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_B
# 1.88 11-Aug-2002 art

Add two missing vfs_busy calls in the failure path of sysctl_vnode.
Found by aaron@

NOTE - I think we need a mount-point iterator just like we have
NOTE - vfs_mount_foreach_vnode. (btw. why don't we use foreach_vnode in here?)


# 1.87 12-Jul-2002 art

Change the locking on the mountpoint slightly. Instead of using mnt_lock
to get shared locks for lookup and get the exclusive lock only with
LK_DRAIN on unmount and do the real exclusive locking with flags in
mnt_flags, we now use shared locks for lookup and an exclusive lock for
unmount.

This is accomplished by slightly changing the semantics of vfs_busy.
Old vfs_busy behavior:
- with LK_NOWAIT set in flags, a shared lock was obtained if the
mountpoint wasn't being unmounted, otherwise we just returned an error.
- with no flags, a shared lock was obtained if the mountpoint wasn't being
unmounted, otherwise we slept until the unmount was done and returned
an error.
LK_NOWAIT was used for sync(2) and some statistics code where it isn't really
critical that we get the correct results.
0 was used in fchdir and lookup where it's critical that we get the right
directory vnode for the filesystem root.

After this change vfs_busy keeps the same behavior for no flags and LK_NOWAIT.
But if some other flags are passed into it, they are passed directly
into lockmgr (actually LK_SLEEPFAIL is always added to those flags because
if we sleep for the lock, that means someone was holding the exclusive lock
and the exclusive lock is only held when the filesystem is being unmounted).

More changes:
dounmount must now be called with the exclusive lock held. (before this
the caller was supposed to hold the vfs_busy lock, but that wasn't always
true).
Zap some (now) unused mount flags.
And the highlight of this change:
Add some vfs_busy calls to match some vfs_unbusy calls, especially in
sys_mount. (lockmgr doesn't detect the case where we release a lock no one
holds (it will do that soon)).

If you've seen hangs on reboot with mfs this should solve it (I repeat this
for the fourth time now, but this time I spent two months fixing and
redesigning this and reading the code so this time I must have gotten
this right).


# 1.86 16-Jun-2002 miod

When processing the KERN_VNODE sysctl, the kernel builds a packed structure,
while pstat(8) expects a C structure abiding the regular structure packing
rules. This caused pstat -v to break on powerpc.

Unbreak the confusion by defining the structure in a common header file,
and having the kernel use it.

ok millert@ deraadt@


# 1.85 08-Jun-2002 art

Use ltsleep in vfs_busy.


# 1.84 16-May-2002 art

sprinkle some splassert(IPL_BIO) in some functions that are commented as "should be called at splbio()"


Revision tags: OPENBSD_3_1_BASE
# 1.83 14-Mar-2002 millert

First round of __P removal in sys


# 1.82 04-Feb-2002 miod

Cleanup mountroot-related definitions.


# 1.81 23-Jan-2002 art

Pool deals fairly well with physical memory shortage, but it doesn't deal
well (not at all) with shortages of the vm_map where the pages are mapped
(usually kmem_map).

Try to deal with it:
- group all information the backend allocator for a pool in a separate
struct. The pool will only have a pointer to that struct.
- change the pool_init API to reflect that.
- link all pools allocating from the same allocator on a linked list.
- Since an allocator is responsible to wait for physical memory it will
only fail (waitok) when it runs out of its backing vm_map, carefully
drain pools using the same allocator so that va space is freed.
(see comments in code for caveats and details).
- change pool_reclaim to return if it actually succeeded to free some
memory, use that information to make draining easier and more efficient.
- get rid of PR_URGENT, no one uses it.


# 1.80 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks of intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.79 10-Dec-2001 art

branches: 1.79.2;
No need to initialize the uobj on every getnewvnode. Just do
it when allocating. Add some improved diagnostics.


# 1.78 10-Dec-2001 art

Big cleanup inspired by NetBSD with some parts of the code from NetBSD.
- get rid of VOP_BALLOCN and VOP_SIZE
- move the generic getpages and putpages into miscfs/genfs
- create a genfs_node which must be added to the top of the private portion
of each vnode for filesystems that want to use genfs_{get,put}pages
- rename genfs_mmap to vop_generic_mmap


# 1.77 10-Dec-2001 art

Merge in struct uvm_vnode into struct vnode.


# 1.76 05-Dec-2001 art

Break out the part that lowers v_holdcnt in brelvp into its own function
and make it and vhold into public interfaces.


# 1.75 29-Nov-2001 art

Ooops. Revert part of the last commit that was completely wrong and wasn't supposed to be committed.


# 1.74 29-Nov-2001 art

Correctly handle b_vp with bgetvp and brelvp in {get,put}pages.
Prevents panics caused by vnodes being recycled under our feet.


# 1.73 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.72 21-Nov-2001 csapuntz

Added vfs_isbusy. Useful for verifying that a mount point is locked
Added vfs_mount_foreach_vnode. Several places in the code seem to want to
traverse the mount list and they all seem to handle locking differently.
Centralize traversing the mount list in one place so that we only need
to get the locking right once.


# 1.71 15-Nov-2001 art

Don't zero v_bioflag when recycling a vnode in getnewvnode.
Sometimes the vnode can be on the syncers list. While that is a bug, it's
just a minor annoyance. A vnode on a syncer worklist without VBIOONSYNCLIST
set is a disaster.


# 1.70 12-Nov-2001 art

Remove unnecessary check for NULL vnode in reassignbuf.


# 1.69 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.68 02-Oct-2001 csapuntz

Bounds check index into routing table. Thanks to Ken Ashcraft of Stanford
for finding this bug.


# 1.67 19-Sep-2001 csapuntz

Get rid of B_VFLUSH. Not relevant after the end of the AGE queue.


# 1.66 16-Sep-2001 millert

Add some missing lengths checks when passing data from userland to
kernel. Based on NetBSD patches.


# 1.65 02-Aug-2001 assar

(vput): make panic strings actually say vput instead of vrele


# 1.64 26-Jul-2001 miod

Typo.


# 1.63 27-Jun-2001 art

remove old vm


# 1.62 22-Jun-2001 deraadt

KNF


# 1.61 05-Jun-2001 provos

send note_revoke to knotes when vnode goes away, okay art@


# 1.60 16-May-2001 art

indentation nit.


# 1.59 29-Apr-2001 art

cleanup, remove incorrect comment


Revision tags: OPENBSD_2_9_BASE
# 1.58 22-Mar-2001 art

branches: 1.58.2;
Use pool for allocating vnodes.
Even though vnodes are never freed (could be) this gives us big memory and
kmem_map savings.


# 1.57 21-Mar-2001 art

uvm_vnp_terminate expects the vnode to be locked.
Why didn't LOCKDEBUG catch this?


# 1.56 16-Mar-2001 art

Oops. fix thinko in last.


# 1.55 16-Mar-2001 art

Use CIRCLEQ macros for mountlist.


# 1.54 16-Mar-2001 art

Initialize the mountlist_slock.


# 1.53 26-Feb-2001 csapuntz

Move v_writecount test back to its original place


# 1.52 26-Feb-2001 csapuntz

Make ref counts 32-bit unsigned ints as opposed to a potpourri of longs and
ints.


# 1.51 24-Feb-2001 csapuntz

Cleanup of vnode interface continues. Get rid of VHOLD/HOLDRELE.
Change VM/UVM to use buf_replacevnode to change the vnode associated
with a buffer.

Add v_bioflag for flags written in interrupt handlers
(and read at splbio, though not strictly necessary)

Add vwaitforio and use it instead of a while loop of v_numoutput.

Fix race conditions when manipulating the vnode free list


# 1.50 23-Feb-2001 csapuntz

Remove the clustering fields from the vnodes and place them in the
file system inode instead


# 1.49 21-Feb-2001 csapuntz

Latest soft updates from FreeBSD/Kirk McKusick

Snapshot-related code has been commented out.


# 1.48 08-Feb-2001 mickey

do not print stuff when not verbose


Revision tags: OPENBSD_2_8_BASE
# 1.47 27-Sep-2000 art

branches: 1.47.2;
Minimal optimization.


# 1.46 17-Jul-2000 art

Don't wait for B_READ buffers on shutdown.
From NetBSD.


Revision tags: OPENBSD_2_7_BASE
# 1.45 25-Apr-2000 csapuntz

Use CIRCLEQ_FOREACH


# 1.44 21-Apr-2000 mickey

see if there is any meaning under curproc before using &proc0 in vfs_syncwait(); from art@


Revision tags: SMP_BASE kame_19991208
# 1.43 05-Dec-1999 art

branches: 1.43.2;
With soft updates, some buffers will be remarked as dirty after being written.
Handle this when syncing filesystems when unmounting.
From NetBSD.


# 1.42 05-Dec-1999 art

Use VONSYNCLIST to see if we should remove a vnode from the sync list instead
of looking at v_dirtyblkhd.


Revision tags: OPENBSD_2_6_BASE
# 1.41 20-Aug-1999 art

more paranoid check of the refcount in vfs_register


# 1.40 08-Aug-1999 niklas

From NetBSD; vdevgone, used for revoking access to device nodes when they
disappear (detach is coming).


# 1.39 31-May-1999 millert

New struct statfs with mount options. NOTE: this replaces statfs(2),
fstatfs(2), and getfsstat(2) so you will need to build a new kernel
before doing a "make build" or you will get "unimplemented syscall" errors.

The new struct statfs has the following features:
o Has a u_int32_t flags field--now softdep can have a real flag.

o Uses u_int32_t instead of longs (nicer on the alpha). Note: the man
page used to lie about setting invalid/unused fields to -1. SunOS does
that but our code never has.

o Gets rid of f_type completely. It hasn't been used since NetBSD 0.9
and having it there but always 0 is confusing. It is conceivable
that this may cause some old code to not compile but that is better
than silently breaking.

o Adds a mount_info union that contains the FSTYPE_args struct. This
means that "mount" can now tell you all the options a filesystem was
mounted with. This is especially nice for NFS.

Other changes:
o The linux statfs emulation didn't convert between BSD fs names
and linux f_type numbers. Now it does, since the BSD f_type
number is useless to linux apps (and has been removed anyway)

o FreeBSD's struct statfs is different from ours (both old and new)
and thus needs conversion. Previously, the OpenBSD syscalls
were used without any real translation.

o mount(8) will now show extra info when invoked with no arguments.
However, to see *everything* you need to use the -v (verbose) flag.


# 1.38 06-May-1999 mickey

factor out sync+wait code into vfs_syncwait() routine for
applications in system like power management and such.
art@ finally said `commit it'


# 1.37 30-Apr-1999 art

in vput, simple_unlock the v_interlock before VOP_INACTIVE, not after


Revision tags: OPENBSD_2_5_BASE
# 1.36 11-Mar-1999 deraadt

backout


# 1.35 11-Mar-1999 deraadt

back out unapproved changes


# 1.34 11-Mar-1999 mickey

indent


# 1.33 11-Mar-1999 mickey

factor sync+wait operation out into a separate function.


# 1.32 26-Feb-1999 art

adapt to uvm vnode pager


# 1.31 19-Feb-1999 art

add vfs_register and vfs_unregister functions


# 1.30 28-Dec-1998 art

simple_lock fixes


# 1.29 22-Dec-1998 art

deconfuse vprint, print holdcount, not refcount when we are talking about holdcnt


# 1.28 10-Dec-1998 art

vfs_unmountall: retry to unmount all remaining filesystems when one unmount failed


# 1.27 05-Dec-1998 csapuntz

Framework for generating automatic test code for locking discipline
in DIAGNOSTIC mode.

Added documentation to vfs_subr.c on locking needs of a couple calls.

Improvements to the vinvalbuf patch. We need to start over after we
let our pants down.


# 1.26 04-Dec-1998 csapuntz

VFS-Lite2 requires stricter locking around vnode buffer queues. vinvalbuf
had insufficient protection


# 1.25 20-Nov-1998 art

vn_lock already unlocks the simple lock. don't do that again


# 1.24 12-Nov-1998 csapuntz

Integrate latest soft updates patches for McKusick.

Integrate cleaner ffs mount code from FreeBSD. Most notably, this mount
code prevents you from mounting an unclean file system read-write.


Revision tags: OPENBSD_2_4_BASE
# 1.23 13-Oct-1998 csapuntz

In vrele, vget, reinstate the following order

- VNODE gets placed on free list
- VOP_INACTIVE is called

This was the original order. It was changed in an earlier patch due to
a race condition in non-locking FSes (like NFS) between getnewvnode
and inactive. However, the modified order had its own race conditions, so
it turned out not to be a good choice.


# 1.22 30-Aug-1998 csapuntz

Cleanup.

Error diagnostics in vputonfreelist to catch violations of assumptions.


# 1.21 06-Aug-1998 csapuntz

Rename vop_revoke, vn_bwrite, vop_noislocked, vop_nolock, vop_nounlock
to be vop_generic_revoke, vop_generic_bwrite, vop_generic_islocked,
vop_generic_lock and vop_generic_unlock.

Create vop_generic_abortop and propagate change to all file systems.

Fix PR/371.

Get rid of locking in NULLFS (should be mostly unnecessary now except for
forced unmounts).


# 1.20 25-Apr-1998 niklas

typo


Revision tags: OPENBSD_2_3_BASE
# 1.19 20-Feb-1998 niklas

typo


# 1.18 11-Jan-1998 csapuntz

Fix a couple spinlock references. More code motion in vfs_subr.c


# 1.17 10-Jan-1998 csapuntz

Broke up vfs_subr.c which was getting a bit huge. We now have separate files
for the syncer daemon as well as default VOP_*.


# 1.16 24-Nov-1997 niklas

Fix non-DIAGNOSTIC (and non-COMPAT*) compilation


# 1.15 07-Nov-1997 csapuntz

Fixed hang on shutdown
Disabled vop_nolock for now. Filesystems still need to be cleaned up.


# 1.14 06-Nov-1997 csapuntz

DEBUG now compiles


# 1.13 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.12 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.11 06-Oct-1997 csapuntz

VFS Lite2 Changes


Revision tags: OPENBSD_2_1_BASE
# 1.10 25-Apr-1997 deraadt

proper mask check; mike@fast.cs.utah.edu


# 1.9 14-Apr-1997 tholo

Minor performance enhancements from NetBSD


# 1.8 24-Feb-1997 niklas

OpenBSD tags


# 1.7 11-Feb-1997 millert

Add fs_id support and random inode generation numbers for ffs.


# 1.6 04-Jan-1997 kstailey

spec_advlock() via lf_advlock()


Revision tags: OPENBSD_2_0_BASE
# 1.5 08-Aug-1996 tholo

Make {,f}chown(2) behaviour POSIX.1 compliant with SUID / SGID files
Enable CTL_FS processing by sysctl(3)
Add CTL_FS request to disable clearing SUID / SGID bit when a file's owner
or group is changed by root
Make sysctl(8) understand CTL_FS requests


# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 29-Feb-1996 niklas

From NetBSD: Merge with NetBSD 960217


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision