History log of /freebsd-10.0-release/sys/sys/vnode.h
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# 259065 07-Dec-2013 gjb

- Copy stable/10 (r259064) to releng/10.0 as part of the
10.0-RELEASE cycle.
- Update __FreeBSD_version [1]
- Set branch name to -RC1

[1] 10.0-CURRENT __FreeBSD_version value ended at '55', so
start releng/10.0 at '100' so the branch is started with
a value ending in zero.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation

# 256281 10-Oct-2013 gjb

Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


# 253106 09-Jul-2013 kib

There are several code sequences like
vfs_busy(mp);
vfs_write_suspend(mp);
which are problematic if other thread starts unmount between two
calls. The unmount starts a write, while vfs_write_suspend() drain
writers. On the other hand, unmount drains busy references, causing
the deadlock.

Add a flag argument to vfs_write_suspend and require the callers of it
to specify VS_SKIP_UNMOUNT flag, when the call is performed not in the
mount path, i.e. the covered vnode is not locked. The suspension is
not attempted if VS_SKIP_UNMOUNT is specified and unmount is in
progress.

Reported and tested by: Andreas Longwitz <longwitz@incore.de>
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks


# 248561 20-Mar-2013 mckusick

When renaming a directory from one parent directory to another,
we need to call ufs_checkpath() to walk from our new location to
the root of the filesystem to ensure that we do not encounter
ourselves along the way. Until now, we accomplished this by reading
the ".." entries of each directory in our path until we reached
the root (or encountered an error). This change tries to avoid the
I/O of reading the ".." entries by first looking them up in the
name cache and only doing the I/O when the name cache lookup fails.

Reviewed by: kib
Tested by: Peter Holm
MFC after: 4 weeks


# 248319 15-Mar-2013 kib

Implement the helper function vn_io_fault_pgmove(), intended to use by
the filesystem VOP_READ() and VOP_WRITE() implementations in the same
way as vn_io_fault_uiomove() over the unmapped buffers. Helper
provides the convenient wrapper over the pmap_copy_pages() for struct
uio consumers, taking care of the TDP_UIOHELD situations.

Sponsored by: The FreeBSD Foundation
Tested by: pho
MFC after: 2 weeks


# 247388 27-Feb-2013 kib

The softdep freeblks workitem might hold a reference on the dquot.
Current dqflush() panics when a dquot with with non-zero refcount is
encountered. The situation is possible, because quotas are turned off
before softdep workitem queue if flushed, due to the quota file writes
might create softdep workitems.

Make the encountering an active dquot in dqflush() not fatal, return
the error from quotaoff() instead. Ignore the quotaoff() failures
when ffs_flushfiles() is called in the course of softdep_flushfiles()
loop, until the last iteration. At the last loop, the quotas must be
closed, and because SU workitems should be already flushed, the
references to dquot are gone.

Sponsored by: The FreeBSD Foundation
Reported and tested by: pho
Reviewed by: mckusick
MFC after: 2 weeks


# 245410 14-Jan-2013 kib

Rearrange the struct bufobj and struct vnode layouts to reduce
padding. On the amd64 kernel with INVARIANTS turned off, size of the
struct vnode is reduced from 496 to 472 bytes, saving 24 bytes of
memory and KVA per vnode.

Noted and reviewed by: peter
Tested by: pho
Sponsored by: The FreeBSD Foundation


# 245406 14-Jan-2013 kib

Add exported vfs_hash_index() function, which calculates the canonical
pre-masked hash for the given vnode. The function assumes that
vp->v_hash is initialized by the filesystem vnode instantiation
function. At the moment, it is only done if filesystem uses
vfs_hash_insert().

Reviewed by: peter
Tested by: peter, pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 5 days


# 245286 11-Jan-2013 kib

Add flags argument to vfs_write_resume() and remove
vfs_write_resume_flags().

Sponsored by: The FreeBSD Foundation


# 244925 01-Jan-2013 kib

The process_deferred_inactive() function locks the vnodes of the ufs
mount, which means that is must not be called while the snaplock is
owned. The vfs_write_resume(9) does call the function as the
VFS_SUSP_CLEAN() method, which is too early and falls into the region
still protected by snaplock.

Add yet another flag for the vfs_write_resume_flags() to avoid calling
suspension cleanup handler after the suspend is lifted, and use it in
the ffs_snapshot() call to vfs_write_resume.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 244795 28-Dec-2012 kib

Make it possible to atomically resume writes on the mount and account
the write start, by adding a variation of the vfs_write_resume(9)
which accepts flags.

Use the new function to prevent a deadlock between parallel suspension
and snapshotting a UFS mount. The ffs_snapshot() code performed
vfs_write_resume() followed by vn_start_write() while owning the
snaplock. If the suspension intervene between resume and
vn_start_write(), the deadlock occured after the suspending thread
tried to lock the snaplock, most typically during the write in the
ffs_copyonwrite().

Reported and tested by: Andreas Longwitz <longwitz@incore.de>
Reviewed by: mckusick
MFC after: 2 weeks
X-MFC-note: make the vfs_write_resume(9) function a macro after the MFC,
in HEAD


# 243612 27-Nov-2012 pjd

- Add NOCAPCHECK flag to namei that allows lookup to work even if the process
is in capability mode.
- Add VN_OPEN_NOCAPCHECK flag for vn_open_cred() to will ne converted into
NOCAPCHECK namei flag.

This functionality will be used to enable core dumps for sandboxed processes.

Reviewed by: rwatson
Obtained from: WHEEL Systems
MFC after: 2 weeks


# 241556 14-Oct-2012 kib

Add a KPI to allow to reserve some amount of space in the numvnodes
counter, without actually allocating the vnodes. The supposed use of
the getnewvnode_reserve(9) is to reclaim enough free vnodes while the
code still does not hold any resources that might be needed during the
reclamation, and to consume the slack later for getnewvnode() calls
made from the innards. After the critical block is finished, the
caller shall free any reserve left, by getnewvnode_drop_reserve(9).

Reviewed by: avg
Tested by: pho
MFC after: 1 week


# 236762 08-Jun-2012 jhb

Split the second half of vn_open_cred() (after a vnode has been found via
a lookup or created via VOP_CREATE()) into a new vn_open_vnode() function
and use this function in fhopen() instead of duplicating code from
vn_open_cred() directly.

Tested by: pho
Reviewed by: kib
MFC after: 2 weeks


# 236321 30-May-2012 kib

vn_io_fault() is a facility to prevent page faults while filesystems
perform copyin/copyout of the file data into the usermode
buffer. Typical filesystem hold vnode lock and some buffer locks over
the VOP_READ() and VOP_WRITE() operations, and since page fault
handler may need to recurse into VFS to get the page content, a
deadlock is possible.

The facility works by disabling page faults handling for the current
thread and attempting to execute i/o while allowing uiomove() to
access the usermode mapping of the i/o buffer. If all buffer pages are
resident, uiomove() is successfull and request is finished. If EFAULT
is returned from uiomove(), the pages backing i/o buffer are faulted
in and held, and the copyin/out is performed using uiomove_fromphys()
over the held pages for the second attempt of VOP call.

Since pages are hold in chunks to prevent large i/o requests from
starving free pages pool, and since vnode lock is only taken for
i/o over the current chunk, the vnode lock no longer protect atomicity
of the whole i/o request. Use newly added rangelocks to provide the
required atomicity of i/o regardind other i/o and truncations.

Filesystems need to explicitely opt-in into the scheme, by setting the
MNTK_NO_IOPF struct mount flag, and optionally by using
vn_io_fault_uiomove(9) helper which takes care of calling uiomove() or
converting uio into request for uiomove_fromphys().

Reviewed by: bf (comments), mdf, pjd (previous version)
Tested by: pho
Tested by: flo, Gustau P?rez <gperez entel upc edu> (previous version)
MFC after: 2 months


# 236317 30-May-2012 kib

Add a rangelock implementation, intended to be used to range-locking
the i/o regions of the vnode data space. The implementation is quite
simple-minded, it uses the list of the lock requests, ordered by
arrival time. Each request may be for read or for write. The
implementation is fair FIFO.

MFC after: 2 month


# 236312 30-May-2012 kib

Clarify that the v_lockf is advisory lock list.

MFC after: 3 days


# 236043 26-May-2012 kib

Add a vn_bmap_seekhole(9) vnode helper which can be used by any
filesystem which supports VOP_BMAP(9) to implement SEEK_HOLE/SEEK_DATA
commands for lseek(2).

MFC after: 2 weeks


# 234607 23-Apr-2012 trasz

Remove unused thread argument to vrecycle().

Reviewed by: kib


# 234605 23-Apr-2012 trasz

Remove unused thread argument from vtruncbuf().

Reviewed by: kib


# 234482 20-Apr-2012 mckusick

This change creates a new list of active vnodes associated with
a mount point. Active vnodes are those with a non-zero use or hold
count, e.g., those vnodes that are not on the free list. Note that
this list is in addition to the list of all the vnodes associated
with a mount point.

To avoid adding another set of linkage pointers to the vnode
structure, the active list uses the existing linkage pointers
used by the free list (previously named v_freelist, now renamed
v_actfreelist).

This update adds the MNT_VNODE_FOREACH_ACTIVE interface that loops
over just the active vnodes associated with a mount point (typically
less than 1% of the vnodes associated with the mount point).

Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks


# 234400 17-Apr-2012 mckusick

Drop export of vdestroy() function from kern/vfs_subr.c as it is
used only as a helper function in that file. Replace sole call to
vbusy() with inline code in vholdl(). Replace sole calls to vfree()
and vdestroy() with inline code in vdropl().

The Clang compiler already inlines these functions, so they do not
show up in a kernel backtrace which is confusing. Also you cannot
set their frame in kgdb which means that it is impossible to view
their local variables. So, while the produced code is unchanged,
the debugging should be easier.

Discussed with: kib
MFC after: 2 weeks


# 234158 11-Apr-2012 mckusick

Export vinactive() from kern/vfs_subr.c (e.g., make it no longer
static and declare its prototype in sys/vnode.h) so that it can be
called from process_deferred_inactive() (in ufs/ffs/ffs_snapshot.c)
instead of the body of vinactive() being cut and pasted into
process_deferred_inactive().

Reviewed by: kib
MFC after: 2 weeks


# 233463 25-Mar-2012 trasz

Remove unused define.

Discussed with: kib


# 232821 11-Mar-2012 kib

Remove fifo.h. The only used function declaration from the header is
migrated to sys/vnode.h.

Submitted by: gianni


# 232420 02-Mar-2012 rmacklem

Post r230394, the Lookup RPC counts for both NFS clients increased
significantly. Upon investigation this was caused by name cache
misses for lookups of "..". For name cache entries for non-".."
directories, the cache entry serves double duty. It maps both the
named directory plus ".." for the parent of the directory. As such,
two ctime values (one for each of the directory and its parent) need
to be saved in the name cache entry.
This patch adds an entry for ctime of the parent directory to the
name cache. It also adds an additional uma zone for large entries
with this time value, in order to minimize memory wastage.
As well, it fixes a couple of cases where the mtime of the parent
directory was being saved instead of ctime for positive name cache
entries. With this patch, Lookup RPC counts return to values similar
to pre-r230394 kernels.

Reported by: bde
Discussed with: kib
Reviewed by: jhb
MFC after: 2 weeks


# 232317 29-Feb-2012 trociny

Introduce VOP_UNP_BIND(), VOP_UNP_CONNECT(), and VOP_UNP_DETACH()
operations for setting and accessing vnode's v_socket field.

The operations are necessary to implement proper unix socket handling
on layered file systems like nullfs(5).

This change fixes the long standing issue with nullfs(5) being in that
unix sockets did not work between lower and upper layers: if we bound
to a socket on the lower layer we could connect only to the lower
path; if we bound to the upper layer we could connect only to the
upper path. The new behavior is one can connect to both the lower and
the upper paths regardless what layer path one binds to.

PR: kern/51583, kern/159663
Suggested by: kib
Reviewed by: arch
MFC after: 2 weeks


# 232152 25-Feb-2012 trociny

When detaching an unix domain socket, uipc_detach() checks
unp->unp_vnode pointer to detect if there is a vnode associated with
(binded to) this socket and does necessary cleanup if there is.

The issue is that after forced unmount this check may be too late as
the unp_vnode is reclaimed and the reference is stale.

To fix this provide a helper function that is called on a socket vnode
reclamation to do necessary cleanup.

Pointed by: kib
Reviewed by: kib
MFC after: 2 weeks


# 231949 20-Feb-2012 kib

Fix found places where uio_resid is truncated to int.

Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the
sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from
the usermode.

Discussed with: bde, das (previous versions)
MFC after: 1 month


# 231220 08-Feb-2012 kib

Trim 8 unused bytes from struct vnode on 64-bit architectures.

Reviewed by: alc


# 231088 06-Feb-2012 jhb

Rename cache_lookup_times() to cache_lookup() and retire the old API and
ABI stub for cache_lookup().


# 231075 06-Feb-2012 kib

Current implementations of sync(2) and syncer vnode fsync() VOP uses
mnt_noasync counter to temporary remove MNTK_ASYNC mount option, which
is needed to guarantee a synchronous completion of the initiated i/o
before syscall or VOP return. Global removal of MNTK_ASYNC option is
harmful because not only i/o started from corresponding thread becomes
synchronous, but all i/o is synchronous on the filesystem which is
initiated during sync(2) or syncer activity.

Instead of removing MNTK_ASYNC from mnt_kern_flag, provide a local
thread flag to disable async i/o for current thread only. Use the
opportunity to move DOINGASYNC() macro into sys/vnode.h and
consistently use it through places which tested for MNTK_ASYNC.

Some testing demonstrated 60-70% improvements in run time for the
metadata-intensive operations on async-mounted UFS volumes, but still
with great deviation due to other reasons.

Reviewed by: mckusick
Tested by: scottl
MFC after: 2 weeks


# 230394 20-Jan-2012 jhb

Close a race in NFS lookup processing that could result in stale name cache
entries on one client when a directory was renamed on another client. The
root cause for the stale entry being trusted is that each per-vnode nfsnode
structure has a single 'n_ctime' timestamp used to validate positive name
cache entries. However, if there are multiple entries for a single vnode,
they all share a single timestamp. To fix this, extend the name cache
to allow filesystems to optionally store a timestamp value in each name
cache entry. The NFS clients now fetch the timestamp associated with
each name cache entry and use that to validate cache hits instead of the
timestamps previously stored in the nfsnode. Another part of the fix is
that the NFS clients now use timestamps from the post-op attributes of
RPCs when adding name cache entries rather than pulling the timestamps out
of the file's attribute cache. The latter is subject to races with other
lookups updating the attribute cache concurrently. Some more details:
- Add a variant of nfsm_postop_attr() to the old NFS client that can return
a vattr structure with a copy of the post-op attributes.
- Handle lookups of "." as a special case in the NFS clients since the name
cache does not store name cache entries for ".", so we cannot get a
useful timestamp. It didn't really make much sense to recheck the
attributes on the the directory to validate the namecache hit for "."
anyway.
- ABI compat shims for the name cache routines are present in this commit
so that it is safe to MFC.

MFC after: 2 weeks


# 230129 15-Jan-2012 mm

Introduce vn_path_to_global_path()

This function updates path string to vnode's full global path and checks
the size of the new path string against the pathlen argument.

In vfs_domount(), sys_unmount() and kern_jail_set() this new function
is used to update the supplied path argument to the respective global path.

Unbreaks jailed zfs(8) with enforce_statfs set to 1.

Reviewed by: kib
MFC after: 1 month


# 228849 23-Dec-2011 jhb

Add post-VOP hooks for VOP_DELETEEXTATTR() and VOP_SETEXTATTR() and use
these to trigger a NOTE_ATTRIB EVFILT_VNODE kevent when the extended
attributes of a vnode are changed.

Note that OS X already implements this behavior.

Reviewed by: rwatson
MFC after: 2 weeks


# 227070 04-Nov-2011 jhb

Add the posix_fadvise(2) system call. It is somewhat similar to
madvise(2) except that it operates on a file descriptor instead of a
memory region. It is currently only supported on regular files.

Just as with madvise(2), the advice given to posix_fadvise(2) can be
divided into two types. The first type provide hints about data access
patterns and are used in the file read and write routines to modify the
I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are
thus filesystem independent. Note that to ease implementation (and
since this API is only advisory anyway), only a single non-normal
range is allowed per file descriptor.

The second type of hints are used to hint to the OS that data will or
will not be used. These hints are implemented via a new VOP_ADVISE().
A default implementation is provided which does nothing for the WILLNEED
request and attempts to move any clean pages to the cache page queue for
the DONTNEED request. This latter case required two other changes.
First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests
vinvalbuf() to only flush clean buffers for the vnode from the buffer
cache and to not remove any backing pages from the vnode. This is
used to ensure clean pages are not wired into the buffer cache before
attempting to move them to the cache page queue. The second change adds
a new vm_object_page_cache() method. This method is somewhat similar to
vm_object_page_remove() except that instead of freeing each page in the
specified range, it attempts to move clean pages to the cache queue if
possible.

To preserve the ABI of struct file, the f_cdevpriv pointer is now reused
in a union to point to the currently active advice region if one is
present for regular files.

Reviewed by: jilles, kib, arch@
Approved by: re (kib)
MFC after: 1 month


# 225166 25-Aug-2011 mm

Generalize ffs_pages_remove() into vn_pages_remove().

Remove mapped pages for all dataset vnodes in zfs_rezget() using
new vn_pages_remove() to fix mmapped files changed by
zfs rollback or zfs receive -F.

PR: kern/160035, kern/156933
Reviewed by: kib, pjd
Approved by: re (kib)
MFC after: 1 week


# 224914 16-Aug-2011 kib

Add the fo_chown and fo_chmod methods to struct fileops and use them
to implement fchown(2) and fchmod(2) support for several file types
that previously lacked it. Add MAC entries for chown/chmod done on
posix shared memory and (old) in-kernel posix semaphores.

Based on the submission by: glebius
Reviewed by: rwatson
Approved by: re (bz)


# 223911 10-Jul-2011 kib

Update locking annotations for the struct vnode.

MFC after: 3 days


# 222958 10-Jun-2011 jeff

Implement fully asynchronous partial truncation with softupdates journaling
to resolve errors which can cause corruption on recovery with the old
synchronous mechanism.

- Append partial truncation freework structures to indirdeps while
truncation is proceeding. These prevent new block pointers from
becoming valid until truncation completes and serialize truncations.
- On completion of a partial truncate journal work waits for zeroed
pointers to hit indirects.
- softdep_journal_freeblocks() handles last frag allocation and last
block zeroing.
- vtruncbuf/ffs_page_remove moved into softdep_*_freeblocks() so it
is only implemented in one place.
- Block allocation failure handling moved up one level so it does not
proceed with buf locks held. This permits us to do more extensive
reclaims when filesystem space is exhausted.
- softdep_sync_metadata() is broken into two parts, the first executes
once at the start of ffs_syncvnode() and flushes truncations and
inode dependencies. The second is called on each locked buf. This
eliminates excessive looping and rollbacks.
- Improve the mechanism in process_worklist_item() that handles
acquiring vnode locks for handle_workitem_remove() so that it works
more generally and does not loop excessively over the same worklist
items on each call.
- Don't corrupt directories by zeroing the tail in fsck. This is only
done for regular files.
- Push a fsync complete record for files that need it so the checker
knows a truncation in the journal is no longer valid.

Discussed with: mckusick, kib (ffs_pages_remove and ffs_truncate parts)
Tested by: pho


# 220791 18-Apr-2011 mdf

Add the posix_fallocate(2) syscall. The default implementation in
vop_stdallocate() is filesystem agnostic and will run as slow as a
read/write loop in userspace; however, it serves to correctly
implement the functionality for filesystems that do not implement a
VOP_ALLOCATE.

Note that __FreeBSD_version was already bumped today to 900036 for any
ports which would like to use this function.

Also reserve space in the syscall table for posix_fadvise(2).

Reviewed by: -arch (previous version)


# 218195 02-Feb-2011 mdf

Put the general logic for being a CPU hog into a new function
should_yield(). Use this in various places. Encapsulate the common
case of check-and-yield into a new function maybe_yield().

Change several checks for a magic number of iterations to use
should_yield() instead.

MFC after: 1 week


# 215548 19-Nov-2010 kib

Remove prtactive variable and related printf()s in the vop_inactive
and vop_reclaim() methods. They seems to be unused, and the reported
situation is normal for the forced unmount.

MFC after: 1 week
X-MFC-note: keep prtactive symbol in vfs_subr.c


# 211531 20-Aug-2010 jhb

Add dedicated routines to toggle lockmgr flags such as LK_NOSHARE and
LK_CANRECURSE after a lock is created. Use them to implement macros that
otherwise manipulated the flags directly. Assert that the associated
lockmgr lock is exclusively locked by the current thread when manipulating
these flags to ensure the flag updates are safe. This last change required
some minor shuffling in a few filesystems to exclusively lock a brand new
vnode slightly earlier.

Reviewed by: kib
MFC after: 3 days


# 210923 06-Aug-2010 kib

Add new make_dev_p(9) flag MAKEDEV_ETERNAL to inform devfs that created
cdev will never be destroyed. Propagate the flag to devfs vnodes as
VV_ETERNVALDEV. Use the flags to avoid acquiring devmtx and taking a
thread reference on such nodes.

In collaboration with: pho
MFC after: 1 month


# 209056 11-Jun-2010 avg

vnode.h: expand debug macros to non-empty void statements when
DEBUG_VFS_LOCKS is disabled

MFC after: 2 weeks


# 208003 12-May-2010 zml

Add VOP_ADVLOCKPURGE so that the file system is called when purging
locks (in the case where the VFS impl isn't using lf_*)

Submitted by: Matthew Fleming <matthew.fleming@isilon.com>
Reviewed by: zml, dfr


# 207719 06-May-2010 trasz

Style fixes and removal of unneeded variable.

Submitted by: bde@


# 207662 05-May-2010 trasz

Move checking against RLIMIT_FSIZE into one place, vn_rlimit_fsize().

Reviewed by: kib


# 206093 02-Apr-2010 kib

Add function vop_rename_fail(9) that performs needed cleanup for locks
and references of the VOP_RENAME(9) arguments. Use vop_rename_fail()
in deadfs_rename().

Tested by: Mikolaj Golub
MFC after: 1 week


# 202528 17-Jan-2010 kib

Add new function vunref(9) that decrements vnode use count (and hold
count) while vnode is exclusively locked.

The code for vput(9), vrele(9) and vunref(9) is merged.

In collaboration with: pho
Reviewed by: alc
MFC after: 3 weeks


# 200770 21-Dec-2009 kib

VI_OBJDIRTY vnode flag mirrors the state of OBJ_MIGHTBEDIRTY vm object
flag. Besides providing the redundand information, need to update both
vnode and object flags causes more acquisition of vnode interlock.
OBJ_MIGHTBEDIRTY is only checked for vnode-backed vm objects.

Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for
vnode-backed vm objects.

Suggested and reviewed by: alc
Tested by: pho
MFC after: 3 weeks


# 197680 01-Oct-2009 trasz

Provide default implementation for VOP_ACCESS(9), so that filesystems which
want to provide VOP_ACCESSX(9) don't have to implement both. Note that
this commit makes implementation of either of these two mandatory.

Reviewed by: kib


# 197405 22-Sep-2009 trasz

Add pieces of infrastructure required for NFSv4 ACL support in UFS.

Reviewed by: rwatson


# 195148 28-Jun-2009 stas

- Turn the third (islocked) argument of the knote call into flags parameter.
Introduce the new flag KNF_NOKQLOCK to allow event callers to be called
without KQ_LOCK mtx held.
- Modify VFS knote calls to always use KNF_NOKQLOCK flag. This is required
for ZFS as its getattr implementation may sleep.

Approved by: re (rwatson)
Reviewed by: kib
MFC after: 2 weeks


# 194972 25-Jun-2009 trasz

Tweak comment.


# 194601 21-Jun-2009 kib

Add explicit struct ucred * argument for VOP_VPTOCNP, to be used by
vn_open_cred in default implementation. Valid struct ucred is needed for
audit and MAC, and curthread credentials may be wrong.

This further requires modifying the interface of vn_fullpath(9), but it
is out of scope of this change.

Reviewed by: rwatson


# 194586 21-Jun-2009 kib

Add another flags argument to vn_open_cred. Use it to specify that some
vn_open_cred invocations shall not audit namei path.

In particular, specify VN_OPEN_NOAUDIT for dotdot lookup performed by
default implementation of vop_vptocnp, and for the open done for core
file. vn_fullpath is called from the audit code, and vn_open there need
to disable audit to avoid infinite recursion. Core file is created on
return to user mode, that, in particular, happens during syscall return.
The creation of the core file is audited by direct calls, and we do not
want to overwrite audit information for syscall.

Reported, reviewed and tested by: rwatson


# 193307 02-Jun-2009 attilio

Handle lock recursion differenty by always checking against LO_RECURSABLE
instead the lock own flag itself.

Tested by: pho


# 193174 31-May-2009 kib

Eliminate code duplication in vn_fullpath1() around the cache lookups
and calls to vn_vptocnp() by moving more of the common code to
vn_vptocnp(). Rename vn_vptocnp() to vn_vptocnp_locked() to signify that
cache is locked around the call.

Do not track buffer position by both the pointer and offset, use only
buflen to record the start of the free space.

Export vn_vptocnp() for external consumers as a wrapper around
vn_vptocnp_locked() that locks the cache and handles hold counts.

Tested by: pho


# 193092 30-May-2009 trasz

Add VOP_ACCESSX, which can be used to query for newly added V*
permissions, such as VWRITE_ACL. For a filsystems that don't
implement it, there is a default implementation, which works
as a wrapper around VOP_ACCESS.

Reviewed by: rwatson@


# 190888 10-Apr-2009 rwatson

Remove VOP_LEASE and supporting functions. This hasn't been used since
the removal of NQNFS, but was left in in case it was required for NFSv4.
Since our new NFSv4 client and server can't use it for their
requirements, GC the old mechanism, as well as other unused lease-
related code and interfaces.

Due to its impact on kernel programming and binary interfaces, this
change should not be MFC'd.

Proposed by: jeff
Reviewed by: jeff
Discussed with: rmacklem, zach loafman @ isilon


# 190533 29-Mar-2009 kan

Replace v_dd vnode pointer with v_cache_dd pointer to struct namecache
in directory vnodes. Allow namecache dotdot entry to be created pointing
from child vnode to parent vnode if no existing links in opposite
direction exist. Use direct link from parent to child for dotdot lookups
otherwise.

This restores more efficient dotdot caching in NFS filesystems which
was lost when vnodes stoppped being type stable.

Reviewed by: kib


# 190524 29-Mar-2009 trasz

Get rid of VSTAT and replace it with VSTAT_PERMS, which is somewhat
better defined.

Approved by: rwatson (mentor)


# 190481 27-Mar-2009 trasz

Add new V* constants, neccessary for granular permission checks
in NFSv4 ACLs. While here, get rid of VALLPERM; it wasn't used anyway.

Approved by: rwatson (mentor)


# 189540 08-Mar-2009 marcus

Add a prototype for the new vop_stdvptocnp function.

Reviewed by: kib
Approved by: kib
Tested by: pho


# 188833 19-Feb-2009 jhb

Enable caching of negative pathname lookups in the NFS client. To avoid
stale entries, we save a copy of the directory's modification time when
the first negative cache entry was added in the directory's NFS node.
When a negative cache entry is hit during a pathname lookup, the parent
directory's modification time is checked. If it has changed, all of the
negative cache entries for that parent are purged and the lookup falls
back to using the RPC. This required adding a new cache_purge_negative()
method to the name cache to purge only negative cache entries for a given
directory.

Submitted by: mohans, Rick Macklem, Ricardo Labiaga @ NetApp
Reviewed by: mohans


# 187833 28-Jan-2009 jhb

Actually remove the VA_MARK_ATIME flag. This should have been in the
earlier commit to add VOP_MARKATIME().


# 187528 21-Jan-2009 kib

Move the code from ufs_lookup.c used to do dotdot lookup, into
the helper function. It is supposed to be useful for any filesystem
that has to unlock dvp to walk to the ".." entry in lookup routine.

Requested by: jhb
Tested by: pho
MFC after: 1 month


# 185957 11-Dec-2008 marcus

Add a new error VOP, VOP_ENOENT. This function will simply return ENOENT.

Reviewed by: arch
Approved by: kib


# 185029 17-Nov-2008 pjd

Update ZFS from version 6 to 13 and bring some FreeBSD-specific changes.

This bring huge amount of changes, I'll enumerate only user-visible changes:

- Delegated Administration

Allows regular users to perform ZFS operations, like file system
creation, snapshot creation, etc.

- L2ARC

Level 2 cache for ZFS - allows to use additional disks for cache.
Huge performance improvements mostly for random read of mostly
static content.

- slog

Allow to use additional disks for ZFS Intent Log to speed up
operations like fsync(2).

- vfs.zfs.super_owner

Allows regular users to perform privileged operations on files stored
on ZFS file systems owned by him. Very careful with this one.

- chflags(2)

Not all the flags are supported. This still needs work.

- ZFSBoot

Support to boot off of ZFS pool. Not finished, AFAIK.

Submitted by: dfr

- Snapshot properties

- New failure modes

Before if write requested failed, system paniced. Now one
can select from one of three failure modes:
- panic - panic on write error
- wait - wait for disk to reappear
- continue - serve read requests if possible, block write requests

- Refquota, refreservation properties

Just quota and reservation properties, but don't count space consumed
by children file systems, clones and snapshots.

- Sparse volumes

ZVOLs that don't reserve space in the pool.

- External attributes

Compatible with extattr(2).

- NFSv4-ACLs

Not sure about the status, might not be complete yet.

Submitted by: trasz

- Creation-time properties

- Regression tests for zpool(8) command.

Obtained from: OpenSolaris


# 184413 28-Oct-2008 trasz

Introduce accmode_t. This is required for NFSv4 ACLs - it will be neccessary
to add more V* constants, and the variables changed by this patch were often
being assigned to mode_t variables, which is 16 bit.

Approved by: rwatson (mentor)


# 183754 10-Oct-2008 attilio

Remove the struct thread unuseful argument from bufobj interface.
In particular following functions KPI results modified:
- bufobj_invalbuf()
- bufsync()

and BO_SYNC() "virtual method" of the buffer objects set.
Main consumers of bufobj functions are affected by this change too and,
in particular, functions which changed their KPI are:
- vinvalbuf()
- g_vfs_close()

Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.

As a side note, please consider just temporary the 'curthread' argument
passing to VOP_SYNC() (in bufsync()) as it will be axed out ASAP

Reviewed by: kib
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


# 183071 16-Sep-2008 kib

Garbage-collect vn_write_suspend_wait().

Suggested and reviewed by: tegge
Tested by: pho
MFC after: 1 month


# 182905 10-Sep-2008 trasz

Remove VSVTX, VSGID and VSUID. This should be a no-op,
as VSVTX == S_ISVTX, VSGID == S_ISGID and VSUID == S_ISUID.

Approved by: rwatson (mentor)


# 182371 28-Aug-2008 attilio

Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread
was always curthread and totally unuseful.

Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


# 182364 28-Aug-2008 kib

Introduce the VV_FORCEINSMQ vnode flag. It instructs the insmnque() function
to ignore the unmounting and forces insertion of the vnode into the mount
vnode list.

Change insmntque() to fail when forced unmount is in progress and
VV_FORCEINSMQ is not specified.

Add an assertion to the insmntque(), requiring the vnode to be
exclusively locked for mp-safe filesystems.

Use the VV_FORCEINSMQ for the creation of the syncvnode.

Tested by: pho
Reviewed by: tegge
MFC after: 1 month


# 181060 31-Jul-2008 csjp

Currently, BSM audit pathname token generation for chrooted or jailed
processes are not producing absolute pathname tokens. It is required
that audited pathnames are generated relative to the global root mount
point. This modification changes our implementation of audit_canon_path(9)
and introduces a new function: vn_fullpath_global(9) which performs a
vnode -> pathname translation relative to the global mount point based
on the contents of the name cache. Much like vn_fullpath,
vn_fullpath_global is a wrapper function which called vn_fullpath1.

Further, the string parsing routines have been converted to use the
sbuf(9) framework. This change also removes the conditional acquisition
of Giant, since the vn_fullpath1 method will not dip into file system
dependent code.

The vnode locking was modified to use vhold()/vdrop() instead the vref()
and vrele(). This will modify the hold count instead of modifying the
user count. This makes more sense since it's the kernel that requires
the reference to the vnode. This also makes sure that the vnode does not
get recycled we hold the reference to it. [1]

Discussed with: rwatson
Reviewed by: kib [1]
MFC after: 2 weeks


# 178243 16-Apr-2008 kib

Move the head of byte-level advisory lock list from the
filesystem-specific vnode data to the struct vnode. Provide the
default implementation for the vop_advlock and vop_advlockasync.
Purge the locks on the vnode reclaim by using the lf_purgelocks().
The default implementation is augmented for the nfs and smbfs.
In the nfs_advlock, push the Giant inside the nfs_dolock.

Before the change, the vop_advlock and vop_advlockasync have taken the
unlocked vnode and dereferenced the fs-private inode data, racing with
with the vnode reclamation due to forced unmount. Now, the vop_getattr
under the shared vnode lock is used to obtain the inode size, and
later, in the lf_advlockasync, after locking the vnode interlock, the
VI_DOOMED flag is checked to prevent an operation on the doomed vnode.

The implementation of the lf_purgelocks() is submitted by dfr.

Reported by: kris
Tested by: kris, pho
Discussed with: jeff, dfr
MFC after: 2 weeks


# 177957 06-Apr-2008 attilio

Optimize lockmgr in order to get rid of the pool mutex interlock, of the
state transitioning flags and of msleep(9) callings.
Use, instead, an algorithm very similar to what sx(9) and rwlock(9)
alredy do and direct accesses to the sleepqueue(9) primitive.

In order to avoid writer starvation a mechanism very similar to what
rwlock(9) uses now is implemented, with the correspective per-thread
shared lockmgrs counter.

This patch also adds 2 new functions to lockmgr KPI: lockmgr_rw() and
lockmgr_args_rw(). These two are like the 2 "normal" versions, but they
both accept a rwlock as interlock. In order to realize this, the general
lockmgr manager function "__lockmgr_args()" has been implemented through
the generic lock layer. It supports all the blocking primitives, but
currently only these 2 mappers live.

The patch drops the support for WITNESS atm, but it will be probabilly
added soon. Also, there is a little race in the draining code which is
also present in the current CVS stock implementation: if some sharers,
once they wakeup, are in the runqueue they can contend the lock with
the exclusive drainer. This is hard to be fixed but the now committed
code mitigate this issue a lot better than the (past) CVS version.
In addition assertive KA_HELD and KA_UNHELD have been made mute
assertions because they are dangerous and they will be nomore supported
soon.

In order to avoid namespace pollution, stack.h is splitted into two
parts: one which includes only the "struct stack" definition (_stack.h)
and one defining the KPI. In this way, newly added _lockmgr.h can
just include _stack.h.

Kernel ABI results heavilly changed by this commit (the now committed
version of "struct lock" is a lot smaller than the previous one) and
KPI results broken by lockmgr_rw() / lockmgr_args_rw() introduction,
so manpages and __FreeBSD_version will be updated accordingly.

Tested by: kris, pho, jeff, danger
Reviewed by: jeff
Sponsored by: Google, Summer of Code program 2007


# 177782 31-Mar-2008 kib

Add the utility function vn_commname() to retrieve the command name
from the vfs namecache, when available.

Reviewed by: rwatson, rdivacky
Tested by: pho


# 177537 24-Mar-2008 jeff

- Remove an old comment; vnodes have been working without Giant for
years now.
- Clarify the locking required for VI_DOOMED in preparation for
simplifications to vget() and vn_lock().


# 177528 23-Mar-2008 kib

Yield the cpu in the kernel while iterating the list of the
vnodes belonging to the mountpoint. Also, yield when in the
softdep_process_worklist() even when we are not going to sleep due to
buffer drain.

It is believed that the ULE fixed the problem [1], but the yielding
seems to be needed at least for the 4BSD case.

Discussed: on stable@, with bde
Reviewed by: tegge, jeff [1]
MFC after: 2 weeks


# 176708 01-Mar-2008 attilio

- Handle buffer lock waiters count directly in the buffer cache instead
than rely on the lockmgr support [1]:
* bump the waiters only if the interlock is held
* let brelvp() return the waiters count
* rely on brelvp() instead than BUF_LOCKWAITERS() in order to check
for the waiters number
- Remove a namespace pollution introduced recently with lockmgr.h
including lock.h by including lock.h directly in the consumers and
making it mandatory for using lockmgr.
- Modify flags accepted by lockinit():
* introduce LK_NOPROFILE which disables lock profiling for the
specified lockmgr
* introduce LK_QUIET which disables ktr tracing for the specified
lockmgr [2]
* disallow LK_SLEEPFAIL and LK_NOWAIT to be passed there so that it
can only be used on a per-instance basis
- Remove BUF_LOCKWAITERS() and lockwaiters() as they are no longer
used

This patch breaks KPI so __FreBSD_version will be bumped and manpages
updated by further commits. Additively, 'struct buf' changes results in
a disturbed ABI also.

[2] Really, currently there is no ktr tracing in the lockmgr, but it
will be added soon.

[1] Submitted by: kib
Tested by: pho, Andrea Barberio <insomniac at slackware dot it>


# 176519 24-Feb-2008 attilio

Introduce some functions in the vnode locks namespace and in the ffs
namespace in order to handle lockmgr fields in a controlled way instead
than spreading all around bogus stubs:
- VN_LOCK_AREC() allows lock recursion for a specified vnode
- VN_LOCK_ASHARE() allows lock sharing for a specified vnode

In FFS land:
- BUF_AREC() allows lock recursion for a specified buffer lock
- BUF_NOREC() disallows recursion for a specified buffer lock

Side note: union_subr.c::unionfs_node_update() is the only other function
directly handling lockmgr fields. As this is not simple to fix, it has
been left behind as "sole" exception.


# 175294 13-Jan-2008 attilio

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


# 175202 09-Jan-2008 attilio

vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


# 172826 20-Oct-2007 pjd

Remove redundant prototypes.


# 170152 31-May-2007 kib

Revert UF_OPENING workaround for CURRENT.
Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation
argument from being file descriptor index into the pointer to struct file.

Proposed and reviewed by: jhb
Reviewed by: daichi (unionfs)
Approved by: re (kensmith)


# 169998 25-May-2007 pjd

The cache_leaf_test() function seems to be unused, so remove it.


# 169671 18-May-2007 kib

Since renaming of vop_lock to _vop_lock, pre- and post-condition
function calls are no more generated for vop_lock.
Rename _vop_lock to vop_lock1 to satisfy tools/vnode_if.awk assumption
about vop naming conventions. This restores pre/post-condition calls.


# 168192 31-Mar-2007 des

Make vdropl() public; zfs needs it. There is also plenty of existing
file system code (mostly *_reclaim()) which look like this:

VOP_LOCK(vp);
/* examine vp */
VOP_UNLOCK(vp);
vdrop(vp);

This can now be rewritten to:

VOP_LOCK(vp);
/* examine vp */
vdropl(vp); /* will unlock vp */

MFC after: 1 week


# 167497 12-Mar-2007 tegge

Make insmntque() externally visibile and allow it to fail (e.g. during
late stages of unmount). On failure, the vnode is recycled.

Add insmntque1(), to allow for file system specific cleanup when
recycling vnode on failure.

Change getnewvnode() to no longer call insmntque(). Previously,
embryonic vnodes were put onto the list of vnode belonging to a file
system, which is unsafe for a file system marked MPSAFE.

Change vfs_hash_insert() to no longer lock the vnode. The caller now
has that responsibility.

Change most file systems to lock the vnode and call insmntque() or
insmntque1() after a new vnode has been sufficiently setup. Handle
failed insmntque*() calls by propagating errors to callers, possibly
after some file system specific cleanup.

Approved by: re (kensmith)
Reviewed by: kib
In collaboration with: kib


# 166774 15-Feb-2007 pjd

Move vnode-to-file-handle translation from vfs_vptofh to vop_vptofh method.
This way we may support multiple structures in v_data vnode field within
one file system without using black magic.

Vnode-to-file-handle should be VOP in the first place, but was made VFS
operation to keep interface as compatible as possible with SUN's VFS.
BTW. Now Solaris also implements vnode-to-file-handle as VOP operation.

VFS_VPTOFH() was left for API backward compatibility, but is marked for
removal before 8.0-RELEASE.

Approved by: mckusick
Discussed with: many (on IRC)
Tested with: ufs, msdosfs, cd9660, nullfs and zfs


# 165210 14-Dec-2006 kib

Use tab after #define.

Pointed out by: pjd


# 165203 14-Dec-2006 kib

Resolve two deadlocks that could be caused by busy md device backed
by vnode. Allow for md thread and the thread that owns lock on vnode
backing the md device to do the write even when runningbufspace is
exhausted.

Tested by: Peter Holm
Reviewed by: tegge
MFC after: 2 weeks


# 164248 13-Nov-2006 kmacy

change vop_lock handling to allowing tracking of callers' file and line for
acquisition of lockmgr locks

Approved by: scottl (standing in for mentor rwatson)


# 163841 31-Oct-2006 pjd

Add gjournal specific code to the UFS file system:
- Add FS_GJOURNAL flag which enables gjournal support on a file system.
- Add cg_unrefs field to the cylinder group structure which holds
number of unreferenced (orphaned) inodes in the given cylinder group.
- Add fs_unrefs field to the super block structure which holds
total number of unreferenced (orphaned) inodes.
- When file or a directory is orphaned (last reference is removed, but
object is still open), increase fs_unrefs and cg_unrefs fields,
which is a hint for fsck in which cylinder groups looks for such
(orphaned) objects.
- When file is last closed, decrease {fs,cg}_unrefs fields.
- Add VV_DELETED vnode flag which points at orphaned objects.

Sponsored by: home.pl


# 157832 18-Apr-2006 delphij

In vfs_hash_get(): mount point should never be changed
so explicitly constify the mp parameter.

Reviewed by: phk


# 156451 08-Mar-2006 tegge

Use vn_start_secondary_write() and vn_finished_secondary_write() as a
replacement for vn_write_suspend_wait() to better account for secondary write
processing.

Close race where secondary writes could be started after ffs_sync() returned
but before the file system was marked as suspended.

Detect if secondary writes or softdep processing occurred during vnode sync
loop in ffs_sync() and retry the loop if needed.


# 156203 02-Mar-2006 jeff

- Move softdep from using a global worklist to per-mount worklists. This
has many positive effects including improved smp locking, reducing
interdependencies between mounts that can lead to deadlocks, etc.
- Add the softdep worklist and various counters to the ufsmnt structure.
- Add a mount pointer to the workitem and remove mount pointers from the
various structures derived from the workitem as they are now redundant.
- Remove the poor-man's semaphore protecting softdep_process_worklist and
softdep_flushworklist. Several threads may now process the list
simultaneously.
- Add softdep_waitidle() to block the thread until all pending
dependencies being operated on by other threads have been flushed.
- Use softdep_waitidle() in unmount and snapshots to block either
operation until the fs is stable.
- Remove softdep worklist processing from the syncer and move it into the
softdep_flush() thread. This thread processes all softdep mounts
once each second and when it is called via the new softdep_speedup()
when there is a resource shortage. This removes the softdep hook
from the kernel and various hacks in header files to support it.

Reviewed by/Discussed with: tegge, truckman, mckusick
Tested by: kris


# 155177 01-Feb-2006 yar

Use off_t for file size passed to vnode_create_vobject().
The former type, size_t, was causing truncation to 32 bits on i386,
which immediately led to undersizing of VM objects backed by
files >4GB. In particular, sendfile(2) was broken for such files.

PR: kern/92243
MFC after: 5 days


# 154390 15-Jan-2006 rwatson

Rename uid and gid arguments to vaccess() prototype to match vaccess()
implementation in vfs_subr.c. No functional change.

MFC after: 3 days


# 154152 09-Jan-2006 tegge

Add marker vnodes to ensure that all vnodes associated with the mount point are
iterated over when using MNT_VNODE_FOREACH.

Reviewed by: truckman


# 153400 13-Dec-2005 des

Eradicate caddr_t from the VFS API.


# 153397 13-Dec-2005 des

Nuke vnodeop_desc.vdesc_transports, which has been unused since the dawn
of time (or the inception of ncvs, whichever came last)


# 151252 12-Oct-2005 dds

Move execve's access time update functionality into a new
vfs_mark_atime() function, and use the new function for
performing efficient atime updates in mmap().

Reviewed by: bde
MFC after: 2 weeks


# 150020 12-Sep-2005 phk

Introduce vfs_read_dirent() which can help VOP_READDIR() implementations
by handling all the cookie stuff.


# 148768 05-Aug-2005 ssouhlal

Holding a vnode doesn't prevent v_mount from disappearing (when the
vnode is inactivated), possibly leading to a NULL dereference when
checking if the mount wants knotes to be activated in the VOP hooks.
So, we add a new vnode flag VV_NOKNOTE that is only set in getnewvnode(),
if necessary, and check it when activating knotes.
Since the flags are not erased when a vnode is being held, we can safely
read them.

Reviewed by: kris@
MFC after: 3 days


# 148668 03-Aug-2005 jeff

- Replace the series of DEBUG_LOCKS hacks which tried to save the vn_lock
caller by saving the stack of the last locker/unlocker in lockmgr. We
also put the stack in KTR at the moment.

Contributed by: Antoine Brodin <antoine.brodin@laposte.net>


# 147732 01-Jul-2005 ssouhlal

Mistakingly undefined VN_KNOTE_LOCKED in my previous commit.

Noticed by: Antoine Brodin <antoine.brodin@laposte.net>
Approved by: re (scottl)


# 147730 01-Jul-2005 ssouhlal

Fix the recent panics/LORs/hangs created by my kqueue commit by:

- Introducing the possibility of using locks different than mutexes
for the knlist locking. In order to do this, we add three arguments to
knlist_init() to specify the functions to use to lock, unlock and
check if the lock is owned. If these arguments are NULL, we assume
mtx_lock, mtx_unlock and mtx_owned, respectively.

- Using the vnode lock for the knlist locking, when doing kqueue operations
on a vnode. This way, we don't have to lock the vnode while holding a
mutex, in filt_vfsread.

Reviewed by: jmg
Approved by: re (scottl), scottl (mentor override)
Pointyhat to: ssouhlal
Will be happy: everyone


# 147332 13-Jun-2005 jeff

- Don't make vgonel() globally visible, we want to change its prototype
anyway and it's not used outside of vfs_subr.c.
- Change vgonel() to accept a parameter which determines whether or not
we'll put the vnode on the free list when we're done.
- Use the new vgonel() parameter rather than VI_DOOMED to signal our
intentions in vtryrecycle().
- In vgonel() return if VI_DOOMED is already set, this vnode has already
been reclaimed.

Sponsored by: Isilon Systems, Inc.


# 147198 09-Jun-2005 ssouhlal

Allow EVFILT_VNODE events to work on every filesystem type, not just
UFS by:
- Making the pre and post hooks for the VOP functions work even when
DEBUG_VFS_LOCKS is not defined.
- Moving the KNOTE activations into the corresponding VOP hooks.
- Creating a MNTK_NOKNOTE flag for the mnt_kern_flag field of struct
mount that permits filesystems to disable the new behavior.
- Creating a default VOP_KQFILTER function: vfs_kqfilter()

My benchmarks have not revealed any performance degradation.

Reviewed by: jeff, bde
Approved by: rwatson, jmg (kqueue changes), grehan (mentor)


# 146829 31-May-2005 kensmith

This patch addresses a standards violation issue. The standards say a
file's access time should be updated when it gets executed. A while
ago the mechanism used to exec was changed to use a more mmap based
mechanism and this behavior was broken as a side-effect of that.

A new vnode flag is added that gets set when the file gets executed,
and the VOP_SETATTR() vnode operation gets called. The underlying
filesystem is expected to handle it based on its own semantics, some
filesystems don't support access time at all. Those that do should
handle it in a way that does not block, does not generate I/O if possible,
etc. In particular vn_start_write() has not been called. The UFS code
handles it the same way as it would normally handle the access time if
a file was read - the IN_ACCESS flag gets set in the inode but no other
action happens at this point. The actual time update will happen later
during a sync (which handles all the necessary locking).

Got me into this: cperciva
Discussed with: a lot with bde, a little with kan
Showed patches to: phk, jeffr, standards@, arch@
Minor discussion on: arch@


# 145589 27-Apr-2005 jeff

- Changes to vgone() and related teardown code have meant that the vxthread
pointer is no longer needed.


# 145423 22-Apr-2005 jeff

- Add a VI_LOCK_FLAGS so we can pass MTX_DUPOK in. This somewhat defeats
the purpose of having macros to hide the lock type as we may now be
dependent on MTX_ flags.

Sponsored by: Isilon Systems, Inc.


# 144918 11-Apr-2005 jeff

- Add the mising ASSERT_VOP_ELOCKED code in the !DEBUG_VFS_LOCKS case.

Pointy hat to: me


# 144908 11-Apr-2005 jeff

- Enable ASSERT_VOP_ELOCKED and assert_vop_elocked() now that vnode_if.awk
uses it.

Sponsored by: Isilon Systems, Inc.


# 144320 30-Mar-2005 das

Eliminate v_id and v_ddid. This changes struct vnode, so all
filesystem modules must be recompiled. (Since struct vnode has
already changed in 6-CURRENT, there's little advantage to leaving
the unused fields around.)


# 144051 24-Mar-2005 jeff

- If vput() is called with a shared lock it must upgrade to an exclusive
before it can call VOP_INACTIVE(). This must use the EXCLUPGRADE path
because we may violate some lock order with another locked vnode if
we drop and reacquire the lock. If EXCLUPGRADE fails, we mark the
vnode with VI_OWEINACT. This case should be very rare.
- Clear VI_OWEINACT in vinactive() and vbusy().
- If VI_OWEINACT is set in vgone() do the VOP_INACTIVE call here as well.

Sponsored by: Isilon Systems, Inc.


# 143692 16-Mar-2005 phk

Add two arguments to the vfs_hash() KPI so that filesystems which do
not have unique hashes (NFS) can also use it.


# 143680 16-Mar-2005 phk

Add mnt_hashseed to struct mount and initialize it witn PRNG bits, use
it to get better hashing in vfs_hash.

In case of an insert collision in vfs_hash_insert(), put the loosing vnode
on a special list so that vfs_hash_remove() can just assume that it is on
a list.

Drop the VI_HASHED flag.


# 143652 15-Mar-2005 jeff

- Now that there are no external users of vfree() make it static.
- Move VSHOULDBUSY, VSHOULDFREE, and VTRYRECYCLE into vfs_subr.c so
no one else attempts to grow a dependency on them.
- Now that objects with pages hold the vnode we don't have to do unlocked
checks for the page count in the vm object in VSHOULDFREE. These three
macros could simply check for holdcnt state transitions to determine
whether the vnode is on the free list already, but the extra safety
the flag affords us is probably worth the minimal cost.
- The leafonly sysctl and code have been dead for several years now,
remove the sysctl and the code that employed it from vtryrecycle().
- vtryrecycle() also no longer has to check the object's page count as
the object holds the vnode until it reaches 0.

Sponsored by: Isilon Systems, Inc.


# 143640 15-Mar-2005 jeff

- Expose vholdl() so it may be used outside of vfs_subr.c


# 143561 14-Mar-2005 phk

Currently (almost) all filesystems maintain a local inode hash table
to get from (mount + inode) to vnode. These tables are mostly
copy&pasted from UFS, sized based on desiredvnodes and therefore
quite large (128K-512K). Several filesystems are buggy enough that
they allocate the hash table even before they know if they will
ever be used or not.

Add "vfs_hash", a system wide hash table, which will replace all
the per-filesystem hash-tables.

The fields we add to struct vnode will more or less be saved in
the respective filesystems inodes.

Having one central implementation will save code and will allow us
to justify the complexity of code to dynamically (re)size the hash
at a later point.


# 143560 14-Mar-2005 jeff

- Increment the holdcnt once for each usecount reference. This allows us
to use only the holdcnt to determine whether a vnode may be recycled,
simplifying the V* macros as well as vtryrecycle(), etc.

Sponsored by: Isilon Systems, Inc.


# 143557 14-Mar-2005 jeff

- We do not have to check the object's ref_count in VSHOULDFREE or
vtryrecycle(). All obj refs also ref the vnode.
- Consistently use v_incr_usecount() to increment the usecount. This will
be more important later.

Sponsored by: Isilon Systems, Inc.


# 143554 14-Mar-2005 jeff

- Retire OLOCK and OWANT. All callers hold the vnode lock when creating
a vnode object. There has been an assert to prove this for some time.

Sponsored by: Isilon Systems, Inc.


# 143493 13-Mar-2005 jeff

- Get rid of VXLOCK, VXWANT, and vx_*. The vnode lock now protects us
against recycling.
- Modify VSHOULDFREE, VCANRECYCLE, etc. now that certain flags are no
longer important. Remove VMIGHTFREE as it is only used in one place.

Sponsored by: Isilon Systems, Inc.


# 142251 22-Feb-2005 phk

Group the fields in struct vnode by their function and stick comments
there to tell what the function is.


# 142242 22-Feb-2005 phk

Reap more benefits from DEVFS:

List devfs_dirents rather than vnodes off their shared struct cdev, this
saves a pointer field in the vnode at the expense of a field in the
devfs_dirent. There are often 100 times more vnodes so this is bargain.
In addition it makes it harder for people to try to do stypid things like
"finding the vnode from cdev".

Since DEVFS handles all VCHR nodes now, we can do the vnode related
cleanup in devfs_reclaim() instead of in dev_rel() and vgonel().
Similarly, we can do the struct cdev related cleanup in dev_rel()
instead of devfs_reclaim().

rename idestroy_dev() to destroy_devl() for consistency.

Add LIST_ENTRY de_alias to struct devfs_dirent.
Remove v_specnext from struct vnode.
Change si_hlist to si_alist in struct cdev.
String new devfs vnodes' devfs_dirent on si_alist when
we create them and take them off in devfs_reclaim().

Fix devfs_revoke() accordingly. Also don't clear fields
devfs_reclaim() will clear when called from vgone();

Let devfs_reclaim() call dev_rel() instead of vgonel().

Move the usecount tracking from dev_rel() to devfs_reclaim(),
and let dev_rel() take a struct cdev argument instead of vnode.

Destroy SI_CHEAPCLONE devices in dev_rel() (instead of
devfs_reclaim()) when they are no longer used. (This
should maybe happen in devfs_close() instead.)


# 142225 22-Feb-2005 phk

Remove vfinddev(), it is generally bogus when faced with jails and
chroot and has no legitimate use(r)s in the tree.


# 142011 17-Feb-2005 phk

Introduce vx_wait{l}() and use it instead of home-rolled versions.


# 141637 10-Feb-2005 phk

Make various vnode related functions static


# 141606 10-Feb-2005 phk

Add __printflike() to vn_printf()


# 141533 08-Feb-2005 phk

Drag another softupdates tentacle back into FFS: Now that FFS's
vop_fsync is separate from the internal use we can do the full job
there.


# 141448 07-Feb-2005 phk

Remove vop_stddestroyvobject()


# 140936 28-Jan-2005 phk

Remove unused argument to vrecycle()


# 140929 28-Jan-2005 phk

Move the contents of vop_stddestroyvobject() to the new vnode_pager
function vnode_destroy_vobject().

Make the new function zero the vp->v_object pointer so we can tell
if a call is missing.


# 140783 24-Jan-2005 phk

Take VOP_GETVOBJECT() out to pasture. We use the direct pointer now.


# 140781 24-Jan-2005 phk

Kill VOP_CREATEVOBJECT(), it is now the responsibility of the filesystem
for a given vnode to create a vnode_pager object if one is needed.


# 140767 24-Jan-2005 phk

Move the body of vop_stdcreatevobject() over to the vnode_pager under
the name Sande^H^H^H^H^Hvnode_create_vobject().

Make the new function take a size argument which removes the need for
a VOP_STAT() or a very pessimistic guess for disks.

Call that new function from vop_stdcreatevobject().

Make vnode_pager_alloc() private now that its only user came home.


# 140739 24-Jan-2005 phk

Change vprint() to vn_printf() which takes varargs.
Add #define for vprint() to call vn_printf().


# 140734 24-Jan-2005 phk

Kill the VV_OBJBUF and test the v_object for NULL instead.


# 140719 24-Jan-2005 jeff

- Add a VCANRECYCLE() which performs all the checks required to ensure
that we are free to release a vnode.


# 140220 14-Jan-2005 phk

Eliminate unused and unnecessary "cred" argument from vinvalbuf()


# 140181 13-Jan-2005 phk

Ditch vfs_object_create() and make the callers call VOP_CREATEVOBJECT()
directly.


# 140166 13-Jan-2005 phk

Get rid of the VDESC() macro while the pot is boiling anyway, it is
only used from generate files now, so we might as well generate the
right stuff from the start.


# 140165 13-Jan-2005 phk

Change the generated VOP_ macro implementations to improve type checking
and KASSERT coverage.

After this check there is only one "nasty" cast in this code but there
is a KASSERT to protect against the wrong argument structure behind
that cast.

Un-inlining the meat of VOP_FOO() saves 35kB of text segment on a typical
kernel with no change in performance.

We also now run the checking and tracing on VOP's which have been layered
by nullfs, umapfs, deadfs or unionfs.

Add new (non-inline) VOP_FOO_AP() functions which take a "struct
foo_args" argument and does everything the VOP_FOO() macros
used to do with checks and debugging code.

Add KASSERT to VOP_FOO_AP() check for argument type being
correct.

Slim down VOP_FOO() inline functions to just stuff arguments
into the struct foo_args and call VOP_FOO_AP().

Put function pointer to VOP_FOO_AP() into vop_foo_desc structure
and make VCALL() use it instead of the current offsetoff() hack.

Retire vcall() which implemented the offsetoff()

Make deadfs and unionfs use VOP_FOO_AP() calls instead of
VCALL(), we know which specific call we want already.

Remove unneeded arguments to VCALL() in nullfs and umapfs bypass
functions.

Remove unused vdesc_offset and VOFFSET().

Generally improve style/readability of the generated code.


# 139825 07-Jan-2005 imp

/* -> /*- for license, minor formatting changes


# 139188 22-Dec-2004 phk

Shuffle numeric values of the IO_* flags to match the O_* flags from
fcntl.h.

This is in preparation for making the flags passed to device drivers be
consistently from fcntl.h for all entrypoints.

Today open, close and ioctl uses fcntl.h flags, while read and write
uses vnode.h flags.


# 139085 20-Dec-2004 phk

We can only ever get to vgonechrl() from a devfs vnode, so we do not
need to reassign the vp->v_op to devfs_specops, we know that is the
value already.

Make devfs_specops private to devfs.


# 138701 11-Dec-2004 marcel

Revert rev 1.259. The null-pointer function call (a dereference on
ia64) was not the result of a change in the vector operations. It
was caused by the NFS locking code using a FIFO and those bypassing
the vnode. This indirectly caused the panic. The NFS locking code
has been changed.

Requested by: phk


# 138509 07-Dec-2004 phk

The remaining part of nmount/omount/rootfs mount changes. I cannot sensibly
split the conversion of the remaining three filesystems out from the root
mounting changes, so in one go:

cd9660:
Convert to nmount.
Add omount compat shims.
Remove dedicated rootfs mounting code.
Use vfs_mountedfrom()
Rely on vfs_mount.c calling VFS_STATFS()

nfs(client):
Convert to nmount (the simple way, mount_nfs(8) is still necessary).
Add omount compat shims.
Drop COMPAT_PRELITE2 mount arg compatibility.

ffs:
Convert to nmount.
Add omount compat shims.
Remove dedicated rootfs mounting code.
Use vfs_mountedfrom()
Rely on vfs_mount.c calling VFS_STATFS()

Remove vfs_omount() method, all filesystems are now converted.

Remove MNTK_WANTRDWR, handling RO/RW conversions is a filesystem
task, and they all do it now.

Change rootmounting to use DEVFS trampoline:

vfs_mount.c:
Mount devfs on /. Devfs needs no 'from' so this is clean.
symlink /dev to /. This makes it possible to lookup /dev/foo.
Mount "real" root filesystem on /.
Surgically move the devfs mountpoint from under the real root
filesystem onto /dev in the real root filesystem.

Remove now unnecessary getdiskbyname().

kern_init.c:
Don't do devfs mounting and rootvnode assignment here, it was
already handled by vfs_mount.c.

Remove now unused bdevvp(), addaliasu() and addalias(). Put the
few necessary lines in devfs where they belong. This eliminates the
second-last source of bogo vnodes, leaving only the lemming-syncer.

Remove rootdev variable, it doesn't give meaning in a global context and
was not trustworth anyway. Correct information is provided by
statfs(/).


# 138411 05-Dec-2004 marcel

Fix null-pointer indirect function calls introduced in the previous
commit. In the new world order, the transitive closure on the vector
operations is not precomputed. As such, it's unsafe to actually use
any of the function pointers in an indirect function call. They can
be null, and we need to use the default vector in that case.
This is mostly a quick fix for the four function pointers that are
ed explicitly. A more generic or scalable solution is likely to see
the light of day.

No pathos on: current@


# 138290 01-Dec-2004 phk

Back when VOP_* was introduced, we did not have new-style struct
initializations but we did have lofty goals and big ideals.

Adjust to more contemporary circumstances and gain type checking.

Replace the entire vop_t frobbing thing with properly typed
structures. The only casualty is that we can not add a new
VOP_ method with a loadable module. History has not given
us reason to belive this would ever be feasible in the the
first place.

Eliminate in toto VOCALL(), vop_t, VNODEOP_SET() etc.

Give coda correct prototypes and function definitions for
all vop_()s.

Generate a bit more data from the vnode_if.src file: a
struct vop_vector and protype typedefs for all vop methods.

Add a new vop_bypass() and make vop_default be a pointer
to another struct vop_vector.

Remove a lot of vfs_init since vop_vector is ready to use
from the compiler.

Cast various vop_mumble() to void * with uppercase name,
for instance VOP_PANIC, VOP_NULL etc.

Implement VCALL() by making vdesc_offset the offsetof() the
relevant function pointer in vop_vector. This is disgusting
but since the code is generated by a script comparatively
safe. The alternative for nullfs etc. would be much worse.

Fix up all vnode method vectors to remove casts so they
become typesafe. (The bulk of this is generated by scripts)


# 137680 13-Nov-2004 phk

Eliminate vop_revoke() function now that devfs_revoke() does the entire job.


# 137508 10-Nov-2004 phk

Slim vnodes by another four bytes by eliminating the (now) unused field
v_cachedid.


# 137506 10-Nov-2004 phk

Remove vn_todev()


# 137483 09-Nov-2004 phk

Remove vnode->v_cachedfs.

It was only used for the highly dangerous "export all vnodes with a sysctl"
function.


# 136992 27-Oct-2004 phk

Move the syncer linkage from vnode to bufobj.

This is not quite a perfect separation: the syncer still think it knows
that everything is a vnode.


# 136943 25-Oct-2004 phk

Loose the v_dirty* and v_clean* alias macros.

Check the count field where we just want to know the full/empty state,
rather than using TAILQ_EMPTY() or TAILQ_FIRST().


# 136941 25-Oct-2004 phk

Remove vnode->v_bsize. This was a dead-end.


# 136938 25-Oct-2004 phk

Collapse vnode->v_object and buf->b_object into bufobj->bo_object.


# 136770 22-Oct-2004 phk

Alas, poor SPECFS! -- I knew him, Horatio; A filesystem of infinite
jest, of most excellent fancy: he hath taught me lessons a thousand
times; and now, how abhorred in my imagination it is! my gorge rises
at it. Here were those hacks that I have curs'd I know not how
oft. Where be your kludges now? your workarounds? your layering
violations, that were wont to set the table on a roar?

Move the skeleton of specfs into devfs where it now belongs and
bury the rest.


# 136751 21-Oct-2004 phk

Move the VI_BWAIT flag into no bo_flag element of bufobj and call it BO_WWAIT

Add bufobj_wref(), bufobj_wdrop() and bufobj_wwait() to handle the write
count on a bufobj. Bufobj_wdrop() replaces vwakeup().

Use these functions all relevant places except in ffs_softdep.c where
the use if interlocked_sleep() makes this impossible.

Rename b_vnbufs to b_bobufs now that we touch all the relevant files anyway.


# 136750 21-Oct-2004 phk

Add BO_* macros parallel to VI_* macros for manipulating the bo_mtx.

Initialize the bo_mtx when we allocate a vnode i getnewvnode() For
now we point to the vnodes interlock mutex, that retains the exact
same locking sematics.

Move v_numoutput from vnode to bufobj. Add renaming macro to
postpone code sweep.


# 136746 21-Oct-2004 phk

Forced commit to get the right commit message:

Add new include file <sys/bufobj.h> which will contain the gory
details on the new buffer-cache object. (see comments in file
about the direction this is moving).

Include it from <sys/vnode.h> for now to avoid munging a lot of files
which can later be munged back.

Embed a bufobj in vnode.

Move the buf splay trees from the vnode to the bufobj.

Alias the fields to avoid sweeping code yet.

Hide vnode and bufobj behind

#if defined(_KERNEL) || defined(_KVM_VNODE)

to discourage userland voyeurism.


# 136745 21-Oct-2004 phk

Add new function ttyinitmode() which sets our systemwide default
modes on a tty structure. Both the ".init" and the current settings
are initialized allowing the function to be used both at attach and
open time.

The function takes an argument to decide if echoing should be enabled
by default. Echoing should not be enabled for regular physical
serial ports unless they are consoles, in which case they should
be configured by ttyconsolemode() instead.

Use the new function throughout.


# 134899 07-Sep-2004 phk

Create simple function init_va_filerev() for initializing a va_filerev
field.

Replace three instances of longhaired initialization va_filerev fields.

Added XXX comment wondering why we don't use random bits instead of
uptime of the system for this purpose.


# 133741 15-Aug-2004 jmg

Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers. Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.

Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks. Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by: green, rwatson (both earlier versions)


# 133459 10-Aug-2004 rwatson

Modify vnode locking key: the v_pollinfo pointer itself is protected
by Giant; the contents are protected by the pollinfo mutex. We rely
on Giant to prevent races in assigning the value of v_pollinfo.


# 132023 12-Jul-2004 alfred

Make VFS_ROOT() and vflush() take a thread argument.
This is to allow filesystems to decide based on the passed thread
which vnode to return.
Several filesystems used curthread, they now use the passed thread.


# 131565 04-Jul-2004 phk

Blocksize for I/O should be a property of the vnode and not found by groping
around in the vnodes surroundings when we allocate a block.

Assign a blocksize when we create a vnode, and yell a warning (and ignore it)
if we got the wrong size.

Please email all such warnings to me.


# 130640 17-Jun-2004 phk

Second half of the dev_t cleanup.

The big lines are:
NODEV -> NULL
NOUDEV -> NODEV
udev_t -> dev_t
udev2dev() -> findcdev()

Various minor adjustments including handling of userland access to kernel
space struct cdev etc.


# 130585 16-Jun-2004 phk

Do the dreaded s/dev_t/struct cdev */
Bump __FreeBSD_version accordingly.


# 130101 05-Jun-2004 tjr

Change the types of vn_rdwr_inchunks()'s len and aresid arguments to
size_t and size_t *, respectively. Update callers for the new interface.
This is a better fix for overflows that occurred when dumping segments
larger than 2GB to core files.


# 127976 07-Apr-2004 imp

Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core


# 127592 29-Mar-2004 peter

Clean up the stub fake vnode locking implemenations. The main reason this
stuff was here (NFS) was fixed by Alfred in November. The only remaining
consumer of the stub functions was umapfs, which is horribly horribly
broken. It has missed out on about the last 5 years worth of maintenence
that was done on nullfs (from which umapfs is derived). It needs major
work to bring it up to date with the vnode locking protocol. umapfs really
needs to find a caretaker to bring it into the 21st century.

Functions GC'ed:
vop_noislocked, vop_nolock, vop_nounlock, vop_sharedlock.


# 126851 11-Mar-2004 phk

Remove unused second arg to vfinddev().
Don't call addaliasu() on VBLK nodes.


# 124147 05-Jan-2004 kan

Properly ifdef support for vfs locking assertions based on DEBUG_VFS_LOCKS.

Obtained from: bde


# 124146 05-Jan-2004 kan

Style fixes:
Remove double empty lines.
Add tab after #define's.
Properly terminate sentences in comments.

Obtained from: bde (mostly).


# 123932 28-Dec-2003 bde

v_vxproc was a bogus name for a thread (pointer).


# 122524 12-Nov-2003 rwatson

Modify the MAC Framework so that instead of embedding a (struct label)
in various kernel objects to represent security data, we embed a
(struct label *) pointer, which now references labels allocated using
a UMA zone (mac_label.c). This allows the size and shape of struct
label to be varied without changing the size and shape of these kernel
objects, which become part of the frozen ABI with 5-STABLE. This opens
the door for boot-time selection of the number of label slots, and hence
changes to the bound on the number of simultaneous labeled policies
at boot-time instead of compile-time. This also makes it easier to
embed label references in new objects as required for locking/caching
with fine-grained network stack locking, such as inpcb structures.

This change also moves us further in the direction of hiding the
structure of kernel objects from MAC policy modules, not to mention
dramatically reducing the number of '&' symbols appearing in both the
MAC Framework and MAC policy modules, and improving readability.

While this results in minimal performance change with MAC enabled, it
will observably shrink the size of a number of critical kernel data
structures for the !MAC case, and should have a small (but measurable)
performance benefit (i.e., struct vnode, struct socket) do to memory
conservation and reduced cost of zeroing memory.

NOTE: Users of MAC must recompile their kernel and all MAC modules as a
result of this change. Because this is an API change, third party
MAC modules will also need to be updated to make less use of the '&'
symbol.

Suggestions from: bmilekic
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 122091 05-Nov-2003 kan

Remove mntvnode_mtx and replace it with per-mountpoint mutex.
Introduce two new macros MNT_ILOCK(mp)/MNT_IUNLOCK(mp) to
operate on this mutex transparently.

Eventually new mutex will be protecting more fields in
struct mount, not only vnode list.

Discussed with: jeff


# 121441 23-Oct-2003 wollman

Add appropriate const poisoning to the assert_*locked() family so that I can
call ASSERT_VOP_LOCKED(vp, __func__) without a diagnostic.

Inspired by: the evil and rude OpenAFS cache manager code


# 120742 04-Oct-2003 jeff

- Document more of the vnode locking strategy.


# 118094 27-Jul-2003 phk

Add fdidx argument to vn_open() and vn_open_cred() and pass -1 throughout.


# 117365 09-Jul-2003 hsu

Replace custom field offset macro with the system __offsetof() macro.

Reviewed by: bde


# 115456 31-May-2003 phk

The IO_NOWDRAIN and B_NOWDRAIN hacks are no longer needed to prevent
deadlocks with vnode backed md(4) devices because md now uses a
kthread to run the bio requests instead of doing it directly from
the bio down path.


# 114061 26-Apr-2003 alc

Remove an unused declaration.


# 113303 09-Apr-2003 mike

Add prototypes for change_root() and change_dir().


# 111466 25-Feb-2003 mckusick

Prevent large files from monopolizing the system buffers. Keep
track of the number of dirty buffers held by a vnode. When a
bdwrite is done on a buffer, check the existing number of dirty
buffers associated with its vnode. If the number rises above
vfs.dirtybufthresh (currently 90% of vfs.hidirtybuffers), one
of the other (hopefully older) dirty buffers associated with
the vnode is written (using bawrite). In the event that this
approach fails to curb the growth in it the vnode's number of
dirty buffers (due to soft updates rollback dependencies),
the more drastic approach of doing a VOP_FSYNC on the vnode
is used. This code primarily affects very large and actively
written files such as snapshots. This change should eliminate
hanging when taking snapshots or doing background fsck on
very large filesystems.

Hopefully, one day it will be possible to cache filesystem
metadata in the VM cache as is done with file data. As it
stands, only the buffer cache can be used which limits total
metadata storage to about 20Mb no matter how much memory is
available on the system. This rather small memory gets badly
thrashed causing a lot of extra I/O. For example, taking a
snapshot of a 1Tb filesystem minimally requires about 35,000
write operations, but because of the cache thrashing (we only
have about 350 buffers at our disposal) ends up doing about
237,540 I/O's thus taking twenty-five minutes instead of four
if it could run entirely in the cache.

Reported by: Attila Nagy <bra@fsn.hu>
Sponsored by: DARPA & NAI Labs.


# 110584 09-Feb-2003 jeff

- Cleanup unlocked accesses to buf flags by introducing a new b_vflag member
that is protected by the vnode lock.
- Move B_SCANNED into b_vflags and call it BV_SCANNED.
- Create a vop_stdfsync() modeled after spec's sync.
- Replace spec_fsync, msdos_fsync, and hpfs_fsync with the stdfsync and some
fs specific processing. This gives all of these filesystems proper
behavior wrt MNT_WAIT/NOWAIT and the use of the B_SCANNED flag.
- Annotate the locking in buf.h


# 108399 29-Dec-2002 iedowse

Add a new vnode flag VI_DOINGINACT to indicate that a VOP_INACTIVE
call is in progress on the vnode. When vput() or vrele() sees a
1->0 reference count transition, it now return without any further
action if this flag is set. This flag is necessary to avoid recursion
into VOP_INACTIVE if the filesystem inactive routine causes the
reference count to increase and then drop back to zero. It is also
used to guarantee that an unlocked vnode will not be recycled while
blocked in VOP_INACTIVE().

There are at least two cases where the recursion can occur: one is
that the softupdates code called by ufs_inactive() via ffs_truncate()
can call vput() on the vnode. This has been reported by many people
as "lockmgr: draining against myself" panics. The other case is
that nfs_inactive() can call vget() and then vrele() on the vnode
to clean up a sillyrename file.

Reviewed by: mckusick (an older version of the patch)


# 108356 28-Dec-2002 dillon

Abstract-out the constants for the sequential heuristic.

No operational changes.

MFC after: 1 day


# 106057 27-Oct-2002 wollman

Change the way support for asynchronous I/O is indicated to applications
to conform to 1003.1-2001. Make it possible for applications to actually
tell whether or not asynchronous I/O is supported.

Since FreeBSD's aio implementation works on all descriptor types, don't
call down into file or vnode ops when [f]pathconf() is asked about
_PC_ASYNC_IO; this avoids the need for every file and vnode op to know about
it.


# 105902 24-Oct-2002 mckusick

Within ufs, the ffs_sync and ffs_fsync functions did not always
check for and/or report I/O errors. The result is that a VFS_SYNC
or VOP_FSYNC called with MNT_WAIT could loop infinitely on ufs in
the presence of a hard error writing a disk sector or in a filesystem
full condition. This patch ensures that I/O errors will always be
checked and returned. This patch also ensures that every call to
VFS_SYNC or VOP_FSYNC with MNT_WAIT set checks for and takes
appropriate action when an error is returned.

Sponsored by: DARPA & NAI Labs.


# 105020 13-Oct-2002 jeff

- Remove the do { } while(0) from the VOP lock assert macros. This was
not optimized away by the compiler in time for it to still leave the VOP
functions as inlines.

Submitted by: bde


# 104095 28-Sep-2002 phk

Don't use unnamed anonymous structs: give it a name.


# 104043 27-Sep-2002 phk

Rename struct specinfo to the more appropriate struct cdev.

Agreed on: jake, rwatson, jhb


# 103986 26-Sep-2002 jeff

- Move ASSERT_VOP_*LOCK* functionality into functions in vfs_subr.c
- Make the VI asserts more orthogonal to the rest of the asserts by using a
new, common vfs_badlock() function and adding a 'str' arg.
- Adjust generated ASSERTS to match the new prototype.
- Adjust explicit ASSERTS to match the new prototype.


# 103933 25-Sep-2002 jeff

- Lock down the syncer with sync_mtx.
- Enable vfs_badlock_mutex by default.
- Assert that the vp is locked in VOP_UNLOCK.
- Use standard interlock macros in remaining code.
- Correct a race in getnewvnode().
- Lock access to v_numoutput with interlock.
- Lock access to buf lists and splay tree with interlock.
- Add VOP and VI asserts.
- Lock b_vnbufs with the vnode interlock.
- Add vrefcnt() for callers who want to retreive the vnode ref without
holding a lock. Add a comment that describes when this is safe.
- Add vholdl() and vdropl() so that callers who already own the interlock
can avoid race conditions and unnecessary unlocking.
- Move the VOP_GETATTR() in vflush() into the WRITECLOSE conditional case.
- Hold the interlock before droping the mntlist_mtx in vflush() to avoid
a race.
- Fix locking in vfs_msync().


# 103926 24-Sep-2002 jeff

- Finish the struct vnode lock annotation.
- Order fields by what lock is required to access them.


# 103849 23-Sep-2002 jeff

- Include sys/ktr.h so that vnode_if.h can define trace points.


# 103314 14-Sep-2002 njl

Remove all use of vnode->v_tag, replacing with appropriate substitutes.
v_tag is now const char * and should only be used for debugging.

Additionally:
1. All users of VT_NTS now check vfsconf->vf_type VFCF_NETWORK
2. The user of VT_PROCFS now checks for the new flag VV_PROCDEP, which
is propagated by pseudofs to all child vnodes if the fs sets PFS_PROCDEP.

Suggested by: phk
Reviewed by: bde, rwatson (earlier version)


# 103187 10-Sep-2002 bde

Fixed namespace pollution in uma changes:
- use `struct uma_zone *' instead of uma_zone_t, so that <sys/uma.h> isn't
a prerequisite.
- don't include <sys/uma.h>.
Namespace pollution makes "opaque" types like uma_zone_t perfectly
non-opaque. Such types should never be used (see style(9)).

"Fixed" subsequently grown dependencies of this header on its own
pollution by polluting explicitly:
- include <sys/mutex.h> and its prerequisite <sys/lock.h> instead of
depending on namespace pollution 2 layers deep in <sys/uma.h>.


# 102779 01-Sep-2002 iedowse

Split out a number of mostly VFS and signal related syscalls into
a kernel-internal kern_*() version and a wrapper that is called via
the syscall vector table. For paths and structure pointers, the
internal version either takes a uio_seg parameter or requires the
caller to copyin() the data to kernel memory as appropiate. This
will permit emulation layers to use these syscalls without having
to copy out translated arguments to the stack gap.

Discussed on: -arch
Review/suggestions: bde, jhb, peter, marcel


# 102210 21-Aug-2002 jeff

- Add two new debugging macros: ASSERT_VI_LOCKED and ASSERT_VI_UNLOCKED
- Use the new VI asserts in place of the old mtx_assert checks.
- Add the VI asserts to the automated lock checking in the VOP calls. The
interlock should not be held across vops with a few exceptions.
- Add the vop_(un)lock_{pre,post} functions to assert that interlock is held
when LK_INTERLOCK is set.


# 101983 16-Aug-2002 rwatson

Make similar changes to fo_stat() and fo_poll() as made earlier to
fo_read() and fo_write(): explicitly use the cred argument to fo_poll()
as "active_cred" using the passed file descriptor's f_cred reference
to provide access to the file credential. Add an active_cred
argument to fo_stat() so that implementers have access to the active
credential as well as the file credential. Generally modify callers
of fo_stat() to pass in td->td_ucred rather than fp->f_cred, which
was redundantly provided via the fp argument. This set of modifications
also permits threads to perform these operations on behalf of another
thread without modifying their credential.

Trickle this change down into fo_stat/poll() implementations:

- badfo_poll(), badfo_stat(): modify/add arguments.
- kqueue_poll(), kqueue_stat(): modify arguments.
- pipe_poll(), pipe_stat(): modify/add arguments, pass active_cred to
MAC checks rather than td->td_ucred.
- soo_poll(), soo_stat(): modify/add arguments, pass fp->f_cred rather
than cred to pru_sopoll() to maintain current semantics.
- sopoll(): moidfy arguments.
- vn_poll(), vn_statfile(): modify/add arguments, pass new arguments
to vn_stat(). Pass active_cred to MAC and fp->f_cred to VOP_POLL()
to maintian current semantics.
- vn_close(): rename cred to file_cred to reflect reality while I'm here.
- vn_stat(): Add active_cred and file_cred arguments to vn_stat()
and consumers so that this distinction is maintained at the VFS
as well as 'struct file' layer. Pass active_cred instead of
td->td_ucred to MAC and to VOP_GETATTR() to maintain current semantics.

- fifofs: modify the creation of a "filetemp" so that the file
credential is properly initialized and can be used in the socket
code if desired. Pass ap->a_td->td_ucred as the active
credential to soo_poll(). If we teach the vnop interface about
the distinction between file and active credentials, we would use
the active credential here.

Note that current inconsistent passing of active_cred vs. file_cred to
VOP's is maintained. It's not clear why GETATTR would be authorized
using active_cred while POLL would be authorized using file_cred at
the file system level.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 101941 15-Aug-2002 rwatson

In order to better support flexible and extensible access control,
make a series of modifications to the credential arguments relating
to file read and write operations to cliarfy which credential is
used for what:

- Change fo_read() and fo_write() to accept "active_cred" instead of
"cred", and change the semantics of consumers of fo_read() and
fo_write() to pass the active credential of the thread requesting
an operation rather than the cached file cred. The cached file
cred is still available in fo_read() and fo_write() consumers
via fp->f_cred. These changes largely in sys_generic.c.

For each implementation of fo_read() and fo_write(), update cred
usage to reflect this change and maintain current semantics:

- badfo_readwrite() unchanged
- kqueue_read/write() unchanged
pipe_read/write() now authorize MAC using active_cred rather
than td->td_ucred
- soo_read/write() unchanged
- vn_read/write() now authorize MAC using active_cred but
VOP_READ/WRITE() with fp->f_cred

Modify vn_rdwr() to accept two credential arguments instead of a
single credential: active_cred and file_cred. Use active_cred
for MAC authorization, and select a credential for use in
VOP_READ/WRITE() based on whether file_cred is NULL or not. If
file_cred is provided, authorize the VOP using that cred,
otherwise the active credential, matching current semantics.

Modify current vn_rdwr() consumers to pass a file_cred if used
in the context of a struct file, and to always pass active_cred.
When vn_rdwr() is used without a file_cred, pass NOCRED.

These changes should maintain current semantics for read/write,
but avoid a redundant passing of fp->f_cred, as well as making
it more clear what the origin of each credential is in file
descriptor read/write operations.

Follow-up commits will make similar changes to other file descriptor
operations, and modify the MAC framework to pass both credentials
to MAC policy modules so they can implement either semantic for
revocation.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 101848 13-Aug-2002 rwatson

Move to a nested include of _label.h instead of mac.h in sys/sys/*.h
(Most of the places where mac.h was recursively included from another
kernel header file. net/netinet to follow.)

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
Suggested by: bde


# 101711 11-Aug-2002 rwatson

Introduce IO_NOMACCHECK, a flag that will be passed to vn_rdwr() to
indicate that the calling code has already performed necessary MAC
checks (if any) for this operation. This flag will help resolve
layering problems that existing because vn_rdwr() is called both
on behalf of user processes directly (such as in system calls of
various sorts, during core dumps, etc), as well as deep in the file
system code on behalf of the file system (such as in UFS, ext2fs,
etc). Code that is acting on behalf of a kernel service rather
than explicitly on behalf of a user process will specify this flag.
By default, MAC checks will be performed (and generally should
be performed).

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 101491 07-Aug-2002 jeff

- Adjust locking markup to match the proc markup.
- Add a comment about the current, unfinished, state of vnode locking.

Suggested by: bde


# 101368 05-Aug-2002 jeff

- Document more of the struct vnode locking protocol.
- Slightly reformat a comment block.


# 101308 04-Aug-2002 jeff

- Replace v_flag with v_iflag and v_vflag
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
clear.
- Many functions in vfs_subr.c were restructured to provide for stronger
locking.

Idea stolen from: BSD/OS


# 101039 31-Jul-2002 des

Introduce struct xvnode, which will be used instead of struct vnode for
sysctl purposes. Also add two fields to struct vnode, v_cachedfs and
v_cachedid, which hold the vnode's device and file id and are filled in
by vn_open_cred() and vn_stat().

Sponsored by: DARPA, NAI Labs


# 100983 30-Jul-2002 rwatson

Begin committing support for Mandatory Access Control and extensible
kernel access control. The MAC framework permits loadable kernel
modules to link to the kernel at compile-time, boot-time, or run-time,
and augment the system security policy. This commit includes the
initial kernel implementation, although the interface with the userland
components of the oeprating system is still under work, and not all
kernel subsystems are supported. Later in this commit sequence,
documentation of which kernel subsystems will not work correctly with
a kernel compiled with MAC support will be added.

Label vnodes, permitting security information to maintained at the
granularity of the individual file, directory (et al). This data is
protected by the vnode lock and may be read only when holding a shared
lock, or modified only when holding an exclusive lock. Label
information may be considered either the primary copy, or a cached
copy. Individual file systems or kernel services may use the
VCACHEDLABEL flag for accounting purposes to determine which it is.
New VOPs will be introduced to refresh this label on demand, or to
set the label value.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 100920 30-Jul-2002 jeff

- Add vfs_badlock_{print,panic} support to the remaining VOP_ASSERT_*
macros.


# 100868 29-Jul-2002 jeff

- Add VBAD to the list of vnodes that are ignored on locking operations.


# 100765 27-Jul-2002 rwatson

Reserve VCACHEDLABEL vnode flag for use by the TrustedBSD MAC
implementation. This flag will indicate that the security label
in the vnode is currently valid, and therefore doesn't need to
be refreshed before an access control decision can be made. Most
file systems (or stdvops) will set this flag after they load the
MAC label from disk the first time to prevent redundant disk I/O;
some file synthetic file systems (procfs, for example) may not.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 100480 22-Jul-2002 rwatson

Add VALLPERM, which is a mask of all the access control request permission
bits for vnodes passed to vaccess() and friends.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 100479 22-Jul-2002 rwatson

Sort vnode access mode flags.
Add flags VSTAT, VAPPEND required for TrustedBSD.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 100344 19-Jul-2002 mckusick

Add support to UFS2 to provide storage for extended attributes.
As this code is not actually used by any of the existing
interfaces, it seems unlikely to break anything (famous
last words).

The internal kernel interface to manipulate these attributes
is invoked using two new IO_ flags: IO_NORMAL and IO_EXT.
These flags may be specified in the ioflags word of VOP_READ,
VOP_WRITE, and VOP_TRUNCATE. Specifying IO_NORMAL means that
you want to do I/O to the normal data part of the file and
IO_EXT means that you want to do I/O to the extended attributes
part of the file. IO_NORMAL and IO_EXT are mutually exclusive
for VOP_READ and VOP_WRITE, but may be specified individually
or together in the case of VOP_TRUNCATE. For example, when
removing a file, VOP_TRUNCATE is called with both IO_NORMAL
and IO_EXT set. For backward compatibility, if neither IO_NORMAL
nor IO_EXT is set, then IO_NORMAL is assumed.

Note that the BA_ and IO_ flags have been `merged' so that they
may both be used in the same flags word. This merger is possible
by assigning the IO_ flags to the low sixteen bits and the BA_
flags the high sixteen bits. This works because the high sixteen
bits of the IO_ word is reserved for read-ahead and help with
write clustering so will never be used for flags. This merge
lets us get away from code of the form:

if (ioflags & IO_SYNC)
flags |= BA_SYNC;

For the future, I have considered adding a new field to the
vattr structure, va_extsize. This addition could then be
exported through the stat structure to allow applications to
find out the size of the extended attribute storage and also
would provide a more standard interface for truncating them
(via VOP_SETATTR rather than VOP_TRUNCATE).

I am also contemplating adding a pathconf parameter (for
concreteness, lets call it _PC_MAX_EXTSIZE) which would
let an application determine the maximum size of the extended
atribute storage.

Sponsored by: DARPA & NAI Labs.


# 100201 16-Jul-2002 mckusick

Change the name of st_createtime to st_birthtime. This change is
made to reduce confusion between st_ctime and st_createtime.

Submitted by: Eric Allman <eric@sendmail.org>
Sponsored by: DARPA & NAI Labs.


# 99737 10-Jul-2002 dillon

Replace the global buffer hash table with per-vnode splay trees using a
methodology similar to the vm_map_entry splay and the VM splay that Alan
Cox is working on. Extensive testing has appeared to have shown no
increase in overhead.

Disadvantages
Dirties more cache lines during lookups.

Not as fast as a hash table lookup (but still N log N and optimal
when there is locality of reference).

Advantages
vnode->v_dirtyblkhd is now perfectly sorted, making fsync/sync/filesystem
syncer operate more efficiently.

I get to rip out all the old hacks (some of which were mine) that tried
to keep the v_dirtyblkhd tailq sorted.

The per-vnode splay tree should be easier to lock / SMPng pushdown on
vnodes will be easier.

This commit along with another that Alan is working on for the VM page
global hash table will allow me to implement ranged fsync(), optimize
server-side nfs commit rpcs, and implement partial syncs by the
filesystem syncer (aka filesystem syncer would detect that someone is
trying to get the vnode lock, remembers its place, and skip to the
next vnode).

Note that the buffer cache splay is somewhat more complex then other splays
due to special handling of background bitmap writes (multiple buffers with
the same lblkno in the same vnode), and B_INVAL discontinuities between the
old hash table and the existence of the buffer on the v_cleanblkhd list.

Suggested by: alc


# 99692 09-Jul-2002 jeff

- Remove IS_LOCKING_VFS() all of our filesystems support locking now
- Add IGNORE_LOCK() that only ignores VCHR files for now since no one locks
their underlying device in the leaf filesystems. (devvp)
- Add prototypes for vop_lookup_{pre,post} that I forgot before.


# 99568 07-Jul-2002 jeff

- VT_PSEUDOFS and VT_PROCFS support locking now
- Remove VBLK from the list of vtypes that are ignored for locking ops.


# 99485 06-Jul-2002 jeff

- Add vop_strategy_pre to validate VOP_STRATEGY locking.
- Disable original vop_strategy lock specification.
- Switch to the new vop_strategy_pre for lock validation.

VOP_STRATEGY requires only that the buf is locked UNLESS the block numbers need
to be translated. There may be other reasons, but as long as the underlying
layer uses a VOP to perform the operations they will be caught later.


# 99483 06-Jul-2002 jeff

Add "vop_rename_pre" to do pre rename lock verification. This is enabled only
with DEBUG_VFS_LOCKS.


# 99426 05-Jul-2002 jeff

Cleanups for vnode lock debugging.
- Tell IS_LOCKING_VFS to ignore block and character devices. specfs vnodes
aren't locked for io and they just generate lots of false positives.
- Add newlines to the badlock prints.


# 99220 01-Jul-2002 iedowse

Use indirect function pointer hooks instead of #ifdef SOFTUPDATES
direct calls for the two places where the kernel calls into soft
updates code. Set up the hooks in softdep_initialize() and NULL
them out in softdep_uninitialize(). This change allows soft updates
to function correctly when ufs is loaded as a module.

Reviewed by: mckusick


# 98985 28-Jun-2002 jeff

Improve the VOP locking asserts

- Add vfs_badlock_print to control whether or not we print lock violations
- Add vfs_badlock_panic to control whether we panic on lock violations

Both default to on to mimic the original behavior if DEBUG_VFS_LOCKS is on.


# 98542 21-Jun-2002 mckusick

This commit adds basic support for the UFS2 filesystem. The UFS2
filesystem expands the inode to 256 bytes to make space for 64-bit
block pointers. It also adds a file-creation time field, an ability
to use jumbo blocks per inode to allow extent like pointer density,
and space for extended attributes (up to twice the filesystem block
size worth of attributes, e.g., on a 16K filesystem, there is space
for 32K of attributes). UFS2 fully supports and runs existing UFS1
filesystems. New filesystems built using newfs can be built in either
UFS1 or UFS2 format using the -O option. In this commit UFS1 is
the default format, so if you want to build UFS2 format filesystems,
you must specify -O 2. This default will be changed to UFS2 when
UFS2 proves itself to be stable. In this commit the boot code for
reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c)
as there is insufficient space in the boot block. Once the size of the
boot block is increased, this code can be defined.

Things to note: the definition of SBSIZE has changed to SBLOCKSIZE.
The header file <ufs/ufs/dinode.h> must be included before
<ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and
ufs_lbn_t.

Still TODO:
Verify that the first level bootstraps work for all the architectures.
Convert the utility ffsinfo to understand UFS2 and test growfs.
Add support for the extended attribute storage. Update soft updates
to ensure integrity of extended attribute storage. Switch the
current extended attribute interfaces to use the extended attribute
storage. Add the extent like functionality (framework is there,
but is currently never used).

Sponsored by: DARPA & NAI Labs.
Reviewed by: Poul-Henning Kamp <phk@freebsd.org>


# 96755 16-May-2002 trhodes

More s/file system/filesystem/g


# 95752 29-Apr-2002 rwatson

Since devfs now uses vnode locks, add devfs back to IS_LOCKING_VFS.


# 95481 26-Apr-2002 rwatson

Add UDF to the list of filesystems where locking assertions should be
evaluated.

Approved by: scottl


# 95479 26-Apr-2002 rwatson

1.43 (dfr 04-Apr-97): /*
1.43 (dfr 04-Apr-97): * [dfr] Kludge until I get around to fixing all the vfs locking.
1.43 (dfr 04-Apr-97): */

The new devfs doesn't support VFS locking. So don't do locking
assertions for devfs vnodes.

With this change, a kernel with options DEBUG_VFS_LOCKS actually
gets to single-user mode.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 94658 14-Apr-2002 scottl

Add a filesystem driver for the Universal Disk Format. For more info,
see http://people.freebsd.org/~scottl/udf

MFC after: when asmodai gets the backport done
Prodded by: phk asmodai des


# 92751 20-Mar-2002 jeff

Remove references to vm_zone.h and switch over to the new uma API.

Also, remove maxsockets. If you look carefully you'll notice that the old
zone allocator never honored this anyway.


# 92719 19-Mar-2002 alfred

Remove __P


# 92654 19-Mar-2002 jeff

This is the first part of the new kernel memory allocator. This replaces
malloc(9) and vm_zone with a slab like allocator.

Reviewed by: arch@


# 90860 18-Feb-2002 phk

Make v_addpollinfo() visible and non-inline.
Have callers only call it as needed.
Add necessary call in ufs_kqfilter().

Test-case found by: Andrew Gallatin <gallatin@cs.duke.edu>


# 90791 17-Feb-2002 phk

Move the stuff related to select and poll out of struct vnode.
The use of the zone allocator may or may not be overkill.
There is an XXX: over in ufs/ufs/ufs_vnops.c that jlemon may need
to revisit.

This shaves about 60 bytes of struct vnode which on my laptop means
600k less RAM used for vnodes.


# 90790 17-Feb-2002 phk

Collect the VN_KNOTE() macro definitions on vnode.h


# 90787 17-Feb-2002 phk

v_lease is unused, zap it.


# 90361 07-Feb-2002 julian

Pre-KSE/M3 commit.
this is a low-functionality change that changes the kernel to access the main
thread of a process via the linked list of threads rather than
assuming that it is embedded in the process. It IS still embeded there
but remove all teh code that assumes that in preparation for the next commit
which will actually move it out.

Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,


# 88318 20-Dec-2001 dillon

Fix a BUF_TIMELOCK race against BUF_LOCK and fix a deadlock in vget()
against VM_WAIT in the pageout code. Both fixes involve adjusting
the lockmgr's timeout capability so locks obtained with timeouts do not
interfere with locks obtained without a timeout.

Hopefully MFC: before the 4.5 release


# 88149 18-Dec-2001 dillon

This is a forward port of Peter's vlrureclaim() fix, with some minor mods
by me to make it more efficient. The original code had serious balancing
problems and could also deadlock easily. This code relegates the vnode
reclamation to its own kproc and relaxes the vnode reclamation requirements
to better maintain kern.maxvnodes. This code still doesn't balance as well
as it could, but it does a much better job then the original code.

Approved by: re@freebsd.org
Obtained from: ps, peter, dillon
MFS Assuming: Assuming no problems crop up in Yahoo testing
MFC after: 7 days


# 86278 11-Nov-2001 alfred

turn vn_open() into a wrapper around vn_open_cred() which allows
one to perform a vn_open using temporary/other/fake credentials.

Modify the nfs client side locking code to use vn_open_cred() passing
proc0's ucred instead of the old way which was to temporary raise
privs while running vn_open(). This should close the race hopefully.


# 86089 05-Nov-2001 dillon

Implement IO_NOWDRAIN and B_NOWDRAIN - prevents the buffer cache from blocking
in wdrain during a write. This flag needs to be used in devices whos
strategy routines turn-around and issue another high level I/O, such as
when MD turns around and issues a VOP_WRITE to vnode backing store, in order
to avoid deadlocking the dirty buffer draining code.

Remove a vprintf() warning from MD when the backing vnode is found to be
in-use. The syncer of buf_daemon could be flushing the backing vnode at
the time of an MD operation so the warning is not correct.

MFC after: 1 week


# 85606 27-Oct-2001 dillon

syncdelay, filedelay, dirdelay, metadelay are ints, not time_t's,
and can also be made static.


# 85517 25-Oct-2001 dillon

Implement kern.maxvnodes. adjusting kern.maxvnodes now actually has a
real effect.

Optimize vfs_msync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. Improves looping case by 500%.

Optimize ffs_sync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. This makes a couple of assumptions,
which I believe are ok, in regards to vnode stability when the mount list
mutex is held. Improves looping case by 500%.

(more optimization work is needed on top of these fixes)

MFC after: 1 week


# 85339 22-Oct-2001 dillon

Change the vnode list under the mount point from a LIST to a TAILQ
in preparation for an implementation of limiting code for kern.maxvnodes.

MFC after: 3 days


# 85287 21-Oct-2001 des

Convert textvp_fullpath() into the more generic vn_fullpath() which takes a
struct thread * and a struct vnode * instead of a struct proc *.

Temporarily add a textvp_fullpath macro for compatibility.


# 84249 01-Oct-2001 dillon

After extensive testing it has been determined that adding complexity
to avoid removing higher level directory vnodes from the namecache has
no perceivable effect and will be removed. This is especially true
when vmiodirenable is turned on, which it is by default now. ( vmiodirenable
makes a huge difference in directory caching ). The vfs.vmiodirenable and
vfs.nameileafonly sysctls have been left in to allow further testing, but
I expect to rip out vfs.nameileafonly soon too.

I have also determined through testing that the real problem with numvnodes
getting too large is due to the VM Page cache preventing the vnode from
being reclaimed. The directory stuff made only a tiny dent relative
to Poul's original code, enough so that some tests succeeded. But tests
with several million small files show that the bigger problem is the VM Page
cache. This will have to be addressed by a future commit.

MFC after: 3 days


# 83421 13-Sep-2001 obrien

Re-apply rev 1.178 -- style(9) the structure definitions.
I have to wonder how many other changes were lost in the KSE mildstone 2 merge.


# 83366 12-Sep-2001 julian

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 83222 08-Sep-2001 dillon

This brings in a Yahoo coredump patch from Paul, with additional mods by
me (addition of vn_rdwr_inchunks). The problem Yahoo is solving is that
if you have large process images core dumping, or you have a large number of
forked processes all core dumping at the same time, the original coredump code
would leave the vnode locked throughout. This can cause the directory vnode
to get locked up, which can cause the parent directory vnode to get locked
up, and so on all the way to the root node, locking the entire machine up
for extremely long periods of time.

This patch solves the problem in two ways. First it uses an advisory
non-blocking lock to abort multiple processes trying to core to the same
file. Second (my contribution) it chunks up the writes and uses bwillwrite()
to avoid holding the vnode locked while blocking in the buffer cache.

Submitted by: ps
Reviewed by: dillon
MFC after: 2 weeks


# 82395 27-Aug-2001 peter

If a file has been completely unlinked, stop automatically syncing the
file. ffs will discard any pending dirty pages when it is closed,
so we may as well not waste time trying to clean them. This doesn't
stop other things from writing it out, eg: pageout, fsync(2) etc.


# 77435 29-May-2001 phk

Remove MFS


# 77115 24-May-2001 dillon

This patch implements O_DIRECT about 80% of the way. It takes a patchset
Tor created a while ago, removes the raw I/O piece (that has cache coherency
problems), and adds a buffer cache / VM freeing piece.

Essentially this patch causes O_DIRECT I/O to not be left in the cache, but
does not prevent it from going through the cache, hence the 80%. For
the last 20% we need a method by which the I/O can be issued directly to
buffer supplied by the user process and bypass the buffer cache entirely,
but still maintain cache coherency.

I also have the code working under -stable but the changes made to sys/file.h
may not be MFCable, so an MFC is not on the table yet.

Submitted by: tegge, dillon


# 76688 16-May-2001 iedowse

Change the second argument of vflush() to an integer that specifies
the number of references on the filesystem root vnode to be both
expected and released. Many filesystems hold an extra reference on
the filesystem root vnode, which must be accounted for when
determining if the filesystem is busy and then released if it isn't
busy. The old `skipvp' approach required individual filesystem
xxx_unmount functions to re-implement much of vflush()'s logic to
deal with the root vnode.

All 9 filesystems that hold an extra reference on the root vnode
got the logic wrong in the case of forced unmounts, so `umount -f'
would always fail if there were any extra root vnode references.
Fix this issue centrally in vflush(), now that we can.

This commit also fixes a vnode reference leak in devfs, which could
result in idle devfs filesystems that refuse to unmount.

Reviewed by: phk, bp


# 76167 01-May-2001 phk

Implement vop_std{get|put}pages() and add them to the default vop[].

Un-copy&paste all the VOP_{GET|PUT}PAGES() functions which do nothing but
the default.


# 76166 01-May-2001 markm

Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by: bde (with reservations)


# 76131 29-Apr-2001 phk

Add a vop_stdbmap(), and make it part of the default vop vector.

Make 7 filesystems which don't really know about VOP_BMAP rely
on the default vector, rather than more or less complete local
vop_nopbmap() implementations.


# 75830 22-Apr-2001 obrien

Removed old version of vaccess_acl_posix1e() that snuck back in rev 1.146.

Submitted by (with good eye): Niels Chr. Bank-Pedersen <ncbp@bank-pedersen.dk>


# 75818 21-Apr-2001 obrien

Style(9) fixes:
* get rid of space (0x20) before tab (^I)
* indent with ^I, not 0x20
* continuation line for prototypes is for 0x20's past function's name col.
* etc.


# 75654 18-Apr-2001 tanimura

Reclaim directory vnodes held in namecache if few free vnodes are
available.

Only directory vnodes holding no child directory vnodes held in
v_cache_src are recycled, so that directory vnodes near the root of
the filesystem hierarchy remain in namecache and directory vnodes are
not reclaimed in cascade.

The period of vnode reclaiming attempt and the number of vnodes
attempted to reclaim can be tuned via sysctl(2).

Suggested by: tegge
Approved by: phk


# 75580 17-Apr-2001 phk

This patch removes the VOP_BWRITE() vector.

VOP_BWRITE() was a hack which made it possible for NFS client
side to use struct buf with non-bio backing.

This patch takes a more general approach and adds a bp->b_op
vector where more methods can be added.

The success of this patch depends on bp->b_op being initialized
all relevant places for some value of "relevant" which is not
easy to determine. For now the buffers have grown a b_magic
element which will make such issues a tiny bit easier to debug.


# 75571 17-Apr-2001 rwatson

In my first reading of POSIX.1e, I misinterpreted handling of the
ACL_USER_OBJ and ACL_GROUP_OBJ fields, believing that modification of the
access ACL could be used by privileged processes to change file/directory
ownership. In fact, this is incorrect; ACL_*_OBJ (+ ACL_MASK and
ACL_OTHER) should have undefined ae_id fields; this commit attempts
to correct that misunderstanding.

o Modify arguments to vaccess_acl_posix1e() to accept the uid and gid
associated with the vnode, as those can no longer be extracted from
the ACL passed as an argument. Perform all comparisons against
the passed arguments. This actually has the effect of simplifying
a number of components of this call, as well as reducing the indent
level, but now seperates handling of ACL_GROUP_OBJ from ACL_GROUP.

o Modify acl_posix1e_check() to return EINVAL if the ae_id field of
any of the ACL_{USER_OBJ,GROUP_OBJ,MASK,OTHER} entries is a value
other than ACL_UNDEFINED_ID. As a temporary work-around to allow
clean upgrades, set the ae_id field to ACL_UNDEFINED_ID before
each check so that this cannot cause a failure in the short term
(this work-around will be removed when the userland libraries and
utilities are updated to take this change into account).

o Modify ufs_sync_acl_from_inode() so that it forces
ACL_{USER_OBJ,GROUP_OBJ,MASK,OTHER} ae_id fields to ACL_UNDEFINED_ID
when synchronizing the ACL from the inode.

o Modify ufs_sync_inode_from_acl to not propagate uid and gid
information to the inode from the ACL during ACL update. Also
modify the masking of permission bits that may be set from
ALLPERMS to (S_IRWXU|S_IRWXG|S_IRWXO), as ACLs currently do not
carry none-ACCESSPERMS (S_ISUID, S_ISGID, S_ISTXT).

o Modify ufs_getacl() so that when it emulates an access ACL from
the inode, it initializes the ae_id fields to ACL_UNDEFINED_ID.

o Clean up ufs_setacl() substantially since it is no longer possible
to perform chown/chgrp operations using vop_setacl(), so all the
access control for that can be eliminated.

o Modify ufs_access() so that it passes owner uid and gid information
into vaccess_acl_posix1e().

Pointed out by: jedger
Obtained from: TrustedBSD Project


# 75478 13-Apr-2001 bp

Move VT_SMBFS definition to the proper place. Undefine VI_LOCK/VI_UNLOCK.


# 74962 28-Mar-2001 des

Prepare for pseudofs.


# 74437 19-Mar-2001 rwatson

o Rename "namespace" argument to "attrnamespace" as namespace is a C++
reserved word.

Submitted by: jkh
Obtained from: TrustedBSD Project


# 74273 15-Mar-2001 rwatson

o Change the API and ABI of the Extended Attribute kernel interfaces to
introduce a new argument, "namespace", rather than relying on a first-
character namespace indicator. This is in line with more recent
thinking on EA interfaces on various mailing lists, including the
posix1e, Linux acl-devel, and trustedbsd-discuss forums. Two namespaces
are defined by default, EXTATTR_NAMESPACE_SYSTEM and
EXTATTR_NAMESPACE_USER, where the primary distinction lies in the
access control model: user EAs are accessible based on the normal
MAC and DAC file/directory protections, and system attributes are
limited to kernel-originated or appropriately privileged userland
requests.

o These API changes occur at several levels: the namespace argument is
introduced in the extattr_{get,set}_file() system call interfaces,
at the vnode operation level in the vop_{get,set}extattr() interfaces,
and in the UFS extended attribute implementation. Changes are also
introduced in the VFS extattrctl() interface (system call, VFS,
and UFS implementation), where the arguments are modified to include
a namespace field, as well as modified to advoid direct access to
userspace variables from below the VFS layer (in the style of recent
changes to mount by adrian@FreeBSD.org). This required some cleanup
and bug fixing regarding VFS locks and the VFS interface, as a vnode
pointer may now be optionally submitted to the VFS_EXTATTRCTL()
call. Updated documentation for the VFS interface will be committed
shortly.

o In the near future, the auto-starting feature will be updated to
search two sub-directories to the ".attribute" directory in appropriate
file systems: "user" and "system" to locate attributes intended for
those namespaces, as the single filename is no longer sufficient
to indicate what namespace the attribute is intended for. Until this
is committed, all attributes auto-started by UFS will be placed in
the EXTATTR_NAMESPACE_SYSTEM namespace.

o The default POSIX.1e attribute names for ACLs and Capabilities have
been updated to no longer include the '$' in their filename. As such,
if you're using these features, you'll need to rename the attribute
backing files to the same names without '$' symbols in front.

o Note that these changes will require changes in userland, which will
be committed shortly. These include modifications to the extended
attribute utilities, as well as to libutil for new namespace
string conversion routines. Once the matching userland changes are
committed, a buildworld is recommended to update all the necessary
include files and verify that the kernel and userland environments
are in sync. Note: If you do not use extended attributes (most people
won't), upgrading is not imperative although since the system call
API has changed, the new userland extended attribute code will no longer
compile with old include files.

o Couple of minor cleanups while I'm there: make more code compilation
conditional on FFS_EXTATTR, which should recover a bit of space on
kernels running without EA's, as well as update copyright dates.

Obtained from: TrustedBSD Project


# 73890 06-Mar-2001 rwatson

o Introduce filesystem-independent POSIX.1e ACL utility routines to
support implementations of ACLs in file systems. Introduce the
following new functions:

vaccess_acl_posix1e() vaccess() that accepts an ACL
acl_posix1e_mode_to_perm() Convert mode bits to ACL rights
acl_posix1e_mode_to_entry() Build ACL entry from mode/uid/gid
acl_posix1e_perms_to_mode() Generate file mode from ACL
acl_posix1e_check() Syntax verification for ACL

These functions allow a file system to rely on central ACL evaluation
and syntax checking, as well as providing useful utilities to
allow ACL-based file systems to generate mode/owner/etc information
to return via VOP_GETATTR(), and to support file systems that split
their ACL information over their existing inode storage (mode, uid,
gid) and extended ACL into extended attributes (additional users,
groups, ACL mask).

o Add prototypes for exported functions to sys/acl.h, sys/vnode.h

Reviewed by: trustedbsd-discuss, freebsd-arch
Obtained from: TrustedBSD Project


# 72794 21-Feb-2001 bp

Add VI_LOCK(), VI_TRYLOCK() and VI_UNLOCK() macros to isolate implementation
details of v_interlock.

Reviewed by: jhb, phk, arch@


# 71576 24-Jan-2001 jasone

Convert all simplelocks to mutexes and remove the simplelock implementations.


# 70834 09-Jan-2001 wollman

select() DKI is now in <sys/selinfo.h>.


# 70374 26-Dec-2000 dillon

This implements a better launder limiting solution. There was a solution
in 4.2-REL which I ripped out in -stable and -current when implementing the
low-memory handling solution. However, maxlaunder turns out to be the saving
grace in certain very heavily loaded systems (e.g. newsreader box). The new
algorithm limits the number of pages laundered in the first pageout daemon
pass. If that is not sufficient then suceessive will be run without any
limit.

Write I/O is now pipelined using two sysctls, vfs.lorunningspace and
vfs.hirunningspace. This prevents excessive buffered writes in the
disk queues which cause long (multi-second) delays for reads. It leads
to more stable (less jerky) and generally faster I/O streaming to disk
by allowing required read ops (e.g. for indirect blocks and such) to occur
without interrupting the write stream, amoung other things.

NOTE: eventually, filesystem write I/O pipelining needs to be done on a
per-device basis. At the moment it is globalized.


# 68885 18-Nov-2000 dillon

Implement a low-memory deadlock solution.

Removed most of the hacks that were trying to deal with low-memory
situations prior to now.

The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.

Code has been added to stall in a low-memory situation prior to a vnode
being locked.

Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.

Implement a number of VFS/BIO fixes

(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.

In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.

Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.

In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.

There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.

Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>


# 67365 20-Oct-2000 jhb

Catch up to moving headers:
- machine/ipl.h -> sys/ipl.h
- machine/mutex.h -> sys/mutex.h


# 67309 19-Oct-2000 rwatson

o Introduce new VOP_ACCESS() flag VADMIN, allowing file systems to perform
"administrative" authorization checks. In most cases, the VADMIN test
checks to make sure the credential effective uid is the same as the file
owner.
o Modify vaccess() to set VADMIN as an available right if the uid is
appropriate.
o Modify references to uid-based access control operations such that they
now always invoke VOP_ACCESS() instead of using hard-coded policy checks.
o This allows alternative UFS policies to be implemented by replacing only
ufs_access() (such as mandatory system policies).
o VOP_ACCESS() requires the caller to hold an exclusive vnode lock on the
vnode: I believe that new invocations of VOP_ACCESS() are always called
with the lock held.
o Some direct checks of the uid remain, largely associated with the QUOTA
and SUIDDIR code.

Reviewed by: eivind
Obtained from: TrustedBSD Project


# 66615 03-Oct-2000 jasone

Convert lockmgr locks from using simple locks to using mutexes.

Add lockdestroy() and appropriate invocations, which corresponds to
lockinit() and must be called to clean up after a lockmgr lock is no
longer needed.


# 66355 25-Sep-2000 bp

Add a lock structure to vnode structure. Previously it was either allocated
separately (nfs, cd9660 etc) or keept as a first element of structure
referenced by v_data pointer(ffs). Such organization leads to known problems
with stacked filesystems.

From this point vop_no*lock*() functions maintain only interlock lock.
vop_std*lock*() functions maintain built-in v_lock structure using lockmgr().
vop_sharedlock() is compatible with vop_stdunlock(), but maintains a shared
lock on vnode.

If filesystem wishes to export lockmgr compatible lock, it can put an address
of this lock to v_vnlock field. This indicates that the upper filesystem
can take advantage of it and use single lock structure for entire (or part)
of stack of vnodes. This field shouldn't be examined or modified by VFS code
except for initialization purposes.

Reviewed in general by: mckusick


# 66272 22-Sep-2000 rwatson

o Introduce vn_extattr_rm(), a helper function in the style of
vn_extattr_get() and vn_extattr_set(). vn_extattr_rm() removes the
specified extended attribute from a vnode, authorizing the change as
the kernel (NULL cred).

Obtained from: TrustedBSD Project


# 66243 22-Sep-2000 eivind

Remove addalias() prototype (staticized in kern/vfs_subr.c)


# 65770 12-Sep-2000 bp

Add three new VOPs: VOP_CREATEVOBJECT, VOP_DESTROYVOBJECT and VOP_GETVOBJECT.
They will be used by nullfs and other stacked filesystems to support full
cache coherency.

Reviewed in general by: mckusick, dillon


# 65492 05-Sep-2000 phk

Move extern declaration of dead_vnodeop_p to a .h file.

Remove race condition in vn_isdisk().


# 65200 29-Aug-2000 rwatson

o Restructure vaccess() so as to check for DAC permission to modify the
object before falling back on privilege. Make vaccess() accept an
additional optional argument, privused, to determine whether
privilege was required for vaccess() to return 0. Add commented
out capability checks for reference. Rename some variables to make
it more clear which modes/uids/etc are associated with the object,
and which with the access mode.
o Update file system use of vaccess() to pass NULL as the optional
privused argument. Once additional patches are applied, suser()
will no longer set ASU, so privused will permit passing of
privilege information up the stack to the caller.

Reviewed by: bde, green, phk, -security, others
Obtained from: TrustedBSD Project


# 64865 20-Aug-2000 phk

Centralize the canonical vop_access user/group/other check in vaccess().

Discussed with: bde


# 64819 18-Aug-2000 phk

Introduce vop_stdinactive() and make it the default if no vop_inactive
is declared.

Sort and prune a few vop_op[].


# 64405 08-Aug-2000 rwatson

o Introduce vn_extattr_{get,set}, wrapper routines for VOP_GETEXTATTR
and VOP_SETEXTATTR to simplify calling from in-kernel consumers,
such as capability code. Both accept a vnode (optionally locked,
with ioflg to indicate that), attribute name, and a buffer + buffer
length in UIO_SYSSPACE. Both authorize the call as a kernel request,
with cred set to NULL for the actual VOP_ calls.

Obtained from: TrustedBSD Project


# 63788 24-Jul-2000 mckusick

This patch corrects the first round of panics and hangs reported
with the new snapshot code.

Update addaliasu to correctly implement the semantics of the old
checkalias function. When a device vnode first comes into existence,
check to see if an anonymous vnode for the same device was created
at boot time by bdevvp(). If so, adopt the bdevvp vnode rather than
creating a new vnode for the device. This corrects a problem which
caused the kernel to panic when taking a snapshot of the root
filesystem.

Change the calling convention of vn_write_suspend_wait() to be the
same as vn_start_write().

Split out softdep_flushworklist() from softdep_flushfiles() so that
it can be used to clear the work queue when suspending filesystem
operations.

Access to buffers becomes recursive so that snapshots can recursively
traverse their indirect blocks using ffs_copyonwrite() when checking
for the need for copy on write when flushing one of their own indirect
blocks. This eliminates a deadlock between the syncer daemon and a
process taking a snapshot.

Ensure that softdep_process_worklist() can never block because of a
snapshot being taken. This eliminates a problem with buffer starvation.

Cleanup change in ffs_sync() which did not synchronously wait when
MNT_WAIT was specified. The result was an unclean filesystem panic
when doing forcible unmount with heavy filesystem I/O in progress.

Return a zero'ed block when reading a block that was not in use at
the time that a snapshot was taken. Normally, these blocks should
never be read. However, the readahead code will occationally read
them which can cause unexpected behavior.

Clean up the debugging code that ensures that no blocks be written
on a filesystem while it is suspended. Snapshots must explicitly
label the blocks that they are writing during the suspension so that
they do not cause a `write on suspended filesystem' panic.

Reorganize ffs_copyonwrite() to eliminate a deadlock and also to
prevent a race condition that would permit the same block to be
copied twice. This change eliminates an unexpected soft updates
inconsistency in fsck caused by the double allocation.

Use bqrelse rather than brelse for buffers that will be needed
soon again by the snapshot code. This improves snapshot performance.


# 62976 11-Jul-2000 mckusick

Add snapshots to the fast filesystem. Most of the changes support
the gating of system calls that cause modifications to the underlying
filesystem. The gating can be enabled by any filesystem that needs
to consistently suspend operations by adding the vop_stdgetwritemount
to their set of vnops. Once gating is enabled, the function
vfs_write_suspend stops all new write operations to a filesystem,
allows any filesystem modifying system calls already in progress
to complete, then sync's the filesystem to disk and returns. The
function vfs_write_resume allows the suspended write operations to
begin again. Gating is not added by default for all filesystems as
for SMP systems it adds two extra locks to such critical kernel
paths as the write system call. Thus, gating should only be added
as needed.

Details on the use and current status of snapshots in FFS can be
found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness
is not included here. Unless and until you create a snapshot file,
these changes should have no effect on your system (famous last words).


# 62552 04-Jul-2000 mckusick

Simplify and rationalise the management of the vnode free list
(preparing the code to add snapshots).


# 62550 04-Jul-2000 mckusick

Move the truncation code out of vn_open and into the open system call
after the acquisition of any advisory locks. This fix corrects a case
in which a process tries to open a file with a non-blocking exclusive
lock. Even if it fails to get the lock it would still truncate the
file even though its open failed. With this change, the truncation
is done only after the lock is successfully acquired.

Obtained from: BSD/OS


# 60938 26-May-2000 jake

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


# 60833 23-May-2000 jake

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


# 59652 26-Apr-2000 green

Move procfs_fullpath() to vfs_cache.c, with a rename to textvp_fullpath().
There's no excuse to have code in synthetic filestores that allows direct
references to the textvp anymore.

Feature requested by: msmith
Feature agreed to by: warner
Move requested by: phk
Move agreed to by: bde


# 59480 22-Apr-2000 green

Move the declaration of "struct namecache" to vnode.h, as it can be useful
elsewhere. Note, of course, that in an ideal world nothing should need
to see our VFS implementation :-/


# 58909 01-Apr-2000 dillon

Change the write-behind code to take more care when starting
async I/O's. The sequential read heuristic has been extended to
cover writes as well. We continue to call cluster_write() normally,
thus blocks in the file will still be reallocated for large (but still
random) I/O's, but I/O will only be initiated for truely sequential
writes.

This solves a number of annoying situations, especially with DBM (hash
method) writes, and also has the side effect of fixing a number of
(stupid) benchmarks.

Reviewed-by: mckusick


# 56949 02-Feb-2000 rwatson

Remove static qualifier from vgonel, as it is needed by the Arla folk
outside of vfs_subr.c.

Submitted by: Assar Westerlund <assar@sics.se>
Reviewed by: rwatson
Approved by: jkh


# 56033 15-Jan-2000 bp

Add VT_NWFS tag.


# 55756 10-Jan-2000 phk

Give vn_isdisk() a second argument where it can return a suitable errno.

Suggested by: bde


# 55205 29-Dec-1999 peter

Change #ifdef KERNEL to #ifdef _KERNEL in the public headers. "KERNEL"
is an application space macro and the applications are supposed to be free
to use it as they please (but cannot). This is consistant with the other
BSD's who made this change quite some time ago. More commits to come.


# 54803 19-Dec-1999 rwatson

Second pass commit to introduce new ACL and Extended Attribute system
calls, vnops, vfsops, both in /kern, and to individual file systems that
require a vfsop_ array entry.

Reviewed by: eivind


# 54533 13-Dec-1999 alfred

explain that ioflags can be used to give read-ahead hints to the underlying
filesystem.


# 54444 11-Dec-1999 eivind

Lock reporting and assertion changes.
* lockstatus() and VOP_ISLOCKED() gets a new process argument and a new
return value: LK_EXCLOTHER, when the lock is held exclusively by another
process.
* The ASSERT_VOP_(UN)LOCKED family is extended to use what this gives them
* Extend the vnode_if.src format to allow more exact specification than
locked/unlocked.

This commit should not do any semantic changes unless you are using
DEBUG_VFS_LOCKS.

Discussed with: grog, mch, peter, phk
Reviewed by: peter


# 54372 09-Dec-1999 semenu

Added VT_HPFS vnode type.


# 51926 04-Oct-1999 phk

Move the buffered read/write code out of spec_{read|write} and into
two new functions spec_buf{read|write}.

Add sysctl vfs.bdev_buffered which defaults to 1 == true. This
sysctl can be used to experimentally turn buffered behaviour for
bdevs off. I should not be changed while any blockdevices are
open. Remove the misplaced sysctl vfs.enable_userblk_io.

No other changes in behaviour.


# 51797 29-Sep-1999 phk

Remove v_maxio from struct vnode.

Replace it with mnt_iosize_max in struct mount.

Nits from: bde


# 51488 20-Sep-1999 dillon

Final commit to remove vnode->v_lastr. vm_fault now handles read
clustering issues (replacing code that used to be in
ufs/ufs/ufs_readwrite.c). vm_fault also now uses the new VM page counter
inlines.

This completes the changeover from vnode->v_lastr to vm_entry_t->v_lastr
for VM, and fp->f_nextread and fp->f_seqcount (which have been in the
tree for a while). Determination of the I/O strategy (sequential, random,
and so forth) is now handled on a descriptor-by-descriptor basis for
base I/O calls, and on a memory-region-by-memory-region and
process-by-process basis for VM faults.

Reviewed by: David Greenman <dg@root.com>, Alan Cox <alc@cs.rice.edu>


# 51345 17-Sep-1999 dillon

Add vfs.enable_userblk_io sysctl to control whether user reads and writes
to buffered block devices are allowed. The default is to be backwards
compatible, i.e. reads and writes are allowed.

The idea is for a larger crowd to start running with this disabled and
see what problems, if any, crop up, and then to change the default to
off and see if any problems crop up in the next 6 months prior to
potentially removing support entirely. There are still a few people,
Julian and myself included, who believe the buffered block device
access from usermode to be useful.

Remove use of vnode->v_lastr from buffered block device I/O in
preparation for removal of vnode->v_lastr field, replacing it with
the already existing seqcount metric to detect sequential operation.

Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>


# 50477 27-Aug-1999 peter

$Id$ -> $FreeBSD$


# 50405 26-Aug-1999 phk

Simplify the handling of VCHR and VBLK vnodes using the new dev_t:

Make the alias list a SLIST.

Drop the "fast recycling" optimization of vnodes (including
the returning of a prexisting but stale vnode from checkalias).
It doesn't buy us anything now that we don't hardlimit
vnodes anymore.

Rename checkalias2() and checkalias() to addalias() and
addaliasu() - which takes dev_t and udev_t arg respectively.

Make the revoke syscalls use vcount() instead of VALIASED.

Remove VALIASED flag, we don't need it now and it is faster
to traverse the much shorter lists than to maintain the
flag.

vfs_mountedon() can check the dev_t directly, all the vnodes
point to the same one.

Print the devicename in specfs/vprint().

Remove a couple of stale LFS vnode flags.

Remove unimplemented/unused LK_DRAINED;


# 50347 25-Aug-1999 phk

Introduce vn_isdisk(struct vnode *vp) function, and use it to test for diskness.


# 50334 25-Aug-1999 julian

Make DEVFS use PHK's specinfo struct as the source of dev_t and devsw.

In lookup() however it's the other way around as we need to supply the
dev_t for the vnode, so devfs still has a copy of it stashed away.

Sourcing it from the vnode in the vnops however is useful as it makes
a lot of the code almost the same as that in specfs.


# 50137 21-Aug-1999 jdp

Support full-precision file timestamps. Until now, only the seconds
have been maintained, and that is still the default. A new sysctl
variable "vfs.timestamp_precision" can be used to enable higher
levels of precision:

0 = seconds only; nanoseconds zeroed (default).
1 = seconds and nanoseconds, accurate within 1/HZ.
2 = seconds and nanoseconds, truncated to microseconds.
>=3 = seconds and nanoseconds, maximum precision.

Level 1 uses getnanotime(), which is fast but can be wrong by up
to 1/HZ. Level 2 uses microtime(). It might be desirable for
consistency with utimes() and friends, which take timeval structures
rather than timespecs. Level 3 uses nanotime() for the higest
precision.

I benchmarked levels 0, 1, and 3 by copying a 550 MB tree with
"cpio -pdu". There was almost negligible difference in the system
times -- much less than 1%, and less than the variation among
multiple runs at the same level. Bruce Evans dreamed up a torture
test involving 1-byte reads with intervening fstat() calls, but
the cpio test seems more realistic to me.

This feature is currently implemented only for the UFS (FFS and
MFS) filesystems. But I think it should be easy to support it in
the others as well.

An earlier version of this was reviewed by Bruce. He's not to
blame for any breakage I've introduced since then.

Reviewed by: bde (an earlier version of the code)


# 49678 13-Aug-1999 phk

s/v_specinfo/v_rdev/


# 49535 08-Aug-1999 phk

Decommision miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>,
a few lines into <sys/vnode.h>.

Add a few fields to struct specinfo, paving the way for the fun part.


# 49101 26-Jul-1999 alc

Add sysctl and support code to allow directories to be VMIO'd. The default
setting for the sysctl is OFF, which is the historical operation.

Submitted by: dillon


# 48936 20-Jul-1999 phk

Now a dev_t is a pointer to struct specinfo which is shared by all specdev
vnodes referencing this device.

Details:
cdevsw->d_parms has been removed, the specinfo is available
now (== dev_t) and the driver should modify it directly
when applicable, and the only driver doing so, does so:
vn.c. I am not sure the logic in checking for "<" was right
before, and it looks even less so now.

An intial pool of 50 struct specinfo are depleted during
early boot, after that malloc had better work. It is
likely that fewer than 50 would do.

Hashing is done from udev_t to dev_t with a prime number
remainder hash, experiments show no better hash available
for decent cost (MD5 is only marginally better) The prime
number used should not be close to a power of two, we use
83 for now.

Add new checkalias2() to get around the loss of info from
dev2udev() in bdevvp();

The aliased vnodes are hung on a list straight of the dev_t,
and speclisth[SPECSZ] is unused. The sharing of struct
specinfo means that the v_specnext moves into the vnode
which grows by 4 bytes.

Don't use a VBLK dev_t which doesn't make sense in MFS, now
we hang a dummy cdevsw on B/Cmaj 253 so that things look sane.

Storage overhead from all of this is O(50k).

Bump __FreeBSD_version to 400009

The next step will add the stuff needed so device-drivers can start to
hang things from struct specinfo


# 48884 18-Jul-1999 phk

Introduce the vn_todev(struct vnode*) function, which returns the dev_t
corresponding to a VBLK or VCHR node, or NODEV.


# 48312 28-Jun-1999 phk

make va_fsid be of type udev_t


# 47940 15-Jun-1999 mckusick

Get rid of the global variable rushjob and replace it with a function in
kern/vfs_subr.c named speedup_syncer() which handles the speedup request.
Change the various clients of rushjob to use the new function.


# 47028 11-May-1999 phk

Divorce "dev_t" from the "major|minor" bitmap, which is now called
udev_t in the kernel but still called dev_t in userland.

Provide functions to manipulate both types:
major() umajor()
minor() uminor()
makedev() umakedev()
dev2udev() udev2dev()

For now they're functions, they will become in-line functions
after one of the next two steps in this process.

Return major/minor/makedev to macro-hood for userland.

Register a name in cdevsw[] for the "filedescriptor" driver.

In the kernel the udev_t appears in places where we have the
major/minor number combination, (ie: a potential device: we
may not have the driver nor the device), like in inodes, vattr,
cdevsw registration and so on, whereas the dev_t appears where
we carry around a reference to a actual device.

In the future the cdevsw and the aliased-from vnode will be hung
directly from the dev_t, along with up to two softc pointers for
the device driver and a few houskeeping bits. This will essentially
replace the current "alias" check code (same buck, bigger bang).

A little stunt has been provided to try to catch places where the
wrong type is being used (dev_t vs udev_t), if you see something
not working, #undef DEVT_FASCIST in kern/kern_conf.c and see if
it makes a difference. If it does, please try to track it down
(many hands make light work) or at least try to reproduce it
as simply as possible, and describe how to do that.

Without DEVT_FASCIST I belive this patch is a no-op.

Stylistic/posixoid comments about the userland view of the <sys/*.h>
files welcome now, from userland they now contain the end result.

Next planned step: make all dev_t's refer to the same devsw[] which
means convert BLK's to CHR's at the perimeter of the vnodes and
other places where they enter the game (bootdev, mknod, sysctl).


# 45739 17-Apr-1999 peter

Well folks, this is it - The second stage of the removal for build support
for LKM's..


# 44151 19-Feb-1999 dillon

Make worklist add function a static, remove from sys/vnode.h


# 43555 03-Feb-1999 semenu

Added vnode tag for NTFS.
Reviewed by: David O'Brien <obrien@NUXI.com>


# 43350 28-Jan-1999 dillon

Clarify the SYSINIT problem by breaking SYSINIT's up into a void *
version and a const void * version. Currently the const void * version
simply calls the void * version ( i.e. no 'fix' is in place ).

A solution needs to be found for the C_SYSINIT ( etc...) family of
macros that allows const void * without generating a warning, but
does not allow non-const void *.


# 43311 27-Jan-1999 dillon

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# 43301 27-Jan-1999 dillon

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# 42900 20-Jan-1999 eivind

Add 'options DEBUG_LOCKS', which stores extra information in struct
lock, and add some macros and function parameters to make sure that
the information get to the point where it can be put in the lock
structure.

While I'm here, add DEBUG_VFS_LOCKS to LINT.


# 42315 05-Jan-1999 eivind

Remove the 'waslocked' parameter to vfs_object_create().


# 41056 10-Nov-1998 peter

Make the vnode opv vector construction fully dynamic. Previously we
leaked memory on each unload and were limited to items referenced in
the kernel copy of vnode_if.c. Now a kernel module is free to create
it's own VOP_FOO() routines and the rest of the system will happily
deal with it, including passthrough layers like union/umap/etc.

Have VFS_SET() call a common vfs_modevent() handler rather than
inline duplicating the common code all over the place.

Have VNODEOP_SET() have the vnodeops removed at unload time (assuming a
module) so that the vop_t ** vector is reclaimed.

Slightly adjust the vop_t ** vectors so that calling slot 0 is a panic
rather than a page fault. This could happen if VOP_something() was called
without *any* handlers being present anywhere (including in vfs_default.c).
slot 1 becomes the default vector for the vnodeop table.

TODO: reclaim zones on unload (eg: nfs code)


# 40786 31-Oct-1998 peter

Convert the vnode clean/dirty attached buffer lists from LISTs to TAILQs.
Add a new flags field (we get this for free because of struct packing)
for cleaner management of tailq membership.
We had two spare b_flags slots, but they are a precious resource and may
be needed for other things that are related to other b_flags bits. The two
new flags are convenient to use in a seperate location.

Reviewed (in principle) by: dg
Obtained from: John Dyson's old work-in-progress


# 40722 29-Oct-1998 peter

Remove the V_SAVEMETA flag, nothing uses it any more now that msdosfs and
ext2fs call vtruncbuf() directly. This simplifies and cleans up
vinvalbuf() a little.


# 40435 16-Oct-1998 peter

*gulp*. Jordan specifically OK'ed this..

This is the bulk of the support for doing kld modules. Two linker_sets
were replaced by SYSINIT()'s. VFS's and exec handlers are self registered.
kld is now a superset of lkm. I have converted most of them, they will
follow as a seperate commit as samples.
This all still works as a static a.out kernel using LKM's.


# 39085 11-Sep-1998 rvb

All the references to cfs, in symbols, structs, and strings
have been changed to coda. (Same for CFS.)


# 39036 10-Sep-1998 tegge

Don't keep the underlying directory locked while performing the file
system specific VFS_MOUNT operation.
PR: 1067


# 38573 27-Aug-1998 jkh

Add VT_CFS type.
Submitted by: Robert Baron <rvb@sicily.odyssey.cs.cmu.edu>


# 35938 11-May-1998 dyson

Fix the futimes/undelete/utrace conflict with other BSD's. Note that
the only common usage of utrace (the possible problem with this
commit) is with malloc, so this should be a real problem. Add
the various NetBSD syscalls that allow full emulation of their
development environment.


# 34924 28-Mar-1998 bde

Moved some #includes from <sys/param.h> nearer to where they are actually
used.


# 34611 15-Mar-1998 dyson

Some VM improvements, including elimination of alot of Sig-11
problems. Tor Egge and others have helped with various VM bugs
lately, but don't blame him -- blame me!!!

pmap.c:
1) Create an object for kernel page table allocations. This
fixes a bogus allocation method previously used for such, by
grabbing pages from the kernel object, using bogus pindexes.
(This was a code cleanup, and perhaps a minor system stability
issue.)

pmap.c:
2) Pre-set the modify and accessed bits when prudent. This will
decrease bus traffic under certain circumstances.

vfs_bio.c, vfs_cluster.c:
3) Rather than calculating the beginning virtual byte offset
multiple times, stick the offset into the buffer header, so
that the calculated offset can be reused. (Long long multiplies
are often expensive, and this is a probably unmeasurable performance
improvement, and code cleanup.)

vfs_bio.c:
4) Handle write recursion more intelligently (but not perfectly) so
that it is less likely to cause a system panic, and is also
much more robust.

vfs_bio.c:
5) getblk incorrectly wrote out blocks that are incorrectly sized.
The problem is fixed, and writes blocks out ONLY when B_DELWRI
is true.

vfs_bio.c:
6) Check that already constituted buffers have fully valid pages. If
not, then make sure that the B_CACHE bit is not set. (This was
a major source of Sig-11 type problems.)

vfs_bio.c:
7) Fix a potential system deadlock due to an incorrectly specified
sleep priority while waiting for a buffer write operation. The
change that I made opens the system up to serious problems, and
we need to examine the issue of process sleep priorities.

vfs_cluster.c, vfs_bio.c:
8) Make clustered reads work more correctly (and more completely)
when buffers are already constituted, but not fully valid.
(This was another system reliability issue.)

vfs_subr.c, ffs_inode.c:
9) Create a vtruncbuf function, which is used by filesystems that
can truncate files. The vinvalbuf forced a file sync type operation,
while vtruncbuf only invalidates the buffers past the new end of file,
and also invalidates the appropriate pages. (This was a system reliabiliy
and performance issue.)

10) Modify FFS to use vtruncbuf.

vm_object.c:
11) Make the object rundown mechanism for OBJT_VNODE type objects work
more correctly. Included in that fix, create pager entries for
the OBJT_DEAD pager type, so that paging requests that might slip
in during race conditions are properly handled. (This was a system
reliability issue.)

vm_page.c:
12) Make some of the page validation routines be a little less picky
about arguments passed to them. Also, support page invalidation
change the object generation count so that we handle generation
counts a little more robustly.

vm_pageout.c:
13) Further reduce pageout daemon activity when the system doesn't
need help from it. There should be no additional performance
decrease even when the pageout daemon is running. (This was
a significant performance issue.)

vnode_pager.c:
14) Teach the vnode pager to handle race conditions during vnode
deallocations.


# 34266 08-Mar-1998 julian

Reviewed by: dyson@freebsd.org (john Dyson), dg@root.com (david greenman)
Submitted by: Kirk McKusick (mcKusick@mckusick.com)
Obtained from: WHistle development tree


# 34206 07-Mar-1998 dyson

This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifest themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistancy, and as a side-effect broke
the vfs.ioopt code. This code might have been committed seperately, but
almost everything is interrelated.

1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitious buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups assocated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistancy problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify it's status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and it's descendents to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.


# 32724 24-Jan-1998 dyson

Add better support for larger I/O clusters, including larger physical
I/O. The support is not mature yet, and some of the underlying implementation
needs help. However, support does exist for IDE devices now.


# 32585 17-Jan-1998 dyson

Tie up some loose ends in vnode/object management. Remove an unneeded
config option in pmap. Fix a problem with faulting in pages. Clean-up
some loose ends in swap pager memory management.

The system should be much more stable, but all subtile bugs aren't fixed yet.


# 32454 11-Jan-1998 dyson

Fix some vnode management problems, and better mgmt of vnode free list.
Fix the UIO optimization code.
Fix an assumption in vm_map_insert regarding allocation of swap pagers.
Fix an spl problem in the collapse handling in vm_object_deallocate.
When pages are freed from vnode objects, and the criteria for putting
the associated vnode onto the free list is reached, either put the
vnode onto the list, or put it onto an interrupt safe version of the
list, for further transfer onto the actual free list.
Some minor syntax changes changing pre-decs, pre-incs to post versions.
Remove a bogus timeout (that I added for debugging) from vn_lock.

PHK will likely still have problems with the vnode list management, and
so do I, but it is better than it was.


# 32286 06-Jan-1998 dyson

Make our v_usecount vnode reference count work identically to the
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_object's are no longer freed gratuitiously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.

When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.

When vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.

A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.

Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon a reasonably possible.


# 32093 29-Dec-1997 dyson

Add the vnode interlock back around vget.


# 32072 28-Dec-1997 dyson

Fix the decl of vfs_ioopt, allow LFS to compile again, fix a minor problem
with the object cache removal.


# 32071 28-Dec-1997 dyson

Lots of improvements, including restructring the caching and management
of vnodes and objects. There are some metadata performance improvements
that come along with this. There are also a few prototypes added when
the need is noticed. Changes include:

1) Cleaning up vref, vget.
2) Removal of the object cache.
3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore.
4) Correct some missing LK_RETRY's in vn_lock.
5) Correct the page range in the code for msync.

Be gentle, and please give me feedback asap.


# 31727 15-Dec-1997 wollman

Add support for poll(2) on files. vop_nopoll() now returns POLLNVAL
if one of the new poll types is requested; hopefully this will not break
any existing code. (This is done so that programs have a dependable
way of determining whether a filesystem supports the extended poll types
or not.)

The new poll types added are:

POLLWRITE - file contents may have been modified
POLLNLINK - file was linked, unlinked, or renamed
POLLATTRIB - file's attributes may have been changed
POLLEXTEND - file was extended

Note that the internal operation of poll() means that it is impossible
for two processes to reliably poll for the same event (this could
be fixed but may not be worth it), so it is not possible to rewrite
`tail -f' to use poll at this time.


# 31561 05-Dec-1997 bde

Don't include <sys/lock.h> in headers when only `struct simplelock' is
required. Fixed everything that depended on the pollution.


# 31352 22-Nov-1997 bde

Staticized.


# 31248 18-Nov-1997 bde

Don't #include <machine/smp.h> even in the SMP case. Fixed the one
place that depended on it. The "bazillion warnings" mentioned in the
log for rev.1.45 apparently aren't a problem any more. It is hard
to be sure because the SIMPLELOCK_DEBUG option turns off (and breaks)
things in the SMP case.

Don't forward declare structs that are already implicitly forward declared.

Fixed a disordered declaration.


# 30743 26-Oct-1997 phk

VFS interior redecoration.

Rename vn_default_error to vop_defaultop all over the place.
Move vn_bwrite from vfs_bio.c to vfs_default.c and call it vop_stdbwrite.
Use vop_null instead of nullop.
Move vop_nopoll from vfs_subr.c to vfs_default.c
Move vop_sharedlock from vfs_subr.c to vfs_default.c
Move vop_nolock from vfs_subr.c to vfs_default.c
Move vop_nounlock from vfs_subr.c to vfs_default.c
Move vop_noislocked from vfs_subr.c to vfs_default.c
Use vop_ebadf instead of *_ebadf.
Add vop_defaultop for getpages on master vnode in MFS.


# 30739 26-Oct-1997 phk

Simplify the lease_check stuff.


# 30513 17-Oct-1997 phk

Make a set of VOP standard lock, unlock & islocked VOP operators, which
depend on the lock being located at vp->v_data. Saves 3x3 identical
vop procs, more as the other filesystems becomes lock aware.


# 30492 16-Oct-1997 phk

Another VFS cleanup "kilo commit"

1. Remove VOP_UPDATE, it is (also) an UFS/{FFS,LFS,EXT2FS,MFS}
intereface function, and now lives in the ufsmount structure.

2. Remove VOP_SEEK, it was unused.

3. Add mode default vops:

VOP_ADVLOCK vop_einval
VOP_CLOSE vop_null
VOP_FSYNC vop_null
VOP_IOCTL vop_enotty
VOP_MMAP vop_einval
VOP_OPEN vop_null
VOP_PATHCONF vop_einval
VOP_READLINK vop_einval
VOP_REALLOCBLKS vop_eopnotsupp

And remove identical functionality from filesystems

4. Add vop_stdpathconf, which returns the canonical stuff. Use
it in the filesystems. (XXX: It's probably wrong that specfs
and fifofs sets this vop, shouldn't it come from the "host"
filesystem, for instance ufs or cd9660 ?)

5. Try to make system wide VOP functions have vop_* names.

6. Initialize the um_* vectors in LFS.

(Recompile your LKMS!!!)


# 30474 16-Oct-1997 phk

VFS mega cleanup commit (x/N)

1. Add new file "sys/kern/vfs_default.c" where default actions for
VOPs go. Implement proper defaults for ABORTOP, BWRITE, LEASE,
POLL, REVOKE and STRATEGY. Various stuff spread over the entire
tree belongs here.

2. Change VOP_BLKATOFF to a normal function in cd9660.

3. Kill VOP_BLKATOFF, VOP_TRUNCATE, VOP_VFREE, VOP_VALLOC. These
are private interface functions between UFS and the underlying
storage manager layer (FFS/LFS/MFS/EXT2FS). The functions now
live in struct ufsmount instead.

4. Remove a kludge of VOP_ functions in all filesystems, that did
nothing but obscure the simplicity and break the expandability.
If a filesystem doesn't implement VOP_FOO, it shouldn't have an
entry for it in its vnops table. The system will try to DTRT
if it is not implemented. There are still some cruft left, but
the bulk of it is done.

5. Fix another VCALL in vfs_cache.c (thanks Bruce!)


# 30354 12-Oct-1997 phk

Last major round (Unless Bruce thinks of somthing :-) of malloc changes.

Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.

A couple of finer points by: bde


# 29653 21-Sep-1997 dyson

Change the M_NAMEI allocations to use the zone allocator. This change
plus the previous changes to use the zone allocator decrease the useage
of malloc by half. The Zone allocator will be upgradeable to be able
to use per CPU-pools, and has more intelligent usage of SPLs. Additionally,
it has reasonable stats gathering capabilities, while making most calls
inline.


# 29350 14-Sep-1997 peter

Update interfaces for poll()


# 28954 31-Aug-1997 phk

Change the 0xdeadb hack to a flag called VDOOMED.
Introduce VFREE which indicates that vnode is on freelist.
Rename vholdrele() to vdrop().
Create vfree() and vbusy() to add/delete vnode from freelist.
Add vfree()/vbusy() to keep (v_holdcnt != 0 || v_usecount != 0)
vnodes off the freelist.
Generalize vhold()/v_holdcnt to mean "do not recycle".
Fix reassignbuf()s lack of use of vhold().
Use vhold() instead of checking v_cache_src list.
Remove vtouch(), the vnodes are always vget'ed soon enough
after for it to have any measuable effect.
Add sysctl debug.freevnodes to keep track of things.
Move cache_purge() up in getnewvnodes to avoid race.
Decrement v_usecount after VOP_INACTIVE(), put a vhold() on
it during VOP_INACTIVE()
Unmacroize vhold()/vdrop()
Print out VDOOMED and VFREE flags (XXX: should use %b)

Reviewed by: dyson


# 28787 26-Aug-1997 phk

Uncut&paste cache_lookup().

This unifies several times in theory indentical 50 lines of code.

The filesystems have a new method: vop_cachedlookup, which is the
meat of the lookup, and use vfs_cache_lookup() for their vop_lookup
method. vfs_cache_lookup() will check the namecache and pass on
to the vop_cachedlookup method in case of a miss.

It's still the task of the individual filesystems to populate the
namecache with cache_enter().

Filesystems that do not use the namecache will just provide the
vop_lookup method as usual.


# 28349 18-Aug-1997 fsmp

Added includes of smp.h for SMP.
This eliminates a bazillion warnings about implicit s_lock & friends.


# 25453 04-May-1997 phk

1. Add a {pointer, v_id} pair to the vnode to store the reference to the
".." vnode. This is cheaper storagewise than keeping it in the
namecache, and it makes more sense since it's a 1:1 mapping.

2. Also handle the case of "." more intelligently rather than stuff
the namecache with pointless entries.

3. Add two lists to the vnode and hang namecache entries which go from
or to this vnode. When cleaning a vnode, delete all namecache
entries it invalidates.

4. Never reuse namecache enties, malloc new ones when we need it, free
old ones when they die. No longer a hard limit on how many we can
have.

5. Remove the upper limit on namelength of namecache entries.

6. Make a global list for negative namecache entries, limit their number
to a sysctl'able (debug.ncnegfactor) fraction of the total namecache.
Currently the default fraction is 1/16th. (Suggestions for better
default wanted!)

7. Assign v_id correctly in the face of 32bit rollover.

8. Remove the LRU list for namecache entries, not needed. Remove the
#ifdef NCH_STATISTICS stuff, it's not needed either.

9. Use the vnode freelist as a true LRU list, also for namecache accesses.

10. Reuse vnodes more aggresively but also more selectively, if we can't
reuse, malloc a new one. There is no longer a hard limit on their
number, they grow to the point where we don't reuse potentially
usable vnodes. A vnode will not get recycled if still has pages in
core or if it is the source of namecache entries (Yes, this does
indeed work :-) "." and ".." are not namecache entries any longer...)

11. Do not overload the v_id field in namecache entries with whiteout
information, use a char sized flags field instead, so we can get
rid of the vpid and v_id fields from the namecache struct. Since
we're linked to the vnodes and purged when they're cleaned, we don't
have to check the v_id any more.

12. NFS knew about the limitation on name length in the namecache, it
shouldn't and doesn't now.

Bugs:
The namecache statistics no longer includes the hits for ".."
and "." hits.

Performance impact:
Generally in the +/- 0.5% for "normal" workstations, but
I hope this will allow the system to be selftuning over a
bigger range of "special" applications. The case where
RAM is available but unused for cache because we don't have
any vnodes should be gone.

Future work:
Straighten out the namecache statistics.

"desiredvnodes" is still used to (bogusly ?) size hash
tables in the filesystems.

I have still to find a way to safely free unused vnodes
back so their number can shrink when not needed.

There is a few uses of the v_id field left in the filesystems,
scheduled for demolition at a later time.

Maybe a one slot cache for unused namecache entries should
be implemented to decrease the malloc/free frequency.


# 24623 04-Apr-1997 dfr

Add some debugging macros for tracing VFS locking bugs.
Declare (hopefully short-lived) vop_sharedlock.


# 22975 22-Feb-1997 peter

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 22602 12-Feb-1997 mpp

Remove function prototypes for vfs_mountroot and vgoneall, since
they were removed with the Lite2 merge.

Submitted by: bde


# 22521 10-Feb-1997 dyson

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# 21770 16-Jan-1997 bde

Removed option EXTRAVNODES. All versions of FreeBSD-2.x have a sysctl
variable `kern.maxvnodes' which gives much better control over vnode
allocation than EXTRAVNODES (except in -current between 1995/10/28 and
1996/11/12, kern.maxvnodes was read-only and thus useless).


# 21673 14-Jan-1997 jkh

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 21002 29-Dec-1996 dyson

This commit is the embodiment of some VFS read clustering improvements.
Firstly, now our read-ahead clustering is on a file descriptor basis and not
on a per-vnode basis. This will allow multiple processes reading the
same file to take advantage of read-ahead clustering. Secondly, there
previously was a problem with large reads still using the ramp-up
algorithm. Of course, that was bogus, and now we read the entire
"chunk" off of the disk in one operation. The read-ahead clustering
algorithm should use less CPU than the previous also (I hope :-)).

NOTE: THAT LKMS MUST BE REBUILT!!!


# 18990 17-Oct-1996 jkh

Some very small changes to support Netcon's TFS filesystem.
These patches were formerly applied by the Netcon installer
before rebuilding your kernel.


# 18946 15-Oct-1996 bde

Updated #includes to 4.4lite style.


# 17761 21-Aug-1996 dyson

Even though this looks like it, this is not a complex code change.
The interface into the "VMIO" system has changed to be more consistant
and robust. Essentially, it is now no longer necessary to call vn_open
to get merged VM/Buffer cache operation, and exceptional conditions
such as merged operation of VBLK devices is simpler and more correct.

This code corrects a potentially large set of problems including the
problems with ktrace output and loaded systems, file create/deletes,
etc.

Most of the changes to NFS are cosmetic and name changes, eliminating
a layer of subroutine calls. The direct calls to vput/vrele have
been re-instituted for better cross platform compatibility.

Reviewed by: davidg


# 16025 30-May-1996 peter

Add an option "EXTRA_VNODES" to cause an extra number of vnode structures
to be allocated at boot time. This is an expensive option, as they
consume physical ram and are not pageable etc. In certain situations,
this kind of option is quite useful, especially for news servers that
access a large number of directories at random and torture the name cache.
Defining 5000 or 10000 extra vnodes should cut down the amount of vnode
recycling somewhat, which should allow better name and directory caching
etc.

This is a "your mileage may vary" option, with no real indication of
what works best for your machine except trial and error. Too many will
cost you ram that you could otherwise use for disk buffers etc.

This is based on something John Dyson mentioned to me a while ago.


# 14902 29-Mar-1996 dg

Change v_usecount & v_writecount from a short to an int. As shorts they
can and will overflow on large machines - especially on machines with
filesystems with lots of files (like netnews servers), and the result
is a "free vnode isn't" panic or worse.
This fixes one of the causes of these panics that I've been experiancing on
wcarchive.


# 14359 03-Mar-1996 peter

Add missing prototype for newly public vn_vmio_open function, next to
vn_vmio_close.


# 13765 30-Jan-1996 mpp

Fix a bunch of spelling errors in the comment fields of
a bunch of system include files.


# 13490 19-Jan-1996 dyson

Eliminated many redundant vm_map_lookup operations for vm_mmap.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do alot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.


# 13012 25-Dec-1995 bde

Removed redundant (incompletely staticized) declararations.


# 12913 17-Dec-1995 phk

Staticize.
Unstaticize a function in scsi/scsi_base that was used, with an undocumented
option.
My last count on the LINT kernel shows:
Total symbols: 3647
unref symbols: 463
undef symbols: 4
1 ref symbols: 1751
2 ref symbols: 485
Approaching the pain threshold now.


# 12873 15-Dec-1995 bde

Completed a function declaration.

Restored order to prototype list.

Restored tabs to #defines.


# 12767 11-Dec-1995 dyson

Changes to support 1Tb filesizes. Pages are now named by an
(object,index) pair instead of (object,offset) pair.


# 12158 09-Nov-1995 bde

Introduced a type `vop_t' for vnode operation functions and used
it 1138 times (:-() in casts and a few more times in declarations.
This change is null for the i386.

The type has to be `typedef int vop_t(void *)' and not `typedef
int vop_t()' because `gcc -Wstrict-prototypes' warns about the
latter. Since vnode op functions are called with args of different
(struct pointer) types, neither of these function types is any use
for type checking of the arg, so it would be preferable not to use
the complete function type, especially since using the complete
type requires adding 1138 casts to avoid compiler warnings and
another 40+ casts to reverse the function pointer conversions before
calling the functions.


# 12148 08-Nov-1995 dyson

Export a symbol that ext2fs wants (insmntque.)


# 9411 06-Jul-1995 dg

Fixed an object allocation race condition that was causing a "object
deallocated too many times" panic when using NFS.

Reviewed by: John Dyson


# 9356 28-Jun-1995 dg

1) Converted v_vmdata to v_object.
2) Removed unnecessary vm_object_lookup()/pager_cache(object, TRUE) pairs
after vnode_pager_alloc() calls - the object is already guaranteed to be
persistent.
3) Removed some gratuitous casts.


# 7945 20-Apr-1995 julian

Reviewed by: no-one yet, but non-intrusive
Submitted by: julian@tfs.com
Obtained from: written from scratch

slight changes to make space for devfs..
(also conditional test code in i386/isa/fd.c)

===================================================================
RCS file: /home/ncvs/src/sys/sys/malloc.h,v
retrieving revision 1.7
diff -r1.7 malloc.h
113a114,117
> #define M_DEVFSMNT 62 /* DEVFS mount structure */
> #define M_DEVFSBACK 63 /* DEVFS Back node */
> #define M_DEVFSFRONT 64 /* DEVFS Front node */
> #define M_DEVFSNODE 65 /* DEVFS node */
184c188,192
< NULL, NULL, NULL, NULL, NULL, \
---
> "DEVFS mount", /* 62 M_DEVFSMNT */ \
> "DEVFS back", /* 63 M_DEVFSBACK */ \
> "DEVFS front", /* 64 M_DEVFSFRONT */ \
> "DEVFS node", /* 65 M_DEVFSNODE */ \
> NULL, \
Index: sys/mount.h
===================================================================
RCS file: /home/ncvs/src/sys/sys/mount.h,v
retrieving revision 1.16
diff -r1.16 mount.h
100c100,101
< #define MOUNT_MAXTYPE 15
---
> #define MOUNT_DEVFS 16 /* existing device Filesystem */
> #define MOUNT_MAXTYPE 16
118a120
> "devfs", /* 15 MOUNT_DEVFS */ \
Index: sys/vnode.h
===================================================================
RCS file: /home/ncvs/src/sys/sys/vnode.h,v
retrieving revision 1.19
diff -r1.19 vnode.h
61c61
< VT_UNION, VT_MSDOSFS
---
> VT_UNION, VT_MSDOSFS, VT_DEVFS


# 7695 09-Apr-1995 dg

Changes from John Dyson and myself:

Fixed remaining known bugs in the buffer IO and VM system.

vfs_bio.c:
Fixed some race conditions and locking bugs. Improved performance
by removing some (now) unnecessary code and fixing some broken
logic.
Fixed process accounting of # of FS outputs.
Properly handle NFS interrupts (B_EINTR).

(various)
Replaced calls to clrbuf() with calls to an optimized routine
called vfs_bio_clrbuf().

(various FS sync)
Sync out modified vnode_pager backed pages.

ffs_vnops.c:
Do two passes: Sync out file data first, then indirect blocks.

vm_fault.c:
Fixed deadly embrace caused by acquiring locks in the wrong order.

vnode_pager.c:
Changed to use buffer I/O system for writing out modified pages. This
should fix the problem with the modification date previous not getting
updated. Also dramatically simplifies the code. Note that this is
going to change in the future and be implemented via VOP_PUTPAGES().

vm_object.c:
Fixed a pile of bugs related to cleaning (vnode) objects. The performance
of vm_object_page_clean() is terrible when dealing with huge objects,
but this will change when we implement a binary tree to keep the object
pages sorted.

vm_pageout.c:
Fixed broken clustering of pageouts. Fixed race conditions and other
lockup style bugs in the scanning of pages. Improved performance.


# 7461 29-Mar-1995 dg

When NFS is compiled into the kernel, make NQNFS lease checking conditional
on a "NQNFS" kernel config option. NQNFS is a 4.4 wart and the performance
penalty of the lease checks on the client/server for _local_ I/O is too high
to have this occur all the time - especially when most people will never
use it.


# 7090 16-Mar-1995 bde

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# 6946 07-Mar-1995 dg

Added a new flag "VAGE" to indicate that the vnode should go on the head
of the free list.


# 5412 05-Jan-1995 gibbs

Add VNINACT flag. LFS has a habbit of skipping the ufs_inactive procedure.
It used to do this by setting a global <Yuck>. Now we set th VNINACT
flag in the vnode to force a skip of ufs_inactive.

Sorry for missing this file in my last commit folks.

Index: vnode.h
===================================================================
RCS file: /usr/cvs/src/sys/sys/vnode.h,v
retrieving revision 1.14
diff -c -r1.14 vnode.h
*** 1.14 1994/11/14 13:51:53
--- vnode.h 1994/12/03 01:06:27
***************
*** 116,121 ****
--- 116,122 ----
#define VALIASED 0x0800 /* vnode has an alias */
#define VDIROP 0x1000 /* LFS: vnode is involved in a directory op */
#define VVMIO 0x2000 /* VMIO flag */
+ #define VNINACT 0x4000 /* LFS: skip ufs_inactive() in lfs_vunref */

/*
* Vnode attributes. A field value of VNOVAL represents a field whose value


# 4465 14-Nov-1994 bde

Add prototype for vfinddev().


# 3745 20-Oct-1994 wollman

Make my ALLDEVS kernel compile (basically, LINT minus a lot of options).

This involves fixing a few things I broke last time.


# 3438 08-Oct-1994 phk

Added prototypes here and there. Moved pfctlinput into socket.h.


# 3374 05-Oct-1994 dg

Stuff object into v_vmdata rather than pager. Not important which at
the moment, but will be in the future. Other changes mostly cosmetic,
but are made for future VMIO considerations.

Submitted by: John Dyson


# 3304 02-Oct-1994 phk

Prototypes, prototypes and even more prototypes. Not quite done yet, but
getting closer all the time.


# 3098 25-Sep-1994 phk

While in the real world, I had a bad case of being swapped out for a lot of
cycles. While waiting there I added a lot of the extra ()'s I have, (I have
never used LISP to any extent). So I compiled the kernel with -Wall and
shut up a lot of "suggest you add ()'s", removed a bunch of unused var's
and added a couple of declarations here and there. Having a lap-top is
highly recommended. My kernel still runs, yell at me if you kernel breaks.


# 2997 22-Sep-1994 wollman

Make NFS loadable.


# 2946 21-Sep-1994 wollman

Implemented loadable VFS modules, and made most existing filesystems
loadable. (NFS is a notable exception.)


# 2893 19-Sep-1994 dfr

Added msdosfs.

Obtained from: NetBSD


# 2811 15-Sep-1994 bde

Add some prototypes.


# 2384 29-Aug-1994 dg

"bogus" fixes from 1.1.5 to work around some cache coherency problems.


# 2165 21-Aug-1994 paul

Made them all idempotent.
Reviewed by:
Submitted by:


# 1817 02-Aug-1994 dg

Added $Id$


# 1542 24-May-1994 rgrimes

This commit was generated by cvs2svn to compensate for changes in r1541,
which included commits to RCS files with non-trunk default branches.


# 1541 24-May-1994 rgrimes

BSD 4.4 Lite Kernel Sources