History log of /freebsd-11.0-release/sys/kern/vfs_subr.c
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# 303975 11-Aug-2016 gjb

Copy stable/11@r303970 to releng/11.0 as part of the 11.0-RELEASE
cycle.

Prune svn:mergeinfo from the new branch, and rename it to RC1.

Update __FreeBSD_version.

Use the quarterly branch for the default FreeBSD.conf pkg(8) repo and
the dvd1.iso packages population.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation

# 303290 25-Jul-2016 kib

MFC r302567:
In vgonel(), postpone setting BO_DEAD until VOP_RECLAIM() is called,
if vnode is VMIO. For VMIO vnodes, set BO_DEAD in vm_object_terminate().

MFC r302580:
Fix grammar.

Approved by: re (gjb)


# 302408 08-Jul-2016 gjb

Copy head@r302406 to stable/11 as part of the 11.0-RELEASE cycle.
Prune svn:mergeinfo from the new branch, as nothing has been merged
here.

Additional commits post-branch will follow.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


# 302322 03-Jul-2016 kib

Remove racy assert. The thread which changes vnode usecount from 0 to 1
does it under the vnode interlock, but the interlock is not owned by the
asserting thread. As result, we might read increased use counter but also
still see VI_OWEINACT.

In collaboration with: nwhitehorn
Hardware donated by: IBM LTC
Sponsored by: The FreeBSD Foundation (kib)
Approved by: re (gjb)


# 302029 20-Jun-2016 kib

Fix typo. Note that atomic is still required even for interlocked case.

Sponsored by: The FreeBSD Foundation
Approved by: re (marius)


# 302000 17-Jun-2016 mjg

vfs: ifdef out noop vop_* primitives on !DEBUG_VFS_LOCKS kernels

This removes calls to empty functions like vop_lock_{pre/post} from
common vfs routines.

Approved by: re (gjb)


# 301996 17-Jun-2016 kib

Add VFS interface to flush specified amount of free vnodes belonging
to mount points with the given filesystem type, specified by mount
vfs_ops pointer.

Based on patch by: mckusick
Reviewed by: avg, mckusick
Tested by: allanjude, madpilot
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)


# 301040 31-May-2016 trasz

Cosmetics - add missing space after ellipses in shutdown messages.

MFC after: 1 month
Sponsored by: The FreeBSD Foundation


# 299916 16-May-2016 avg

vfs_read_dirent: increment ncookies after adding a cookie

It seems that at present vfs_read_dirent() is used only with filesystems
that do not support cookies, so the bug never manifested itself.

MFC after: 1 week


# 298982 03-May-2016 kib

Add EVFILT_VNODE open, read and close notifications.

While there, order EVFILT_VNODE notes descriptions alphabetically.

Based on submission, and tested by: Vladimir Kondratyev <wulf@cicgroup.ru>
MFC after: 2 weeks


# 298922 02-May-2016 kib

Issue NOTE_EXTEND when a directory entry is added to or removed from
the monitored directory as the result of rename(2) operation. The
renames staying in the directory are not reported.

Submitted by: Vladimir Kondratyev <wulf@cicgroup.ru>
MFC after: 2 weeks


# 298921 02-May-2016 kib

Fix reporting of NOTE_LINK when directory link count changes due to
rename removing or adding subdirectory entry.

Discussed with and tested by: Vladimir Kondratyev <wulf@cicgroup.ru>
NetBSD PR: 48958 (http://gnats.netbsd.org/48958)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# 298819 29-Apr-2016 pfg

sys/kern: spelling fixes in comments.

No functional change.


# 298649 26-Apr-2016 pfg

sys: extend use of the howmany() macro when available.

We have a howmany() macro in the <sys/param.h> header that is
convenient to re-use as it makes things easier to read.


# 295971 24-Feb-2016 kib

Provide more correct sizing of the KVA consumed by a vnode, used by
the virtvnodes calculation. Include the size of fs-specific v_data as
the nfs nclnode inline, the NFS nclnode is bigger than either ZFS
znode or UFS inode. Include the size of namecache_ts and short cache
path element, multiplied by the name cache population factor, again
inline.

Inline defines are used to avoid pollution of the vnode.h with the
subsystem-private objects. Non-significant unsynchronized changes of
the definitions are fine, we do not care about that precision, and
e.g. ZFS consumes much malloced memory per vnode for reasons
unaccounted in the formula.

Lower the partition of kmem dedicated to vnodes, from 1/7 to 1/10.

The measures reduce vnode cache pressure on kmem and bring the vnode
cache memory use below some apparent thresholds that were exceeded by
r291244 due to more robust vnode reuse.

Reported and tested by: marius (i386, previous version)
Reviewed by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 295716 17-Feb-2016 kib

In bnoreuselist(), check both ends of the specified logical block
numbers range.

This effectively skips indirect and extdata blocks on the buffer
queue. Since their logical block numbers are negative, bnoreuselist()
could loop infinitely.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 294299 18-Jan-2016 markj

Add vrefl(), a locked variant of vref(9).

This API has no in-tree consumers at the moment but is useful to at least
one out-of-tree consumer, and naturally complements existing vnode refcount
functions (vholdl(9), vdropl(9)).

Obtained from: kib (sys/ portion)
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D4947
Differential Revision: https://reviews.freebsd.org/D4953


# 293197 05-Jan-2016 kib

Two fixes for excessive iterations after r292326.

Advance the logical block number to the lblkno of the found block plus
one, instead of incrementing the block number which was used for
lookup. This change skips sparcely populated buffer ranges, similar
to r292325, instead of doing useless lookups.

Do not restart the bnoreuselist() from the start of the range if
buffer lock cannot be obtained without sleep. Only retry lookup and
lock for the same queue and same logical block number.

Reported by: benno
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# 292326 16-Dec-2015 kib

Optimize vop_stdadvise(POSIX_FADV_DONTNEED). Instead of looking up a
buffer for each block number in the range with gbincore(), look up the
next instantiated buffer with the logical block number which is
greater or equal to the next lblkno. This significantly speeds up the
iteration for sparce-populated range.

Move the iteration into new helper bnoreuselist(), which is structured
similarly to flushbuflist().

Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation


# 292325 16-Dec-2015 kib

Simplify the loop step in the flushbuflist() and make it independed on
the type stability of the buffers memory. Instead of memoizing
pointer to the next buffer and validating it, remember the next
logical block number in the bo list and re-lookup.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation


# 291743 04-Dec-2015 mckusick

We need to zero out the clustering variables in a freed vnode structure.
For completeness add a VNASSERT that there are no threads waiting on a
range lock (this was previously checked on every vnode free).

Reported by; Rick Macklem
Fix from: Mateusz Guzik
PR: 204949


# 291671 03-Dec-2015 mckusick

We need to zero out the union of pointers in a freed vnode structure.

PR: 204949
Fix from: Mateusz Guzik
Tested by: Jason Unovitch


# 291460 29-Nov-2015 mckusick

As the kernel allocates and frees vnodes, it fully initializes them
on every allocation and fully releases them on every free. These
are not trivial costs: it starts by zeroing a large structure then
initializes a mutex, a lock manager lock, an rw lock, four lists,
and six pointers. And looking at vfs.vnodes_created, these operations
are being done millions of times an hour on a busy machine.

As a performance optimization, this code update uses the uma_init
and uma_fini routines to do these initializations and cleanups only
as the vnodes enter and leave the vnode_zone. With this change the
initializations are only done kern.maxvnodes times at system startup
and then only rarely again. The frees are done only if the vnode_zone
shrinks which never happens in practice. For those curious about the
avoided work, look at the vnode_init() and vnode_fini() functions in
kern/vfs_subr.c to see the code that has been removed from the main
vnode allocation/free path.

Reviewed by: kib
Tested by: Peter Holm


# 291380 27-Nov-2015 kib

Remove VI_AGE vnode iflag, it is unused.

Noted by: bde
Sponsored by: The FreeBSD Foundation


# 291379 27-Nov-2015 kib

Move the comment about resident pages preventing vnode from leaving
active list, into the header comment for vdrop(), which is the
function that decides whether to leave the vnode on the list. Note
that dirty page write-out in vinactive() is asynchronous.

Discussed with: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 291244 24-Nov-2015 kib

Rework the vnode cache recycling to meet free and unused vnodes
targets. See the comment above wantfreevnodes variable for the
description of the algorithm.

The vfs.vlru_alloc_cache_src sysctl is removed. New code frees
namecache sources as the last chance to satisfy the highest watermark,
instead of selecting the source vnodes randomly. This provides good
enough behaviour to keep vn_fullpath() working in most situations.
The filesystem layout with deep trees, where the removed knob was
required, is thus handled automatically.

Submitted by: bde
Discussed with: mckusick
Tested by: pho
MFC after: 1 month


# 291116 21-Nov-2015 glebius

Remove remnants of the old NFS from vnode pager.

Reviewed by: kib
Sponsored by: Netflix


# 288280 26-Sep-2015 markj

Remove a check for a condition that is always false by a preceding KASSERT
that was added in r144704.


# 288276 26-Sep-2015 markj

Fix argument ordering in vn_printf().

MFC after: 3 days


# 287831 15-Sep-2015 cem

kevent(2): Note DOOMED vnodes with NOTE_REVOKE

In poll mode, check for and wake VBAD vnodes. (Vnodes that are VBAD at
registration will never be woken by the RECLAIM trigger.)

Add post-VOP_RECLAIM hook to trigger notes on vnode reclamation. (Vnodes that
were fine at registration but are vgoned while being monitored should signal
waiters.)

Reviewed by: kib
Approved by: markj (mentor)
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3675


# 287497 06-Sep-2015 mckusick

Track changes to kern.maxvnodes and appropriately increase or decrease
the size of the name cache hash table (mapping file names to vnodes)
and the vnode hash table (mapping mount point and inode number to vnode).
An appropriate locking strategy is the key to changing hash table sizes
while they are in active use.

Reviewed by: kib
Tested by: Peter Holm
Differential Revision: https://reviews.freebsd.org/D2265
MFC after: 2 weeks


# 287107 24-Aug-2015 trasz

Make vfs_unmountall() unmount /dev after /, not before. The only
reason this didn't result in an unclean shutdown is that devfs ignores
MNT_FORCE flag.

Reviewed by: kib@
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D3467


# 287033 23-Aug-2015 trasz

After r286237 it should be fine to call vgone(9) on a busy GEOM vnode;
remove KASSERT that would prevent forced devfs unmount from working.

MFC after: 1 month
Sponsored by: The FreeBSD Foundation


# 286307 05-Aug-2015 ed

Make it possible to implement poll(2) on top of kqueue(2).

It looks like EVFILT_READ and EVFILT_WRITE trigger under the same
conditions as poll()'s POLLRDNORM and POLLWRNORM as described by POSIX.
The only difference is that POLLRDNORM has to be triggered on regular
files unconditionally, whereas EVFILT_READ only triggers when not EOF.

Introduce a new flag, NOTE_FILE_POLL, that can be used to make
EVFILT_READ and EVFILT_WRITE behave identically to poll(). This flag
will be used by cloudlibc's poll() function.

Reviewed by: jmg
Differential Revision: https://reviews.freebsd.org/D3303


# 286281 04-Aug-2015 trasz

Mark vgonel() as static. It was already declared static earlier;
no idea why compilers don't warn about this.

MFC after: 1 month
Sponsored by: The FreeBSD Foundation


# 285632 16-Jul-2015 mjg

vfs: implement v_holdcnt/v_usecount manipulation using atomic ops

Transitions 0->1 and 1->0 (which decide e.g. on putting the vnode on the free
list) of either counter are still guarded with vnode interlock.

Reviewed by: kib (earlier version)
Tested by: pho


# 285393 11-Jul-2015 mjg

vfs: always clear VI_OWEINACT in consumers bumping v_usecount

Previously vputx would detect the condition and clear the flag.

With this change it is invalid to have both v_usecount > 0 and the flag
set. Assert the condition is met in all revlevant places.

Reviewed by: kib


# 285392 11-Jul-2015 mjg

vfs: move si_usecount manipulation to dedicated functions

Reviewed by: kib


# 285384 11-Jul-2015 kib

Do not allow creation of the dirty buffers for the dead buffer
objects, i.e. for buffer objects which vnode was reclaimed. Buffer
cache cannot write such buffers. Return the error and discard the
buffer immediately on write attempt.

BO_DIRTY now always set during vnode reclamation, since it is used not
only for the INVARIANTS checks. Do allow placement of the clean
buffers on dead bufobj list, otherwise filesystems cannot use bufcache
at all after the devvp reclaim.

Reported and tested by: trasz
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 285183 05-Jul-2015 markj

Remove a stale descriptive comment for gbincore().

The splay trees referenced in the comment were converted to
path-compressed tries in r250551.

MFC after: 3 days


# 284733 23-Jun-2015 jmg

zero this struct as it depends upon it...

Reviewed by: mjg
Differential Revision: https://reviews.freebsd.org/D2890


# 284495 17-Jun-2015 kib

vfs_msync(), called from syncer vnode fsync VOP, only iterates over
the active vnode list for the given mount point, with the assumption
that vnodes with dirty pages are active. This is enforced by
vinactive() doing vm_object_page_clean() pass over the vnode pages.

The issue is, if vinactive() cannot be called during vput() due to the
vnode being only shared-locked, we might end up with the dirty pages
for the vnode on the free list. Such vnode is invisible to syncer,
and pages are only cleaned on the vnode reactivation. In other words,
the race results in the broken guarantee that user data, written
through the mmap(2), is written to the disk not later than in 30
seconds after the write.

Fix this by keeping the vnode which is freed but still owing
inactivation, on the active list. When syncer loops find such vnode,
it is deactivated and cleaned by the final vput() call.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 283602 27-May-2015 kib

Right now, dounmount() is called with unreferenced mount point.
Nothing stops a parallel unmount to suceed before the given call to
dounmount() checks and locks the covered vnode. Prevent dounmount()
from acting on the freed (although type-stable) memory by changing the
interface to require the mount point to be referenced. dounmount()
consumes the reference on return, regardless of the sucessfull or
erronous result.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 281562 15-Apr-2015 rmacklem

File systems that do not use the buffer cache (such as ZFS) must
use VOP_FSYNC() to perform the NFS server's Commit operation.
This patch adds a mnt_kern_flag called MNTK_USES_BCACHE which
is set by file systems that use the buffer cache. If this flag
is not set, the NFS server always does a VOP_FSYNC().
This should be ok for old file system modules that do not set
MNTK_USES_BCACHE, since calling VOP_FSYNC() is correct, although
it might not be optimal for file systems that use the buffer cache.

Reviewed by: kib
MFC after: 2 weeks


# 279362 27-Feb-2015 kib

The VNASSERT in vflush() FORCECLOSE case is trying to panic early to
prevent errors from yanking devices out from under filesystems. Only
care about special vnodes on devfs, special nodes on other kinds of
filesystems do not have special properties.

Sponsored by: EMC / Isilon Storage Division
Submitted by: Conrad Meyer
MFC after: 1 week


# 278891 17-Feb-2015 ngie

Add the mnt_lockref field to the ddb(4) 'show mount' command

MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D1688
Submitted by: Conrad Meyer <conrad.meyer@isilon.com>
Sponsored by: EMC / Isilon Storage Division


# 278760 14-Feb-2015 jhb

Add two new counters for vnode life cycle events:
- vfs.recycles counts the number of vnodes forcefully recycled to avoid
exceeding kern.maxvnodes.
- vfs.vnodes_created counts the number of vnodes created by successful
calls to getnewvnode().

Differential Revision: https://reviews.freebsd.org/D1671
Reviewed by: kib
MFC after: 1 week


# 277712 25-Jan-2015 jhb

Change the default VFS timestamp precision from seconds to microseconds.

Discussed on: arch@
MFC after: 2 weeks


# 275743 13-Dec-2014 kib

The vinactive() call in vgonel() may start writes for the dirty pages,
creating delayed write buffers belonging to the reclaimed vnode. Put
the buffer cleanup code after inactivation.

Add asserts that ensure that buffer queues are empty and add BO_DEAD
flag for bufobj to check that no buffers are added after the cleanup.
BO_DEAD is only used by INVARIANTS-enabled kernels.

Reported and tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 275637 09-Dec-2014 kib

Apply chunk forgotten in r275620. Remove local variable for real.

CID: 1257462
Sponsored by: The FreeBSD Foundation


# 275620 08-Dec-2014 kib

Add functions syncer_suspend() and syncer_resume(), which are supposed
to be called before suspension and after resume, correspondingly. The
syncer_suspend() ensures that all filesystems dirty data and metadata
are saved to the permanent storage, and stops kernel threads which
might modify filesystems. The syncer_resume() restores stopped
threads.

For now, only syncer is stopped. This is needed, because each sync
loop causes superblock updates for UFS.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 273118 15-Oct-2014 mjg

Don't take devmtx unnecessarily in vn_isdisk.

MFC after: 1 week


# 272366 01-Oct-2014 will

In the syncer, drop the sync mutex while patting the watchdog.

Some watchdog drivers (like ipmi) need to sleep while patting the watchdog.
See sys/dev/ipmi/ipmi.c:ipmi_wd_event(), which calls malloc(M_WAITOK).

Submitted by: asomers
MFC after: 1 month
Sponsored by: Spectra Logic
MFSpectraBSD: 637548 on 2012/10/04


# 269457 03-Aug-2014 kib

Remove Giant acquisition from the mount and unmount pathes.

It could be claimed that two things were reasonable protected by
Giant. One is vfsconf list links, which is converted to the new
dedicated sx vfsconf_sx. Another is vfsconf.vfc_refcount, which is
now updated with atomics.

Note that vfc_refcount still has the same races now as it has under
the Giant, the unload of filesystem modules can happen while the
module is still in use.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 269244 29-Jul-2014 kib

Remove one-time use macros which check for the vnode lifecycle. More,
some parts of the checks are in fact redundand in the surrounding
code, and it is more clear what the conditions are by direct testing
of the flags. Two of the three macros were only used in assertions.

In vnlru_free(), all relevant parts of vholdl() were already inlined,
except the increment of v_holdcnt itself. Do not call vholdl() to do
the increment as well, this allows to make assertions in
vholdl()/vhold() more strict.

In v_incr_usecount(), call vholdl() before incrementing other ref
counters. The change is no-op, but it makes less surprising to see
the vnode state in debugger if interrupted inside v_incr_usecount().

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 267392 12-Jun-2014 mav

Implement simple direct-mapped cache for popular filesystem identifiers to
avoid congestion on global mountlist_mtx mutex in vfs_busyfs(), while
traversing through the list of mount points.

This change significantly improves NFS server scalability, since it had
to do this translation for every request, and the global lock becomes quite
congested.

This code is more optimized for relatively small number of mount points.
On systems with hundreds of active mount points this simple cache may have
many collisions. But the original traversal code in that case should also
behave much worse, so we are not loosing much.

Reviewed by: attilio
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.


# 267362 11-Jun-2014 mav

Remove unneeded mountlist_mtx acquisition from sync_fsync().

All struct mount fields accessed by sync_fsync() are protected by MNT_MTX.


# 267239 08-Jun-2014 mav

Remove extra branching from r267232.

MFC after: 2 weeks


# 267232 08-Jun-2014 mav

Use atomics to modify numvnodes variable.

This allows to mostly avoid lock usage in getnewvnode_[drop_]reserve(),
that reduces number of global vnode_free_list_mtx mutex acquisitions
from 4 to 2 per NFS request on ZFS, improving SMP scalability.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.


# 266482 21-May-2014 bjk

Check for mismatched vref()/vdrop()

Assert that the hold count has not fallen below the use count, a situation
that would only happen when a vref() (or similar) is erroneously paired
with a vdrop(). This situation has not been observed in the wild, but
could be helpful for someone implementing a new filesystem.

Reviewed by: kib
Approved by: hrs (mentor)


# 263620 22-Mar-2014 bdrewery

Rename global cnt to vm_cnt to avoid shadowing.

To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.

Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.

Exp-run revealed no ports using it directly.

No objection from: arch@
Sponsored by: EMC / Isilon Storage Division


# 256211 09-Oct-2013 kib

Do not flush buffers when the v_object of the passed vnode does not
really belong to it. Such vnodes, with the pointers to other vnodes
v_objects, are typically instantiated by the bypass filesystems.
Invalidating mappings of other vnode pages and the pages is wrong,
since reclamation of the upper vnode does not imply that lower vnode
is reclaimed too.

One of the consequences of the improper reclamation was destruction of
the wired mappings of the lower vnode pages, triggering miscellaneous
assertions in the VM system.

Reported by: John Marshall <john.marshall@riverwillow.com.au>
Tested by: John Marshall <john.marshall@riverwillow.com.au>, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (gjb)


# 255979 01-Oct-2013 kib

When printing the vnode information from ddb, print the lengths of the
dirty and clean buffer queues.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (gjb)


# 255942 29-Sep-2013 kib

For vunref(), try to upgrade the vnode lock if the function was called
with the vnode shared-locked. If upgrade succeeded, the inactivation
can be done immediately, instead of being postponed.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (glebius)


# 255880 26-Sep-2013 kib

Acquire a hold reference on the vnode when a knote is instantiated.
Otherwise, knote keeps a pointer to a vnode which could become invalid
any time.

Reported by: many
Tested by: Patrick Lamaiziere <patfbsd@davenulle.org>
Discussed with: jmg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (marius)


# 254446 17-Aug-2013 pjd

In r114945 the line 'nmp = TAILQ_NEXT(mp, mnt_list);' was duplicated.
Instead of just removing the duplicate, convert the loop to TAILQ_FOREACH().


# 253737 28-Jul-2013 kib

When creation of the v_pollinfo raced and our instance of vpollinfo
must be destroyed, knlist_clear() and seldrain() calls could be
avoided, since vpollinfo was not used. More, the knlist_clear()
calling protocol requires the knlist locked, which is not true at the
call site.

Split the destruction into the helper destroy_vpollinfo_free(), and
call it when raced, instead of destroy_vpollinfo().

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# 253417 17-Jul-2013 kib

Clear the vnode knotes before destroying vpollinfo.

Reported and tested by: Patrick Lamaiziere <patfbsd@davenulle.org>
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 251322 03-Jun-2013 kib

Be more generous when donating the current thread time to the owner of
the vnode lock while iterating over the free vnode list. Instead of
yielding, pause for 1 tick. The change is reported to help in some
virtualized environments.

Submitted by: Roger Pau Monn? <roger.pau@citrix.com>
Discussed with: jilles
Tested by: pho
MFC after: 2 weeks


# 251171 31-May-2013 jeff

- Convert the bufobj lock to rwlock.
- Use a shared bufobj lock in getblk() and inmem().
- Convert softdep's lk to rwlock to match the bufobj lock.
- Move INFREECNT to b_flags and protect it with the buf lock.
- Remove unnecessary locking around bremfree() and BKGRDINPROG.

Sponsored by: EMC / Isilon Storage Division
Discussed with: mckusick, kib, mdf


# 250551 12-May-2013 jeff

- Add a new general purpose path-compressed radix trie which can be used
with any structure containing a uint64_t index. The tree code
auto-generates type safe wrappers.
- Eliminate the buf splay and replace it with pctrie. This is not only
significantly faster with large files but also allows for the possibility
of shared locking.

Reviewed by: alc, attilio
Sponsored by: EMC / Isilon Storage Division


# 250505 11-May-2013 kib

- Fix nullfs vnode reference leak in nullfs_reclaim_lowervp(). The
null_hashget() obtains the reference on the nullfs vnode, which must
be dropped.

- Fix a wart which existed from the introduction of the nullfs
caching, do not unlock lower vnode in the nullfs_reclaim_lowervp().
It should be innocent, but now it is also formally safe. Inform the
nullfs_reclaim() about this using the NULLV_NOUNLOCK flag set on
nullfs inode.

- Add a callback to the upper filesystems for the lower vnode
unlinking. When inactivating a nullfs vnode, check if the lower
vnode was unlinked, indicated by nullfs flag NULLV_DROP or VV_NOSYNC
on the lower vnode, and reclaim upper vnode if so. This allows
nullfs to purge cached vnodes for the unlinked lower vnode, avoiding
excessive caching.

Reported by: G??ran L??wkrantz <goran.lowkrantz@ismobile.com>
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 250411 09-May-2013 marcel

Add option WITNESS_NO_VNODE to suppress printing LORs between VNODE
locks. To support this, VNODE locks are created with the LK_IS_VNODE
flag. This flag is propagated down using the LO_IS_VNODE flag.

Note that WITNESS still records the LOR. Only the printing and the
optional entering into the kernel debugger is bypassed with the
WITNESS_NO_VNODE option.


# 250247 04-May-2013 mdf

Add missing vdrop() in error case.

Submitted by: Fahad (mohd.fahadullah@isilon.com)
MFC after: 1 week


# 249548 16-Apr-2013 rmacklem

Allow the vnode to be unlocked for the weird case of
LK_EXCLOTHER. LK_EXCLOTHER is only used to acquire a
usecount on a vnode during NFSv4 recovery from an
expired lease.

Reported and tested by: pho
MFC after: 2 weeks


# 249218 06-Apr-2013 jeff

Prepare to replace the buf splay with a trie:

- Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists.
No consumers need to find them there and it complicates the tree.
These flags are all FFS specific and could be moved out of the buf
cache.
- Use pbgetvp() and pbrelvp() to associate the background and journal
bufs with the vp. Not only is this much cheaper it makes more sense
for these transient bufs.
- Fix the assertions in pbget* and pbrel*. It's not safe to check list
pointers which were never initialized. Use the BX flags instead. We
also check B_PAGING in reassignbuf() so this should cover all cases.

Discussed with: kib, mckusick, attilio
Sponsored by: EMC / Isilon Storage Division


# 248084 09-Mar-2013 attilio

Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
At this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit. Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho


# 245411 14-Jan-2013 kib

Add a trivial comment to record the proper commit log for r245407:

Set the v_hash for a new vnode in the getnewvnode() to the value
calculated based on the vnode structure address. Filesystems using
vfs_hash_insert() override the v_hash using the standard formula of
(inode_number + mnt_hashseed). For other filesystems, the
initialization allows the vfs_hash_index() to provide useful hash too.

Suggested, reviewed and tested by: peter
Sponsored by: The FreeBSD Foundation
MFC after: 5 days


# 245407 14-Jan-2013 kib

diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
index 7c243b6..0bdaf36 100644
--- a/sys/kern/vfs_subr.c
+++ b/sys/kern/vfs_subr.c
@@ -279,6 +279,7 @@ SYSCTL_INT(_debug, OID_AUTO, vnlru_nowhere, CTLFLAG_RW,
#define VSHOULDFREE(vp) (!((vp)->v_iflag & VI_FREE) && !(vp)->v_holdcnt)
#define VSHOULDBUSY(vp) (((vp)->v_iflag & VI_FREE) && (vp)->v_holdcnt)

+static int vnsz2log;

/*
* Initialize the vnode management data structures.
@@ -293,6 +294,7 @@ SYSCTL_INT(_debug, OID_AUTO, vnlru_nowhere, CTLFLAG_RW,
static void
vntblinit(void *dummy __unused)
{
+ u_int i;
int physvnodes, virtvnodes;

/*
@@ -332,6 +334,9 @@ vntblinit(void *dummy __unused)
syncer_maxdelay = syncer_mask + 1;
mtx_init(&sync_mtx, "Syncer mtx", NULL, MTX_DEF);
cv_init(&sync_wakeup, "syncer");
+ for (i = 1; i <= sizeof(struct vnode); i <<= 1)
+ vnsz2log++;
+ vnsz2log--;
}
SYSINIT(vfs, SI_SUB_VFS, SI_ORDER_FIRST, vntblinit, NULL);

@@ -1067,6 +1072,14 @@ alloc:
}
rangelock_init(&vp->v_rl);

+ /*
+ * For the filesystems which do not use vfs_hash_insert(),
+ * still initialize v_hash to have vfs_hash_index() useful.
+ * E.g., nullfs uses vfs_hash_index() on the lower vnode for
+ * its own hashing.
+ */
+ vp->v_hash = (uintptr_t)vp >> vnsz2log;
+
*vpp = vp;
return (0);
}


# 244706 26-Dec-2012 attilio

Fixup r244240: mp_ncpus will be 1 also in the !SMP and smp_disabled=1
case. There is no point in optimizing further the code and use a TRUE
litteral for a path that does heavyweight stuff anyway (like lock acq),
at the price of obfuscated code.

Use the appropriate check where necessary and remove a macro.

Sponsored by: EMC / Isilon storage division
MFC after: 3 days


# 244534 21-Dec-2012 attilio

Fixup r218424: uio_yield() was scaling directly to userland priority.
When kern_yield() was introduced with the possibility to specify
a new priority, the behaviour changed by not lowering priority at all
in the consumers, making the yielding mechanism highly ineffective for
high priority kthreads like bufdaemon, syncer, vlrudaemon, etc.
There are no evidences that consumers could bear with such change in
semantic and this situation could finally lead to bugs similar to the
ones fixed in r244240.
Re-specify userland pri for kthreads involved.

Tested by: pho
Reviewed by: kib, mdf
MFC after: 1 week


# 244240 15-Dec-2012 kib

When mnt_vnode_next_active iterator cannot lock the next vnode and
yields, specify the user priority for the yield. Otherwise, a
higher-priority (kernel) thread could fall into the priority-inversion
with the thread owning the mutex lock.

On single-processor machines or UP kernels, do not loop adaptively
when the next vnode cannot be locked, instead yield unconditionally.

Restructure the iteration initializer and the iterator to remove code
duplication. Put the code to fetch and lock a vnode next to the
current marker, into the mnt_vnode_next_active() function, and use it
instead of repeating the loop.

Reported by: hrs, rmacklem
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# 244095 10-Dec-2012 kib

Do not yield while owning a mutex. The Giant reacquire in the
kern_yield() is problematic than.

The owned mutex is the mount interlock, and it is in fact not needed
to guarantee the stability of the mount list of active vnodes, so fix
the the issue by only taking the mount interlock for MNT_REF and
MNT_REL operations.

While there, augment the unconditional yield by some amount of
spinning [1].

Reported and tested by: pho
Reviewed by: attilio
Submitted by: attilio [1]
MFC after: 3 days


# 243835 03-Dec-2012 kib

The vnode_free_list_mtx is required unconditionally when iterating
over the active list. The mount interlock is not enough to guarantee
the validity of the tailq link pointers. The __mnt_vnode_next_active()
and __mnt_vnode_first_active() active lists iterators helper functions
did not provided the neccessary stability for the list, allowing the
iterators to pick garbage.

This was uncovered after the r243599 made the active list iterators
non-nop.

Since a vnode interlock is before the vnode_free_list_mtx, obtain the
vnode ilock in the non-blocking manner when under vnode_free_list_mtx,
and restart iteration after the yield if the lock attempt failed.

Assert that a vnode found on the list is active, and assert that the
helpers return the vnode with interlock owned.

Reported and tested by: pho
MFC after: 1 week


# 243599 27-Nov-2012 davidxu

Take first active vnode correctly.

Reviewed by: kib
MFC after: 3 days


# 243499 24-Nov-2012 avg

assert_vop_locked: make the assertion race-free and more efficient

this is really a minor improvement for the sake of correctness

MFC after: 6 days


# 243400 22-Nov-2012 avg

remove vop_lookup_pre and vop_lookup_post

Suggested by: kib
MFC after: 5 days


# 243307 19-Nov-2012 attilio

insmntque() is always called with the lock held in exclusive mode,
then:
- assume the lock is held in exclusive mode and remove a moot check
about the lock acquisition.
- in the destructor remove !MPSAFE specific chunk.

Reviewed by: kib
MFC after: 2 weeks


# 243272 19-Nov-2012 avg

assert_vop_locked should treat LK_EXCLOTHER as the not locked case

... from a perspective of the current thread.

Spotted by: mjg
Discussed with: kib
MFC after: 18 days


# 243271 19-Nov-2012 avg

vnode_if: fix locking protocol description for lookup and cachedlookup

Also remove the checks from vop_lookup_pre and vop_lookup_post, which
are now completely redundant (before this change they were partially
redundant).

Discussed with: kib
MFC after: 10 days


# 242833 09-Nov-2012 attilio

Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag.
Porters should refer to __FreeBSD_version 1000021 for this change as
it may have happened at the same timeframe.


# 242617 05-Nov-2012 kib

A clarification to the behaviour of the active vnode list management
regarding the vnode page cleaning.

In collaboration with: pho
MFC after: 1 week


# 242560 04-Nov-2012 kib

Add decoding of the missed MNT_KERN_ flags to ddb "show mount" command.

MFC after: 3 weeks


# 242559 04-Nov-2012 kib

Add decoding of the missed VI_ and VV_ flags to ddb "show vnode" command.

MFC after: 3 days


# 242556 04-Nov-2012 kib

Order the enumeration of the MNT_ flags to be the same as the order of
their definitions.

MFC after: 3 days


# 241896 22-Oct-2012 kib

Remove the support for using non-mpsafe filesystem modules.

In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by: attilio
Tested by: pho


# 241556 14-Oct-2012 kib

Add a KPI to allow to reserve some amount of space in the numvnodes
counter, without actually allocating the vnodes. The supposed use of
the getnewvnode_reserve(9) is to reclaim enough free vnodes while the
code still does not hold any resources that might be needed during the
reclamation, and to consume the slack later for getnewvnode() calls
made from the innards. After the critical block is finished, the
caller shall free any reserve left, by getnewvnode_drop_reserve(9).

Reviewed by: avg
Tested by: pho
MFC after: 1 week


# 240475 13-Sep-2012 attilio

Remove all the checks on curthread != NULL with the exception of some MD
trap checks (eg. printtrap()).

Generally this check is not needed anymore, as there is not a legitimate
case where curthread != NULL, after pcpu 0 area has been properly
initialized.

Reviewed by: bde, jhb
MFC after: 1 week


# 240284 09-Sep-2012 kib

Add a facility for vgone() to inform the set of subscribed mounts
about vnode reclamation. Typical use is for the bypass mounts like
nullfs to get a notification about lower vnode going away.

Now, vgone() calls new VFS op vfs_reclaim_lowervp() with an argument
lowervp which is reclaimed. It is possible to register several
reclamation event listeners, to correctly handle the case of several
nullfs mounts over the same directory.

For the filesystem not having nullfs mounts over it, the overhead
added is a single mount interlock lock/unlock in the vnode reclamation
path.

In collaboration with: pho
MFC after: 3 weeks


# 239588 22-Aug-2012 kib

Provide some compat32 shims for sysctl vfs.conflist. It is required
for getvfsbyname(3) operation when called from 32bit process, and
getvfsbyname(3) is used by recent bsdtar import.

Reported by: many
Tested by: David Naylor <naylor.b.david@gmail.com>
MFC after: 5 days


# 236503 03-Jun-2012 avg

free wdog_kern_pat calls in post-panic paths from under SW_WATCHDOG

Those calls are useful with hardware watchdog drivers too.

MFC after: 3 weeks


# 236317 30-May-2012 kib

Add a rangelock implementation, intended to be used to range-locking
the i/o regions of the vnode data space. The implementation is quite
simple-minded, it uses the list of the lock requests, ordered by
arrival time. Each request may be for read or for write. The
implementation is fair FIFO.

MFC after: 2 month


# 234607 23-Apr-2012 trasz

Remove unused thread argument to vrecycle().

Reviewed by: kib


# 234605 23-Apr-2012 trasz

Remove unused thread argument from vtruncbuf().

Reviewed by: kib


# 234483 20-Apr-2012 mckusick

This update uses the MNT_VNODE_FOREACH_ACTIVE interface that loops
over just the active vnodes associated with a mount point to replace
MNT_VNODE_FOREACH_ALL in the vfs_msync, ffs_sync_lazy, and qsync
routines.

The vfs_msync routine is run every 30 seconds for every writably
mounted filesystem. It ensures that any files mmap'ed from the
filesystem with modified pages have those pages queued to be
written back to the file from which they are mapped.

The ffs_lazy_sync and qsync routines are run every 30 seconds for
every writably mounted UFS/FFS filesystem. The ffs_lazy_sync routine
ensures that any files that have been accessed in the previous
30 seconds have had their access times queued for updating in the
filesystem. The qsync routine ensures that any files with modified
quotas have those quotas queued to be written back to their
associated quota file.

In a system configured with 250,000 vnodes, less than 1000 are
typically active at any point in time. Prior to this change all
250,000 vnodes would be locked and inspected twice every minute
by the syncer. For UFS/FFS filesystems they would be locked and
inspected six times every minute (twice by each of these three
routines since each of these routines does its own pass over the
vnodes associated with a mount point). With this change the syncer
now locks and inspects only the tiny set of vnodes that are active.

Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks


# 234482 20-Apr-2012 mckusick

This change creates a new list of active vnodes associated with
a mount point. Active vnodes are those with a non-zero use or hold
count, e.g., those vnodes that are not on the free list. Note that
this list is in addition to the list of all the vnodes associated
with a mount point.

To avoid adding another set of linkage pointers to the vnode
structure, the active list uses the existing linkage pointers
used by the free list (previously named v_freelist, now renamed
v_actfreelist).

This update adds the MNT_VNODE_FOREACH_ACTIVE interface that loops
over just the active vnodes associated with a mount point (typically
less than 1% of the vnodes associated with the mount point).

Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks


# 234443 18-Apr-2012 mckusick

Delete a no longer useful VNASSERT missed during changes in 234400.

Suggested by: kib


# 234441 18-Apr-2012 mckusick

Fix a memory leak of M_VNODE_MARKER introduced in 234386.

Found by: Peter Holm


# 234400 17-Apr-2012 mckusick

Drop export of vdestroy() function from kern/vfs_subr.c as it is
used only as a helper function in that file. Replace sole call to
vbusy() with inline code in vholdl(). Replace sole calls to vfree()
and vdestroy() with inline code in vdropl().

The Clang compiler already inlines these functions, so they do not
show up in a kernel backtrace which is confusing. Also you cannot
set their frame in kgdb which means that it is impossible to view
their local variables. So, while the produced code is unchanged,
the debugging should be easier.

Discussed with: kib
MFC after: 2 weeks


# 234386 17-Apr-2012 mckusick

Replace the MNT_VNODE_FOREACH interface with MNT_VNODE_FOREACH_ALL.
The primary changes are that the user of the interface no longer
needs to manage the mount-mutex locking and that the vnode that
is returned has its mutex locked (thus avoiding the need to check
to see if its is DOOMED or other possible end of life senarios).

To minimize compatibility issues for third-party developers, the
old MNT_VNODE_FOREACH interface will remain available so that this
change can be MFC'ed to 9. Following the MFC to 9, MNT_VNODE_FOREACH
will be removed in head.

The reason for this update is to prepare for the addition of the
MNT_VNODE_FOREACH_ACTIVE interface that will loop over just the
active vnodes associated with a mount point (typically less than
1% of the vnodes associated with the mount point).

Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks


# 234158 11-Apr-2012 mckusick

Export vinactive() from kern/vfs_subr.c (e.g., make it no longer
static and declare its prototype in sys/vnode.h) so that it can be
called from process_deferred_inactive() (in ufs/ffs/ffs_snapshot.c)
instead of the body of vinactive() being cut and pasted into
process_deferred_inactive().

Reviewed by: kib
MFC after: 2 weeks


# 232709 09-Mar-2012 kib

Decomission mnt_noasync. Introduce MNTK_NOASYNC mnt_kern_flag which
allows a filesystem to request VFS to not allow MNTK_ASYNC.

MFC after: 1 week


# 232152 25-Feb-2012 trociny

When detaching an unix domain socket, uipc_detach() checks
unp->unp_vnode pointer to detect if there is a vnode associated with
(binded to) this socket and does necessary cleanup if there is.

The issue is that after forced unmount this check may be too late as
the unp_vnode is reclaimed and the reference is stale.

To fix this provide a helper function that is called on a socket vnode
reclamation to do necessary cleanup.

Pointed by: kib
Reviewed by: kib
MFC after: 2 weeks


# 231075 06-Feb-2012 kib

Current implementations of sync(2) and syncer vnode fsync() VOP uses
mnt_noasync counter to temporary remove MNTK_ASYNC mount option, which
is needed to guarantee a synchronous completion of the initiated i/o
before syscall or VOP return. Global removal of MNTK_ASYNC option is
harmful because not only i/o started from corresponding thread becomes
synchronous, but all i/o is synchronous on the filesystem which is
initiated during sync(2) or syncer activity.

Instead of removing MNTK_ASYNC from mnt_kern_flag, provide a local
thread flag to disable async i/o for current thread only. Use the
opportunity to move DOINGASYNC() macro into sys/vnode.h and
consistently use it through places which tested for MNTK_ASYNC.

Some testing demonstrated 60-70% improvements in run time for the
metadata-intensive operations on async-mounted UFS volumes, but still
with great deviation due to other reasons.

Reviewed by: mckusick
Tested by: scottl
MFC after: 2 weeks


# 230553 25-Jan-2012 kib

When doing vflush(WRITECLOSE), clean vnode pages.

Unmounts do vfs_msync() before calling VFS_UNMOUNT(), but there is
still a race allowing a process to dirty pages after msync
finished. Remounts rw->ro just left dirty pages in system.

Reviewed by: alc, tegge (long time ago)
Tested by: pho
MFC after: 2 weeks


# 230249 17-Jan-2012 mckusick

Make sure all intermediate variables holding mount flags (mnt_flag)
and that all internal kernel calls passing mount flags are declared
as uint64_t so that flags in the top 32-bits are not lost.

MFC after: 2 weeks


# 229727 06-Jan-2012 jhb

Use proper argument structure types for the extattr post-VOP hooks.
The wrong structure happened to work since the only argument used was
the vnode which is in the same place in both VOP_SETATTR() and the two
extattr VOPs.

MFC after: 3 days


# 228849 23-Dec-2011 jhb

Add post-VOP hooks for VOP_DELETEEXTATTR() and VOP_SETEXTATTR() and use
these to trigger a NOTE_ATTRIB EVFILT_VNODE kevent when the extended
attributes of a vnode are changed.

Note that OS X already implements this behavior.

Reviewed by: rwatson
MFC after: 2 weeks


# 227070 04-Nov-2011 jhb

Add the posix_fadvise(2) system call. It is somewhat similar to
madvise(2) except that it operates on a file descriptor instead of a
memory region. It is currently only supported on regular files.

Just as with madvise(2), the advice given to posix_fadvise(2) can be
divided into two types. The first type provide hints about data access
patterns and are used in the file read and write routines to modify the
I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are
thus filesystem independent. Note that to ease implementation (and
since this API is only advisory anyway), only a single non-normal
range is allowed per file descriptor.

The second type of hints are used to hint to the OS that data will or
will not be used. These hints are implemented via a new VOP_ADVISE().
A default implementation is provided which does nothing for the WILLNEED
request and attempts to move any clean pages to the cache page queue for
the DONTNEED request. This latter case required two other changes.
First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests
vinvalbuf() to only flush clean buffers for the vnode from the buffer
cache and to not remove any backing pages from the vnode. This is
used to ensure clean pages are not wired into the buffer cache before
attempting to move them to the cache page queue. The second change adds
a new vm_object_page_cache() method. This method is somewhat similar to
vm_object_page_remove() except that instead of freeing each page in the
specified range, it attempts to move clean pages to the cache queue if
possible.

To preserve the ABI of struct file, the f_cdevpriv pointer is now reused
in a union to point to the currently active advice region if one is
present for regular files.

Reviewed by: jilles, kib, arch@
Approved by: re (kib)
MFC after: 1 month


# 226849 27-Oct-2011 jhb

Whitespace fix.


# 226734 25-Oct-2011 pjd

The v_data field is a pointer, so set it to NULL, not 0.

MFC after: 3 days


# 226098 07-Oct-2011 jonathan

Change one printf() to log().

As noted in kern/159780, printf() is not very jail-friendly, since it can't be easily monitored by jail management tools. This patch reports an error via log() instead, which, if nobody is watching the log file, still prints to the console.

Approved by: mentor (rwatson)
Submitted by: Eugene Grosbein <eugen@eg.sd.rdtc.ru>
MFC after: 5 days


# 226022 04-Oct-2011 kib

Move parts of the commit log for r166167, where Tor explained the
interaction between vnode locks and vfs_busy(), into comment.

MFC after: 1 week


# 225177 25-Aug-2011 attilio

Fix a deficiency in the selinfo interface:
If a selinfo object is recorded (via selrecord()) and then it is
quickly destroyed, with the waiters missing the opportunity to awake,
at the next iteration they will find the selinfo object destroyed,
causing a PF#.

That happens because the selinfo interface has no way to drain the
waiters before to destroy the registered selinfo object. Also this
race is quite rare to get in practice, because it would require a
selrecord(), a poll request by another thread and a quick destruction
of the selrecord()'ed selinfo object.

Fix this by adding the seldrain() routine which should be called
before to destroy the selinfo objects (in order to avoid such case),
and fix the present cases where it might have already been called.
Sometimes, the context is safe enough to prevent this type of race,
like it happens in device drivers which installs selinfo objects on
poll callbacks. There, the destruction of the selinfo object happens
at driver detach time, when all the filedescriptors should be already
closed, thus there cannot be a race.
For this case, mfi(4) device driver can be set as an example, as it
implements a full correct logic for preventing this from happening.

Sponsored by: Sandvine Incorporated
Reported by: rstone
Tested by: pluknet
Reviewed by: jhb, kib
Approved by: re (bz)
MFC after: 3 weeks


# 224294 24-Jul-2011 mckusick

Move the MNTK_SUJ flag in mnt_kern_flag to MNT_SUJ in mnt_flag
so that it is visible to userland programs. This change enables
the `mount' command with no arguments to be able to show if a
filesystem is mounted using journaled soft updates as opposed
to just normal soft updates.

Approved by: re (bz)


# 223677 29-Jun-2011 alc

Add a new option, OBJPR_NOTMAPPED, to vm_object_page_remove(). Passing this
option to vm_object_page_remove() asserts that the specified range of pages
is not mapped, or more precisely that none of these pages have any managed
mappings. Thus, vm_object_page_remove() need not call pmap_remove_all() on
the pages.

This change not only saves time by eliminating pointless calls to
pmap_remove_all(), but it also eliminates an inconsistency in the use of
pmap_remove_all() versus related functions, like pmap_remove_write(). It
eliminates harmless but pointless calls to pmap_remove_all() that were being
performed on PG_UNMANAGED pages.

Update all of the existing assertions on pmap_remove_all() to reflect this
change.

Reviewed by: kib


# 223505 24-Jun-2011 jonathan

Tidy up a capabilities-related comment.

This comment refers to an #ifdef that hasn't been merged [yet?]; remove it.

Approved by: rwatson


# 221829 13-May-2011 mdf

Use a name instead of a magic number for kern_yield(9) when the priority
should not change. Fetch the td_user_pri under the thread lock. This
is probably not necessary but a magic number also seems preferable to
knowing the implementation details here.

Requested by: Jason Behmer < jason DOT behmer AT isilon DOT com >


# 221173 28-Apr-2011 attilio

Add the watchdogs patting during the (shutdown time) disk syncing and
disk dumping.
With the option SW_WATCHDOG on, these operations are doomed to let
watchdog fire, fi they take too long.

I implemented the stubs this way because I really want wdog_kern_*
KPI to not be dependant by SW_WATCHDOG being on (and really, the option
only enables watchdog activation in hardclock) and also avoid to
call them when not necessary (avoiding not-volountary watchdog
activations).

Sponsored by: Sandvine Incorporated
Discussed with: emaste, des
MFC after: 2 weeks


# 220967 23-Apr-2011 rmacklem

Fix a LOR in vfs_busy() where, after msleeping, it would lock
the mutexes in the wrong order for the case where the
MBF_MNTLSTLOCK is set. I believe this did have the
potential for deadlock. For example, if multiple nfsd threads
called vfs_busyfs(), which calls vfs_busy() with MBF_MNTLSTLOCK.
Thanks go to pho for catching this during his testing.

Tested by: pho
Submitted by: kib
MFC after: 2 weeks


# 220328 04-Apr-2011 pluknet

Remove malloc type M_NETADDR unused since splitting into vfs_subr.c
and vfs_export.c.

MFC after: 1 week


# 219396 08-Mar-2011 kib

Do not assert buffer lock in VFS_STRATEGY() when kernel already paniced.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 218424 08-Feb-2011 mdf

Based on discussions on the svn-src mailing list, rework r218195:

- entirely eliminate some calls to uio_yeild() as being unnecessary,
such as in a sysctl handler.

- move should_yield() and maybe_yield() to kern_synch.c and move the
prototypes from sys/uio.h to sys/proc.h

- add a slightly more generic kern_yield() that can replace the
functionality of uio_yield().

- replace source uses of uio_yield() with the functional equivalent,
or in some cases do not change the thread priority when switching.

- fix a logic inversion bug in vlrureclaim(), pointed out by bde@.

- instead of using the per-cpu last switched ticks, use a per thread
variable for should_yield(). With PREEMPTION, the only reasonable
use of this is to determine if a lock has been held a long time and
relinquish it. Without PREEMPTION, this is essentially the same as
the per-cpu variable.


# 218195 02-Feb-2011 mdf

Put the general logic for being a CPU hog into a new function
should_yield(). Use this in various places. Encapsulate the common
case of check-and-yield into a new function maybe_yield().

Change several checks for a magic number of iterations to use
should_yield() instead.

MFC after: 1 week


# 217824 25-Jan-2011 kib

When vtruncbuf() iterates over the vnode buffer list, lock buffer object
before checking the validity of the next buffer pointer. Otherwise, the
buffer might be reclaimed after the check, causing iteration to run into
wrong buffer.

Reported and tested by: pho
MFC after: 1 week


# 217555 18-Jan-2011 mdf

Specify a CTLTYPE_FOO so that a future sysctl(8) change does not need
to rely on the format string.


# 217326 12-Jan-2011 mdf

sysctl(9) cleanup checkpoint: amd64 GENERIC builds cleanly.

Commit the kernel changes.


# 217076 06-Jan-2011 jhb

- Restore dropping the priority of syncer down to PPAUSE when it is idle.
This was lost when it was converted to using a condition variable instead
of lbolt.
- Drop the priority of flowtable down to PPAUSE when it is idle as well
since it is a similar background task.

MFC after: 2 weeks


# 216733 27-Dec-2010 kib

Teach ddb "show mount" about MNTK_SUJ flag.


# 215797 24-Nov-2010 kib

Allow shared-locked vnode to be passed to vunref(9).
When shared-locked vnode is supplied as an argument to vunref(9) and
resulting usecount is 0, set VI_OWEINACT and do not try to upgrade vnode
lock. The later could cause vnode unlock, allowing the vnode to be
reclaimed meantime.

Tested by: pho
MFC after: 1 week


# 215548 19-Nov-2010 kib

Remove prtactive variable and related printf()s in the vop_inactive
and vop_reclaim() methods. They seems to be unused, and the reported
situation is normal for the forced unmount.

MFC after: 1 week
X-MFC-note: keep prtactive symbol in vfs_subr.c


# 215304 14-Nov-2010 brucec

Fix some more style(9) issues.


# 215283 14-Nov-2010 brucec

Fix style(9) issues from r215281 and r215282.

MFC after: 1 week


# 215282 14-Nov-2010 brucec

Add descriptions to some more sysctls.

PR: kern/148510
MFC after: 1 week


# 212466 11-Sep-2010 kib

Protect mnt_syncer with the sync_mtx. This prevents a (rare) vnode leak
when mount and update are executed in parallel.

Encapsulate syncer vnode deallocation into the helper function
vfs_deallocate_syncvnode(), to not externalize sync_mtx from vfs_subr.c.

Found and reviewed by: jh (previous version of the patch)
Tested by: pho
MFC after: 3 weeks


# 212096 01-Sep-2010 emaste

As long as we are going to panic anyway, there's no need to hide additional
information behind DIAGNOSTIC.


# 212002 30-Aug-2010 jh

execve(2) has a special check for file permissions: a file must have at
least one execute bit set, otherwise execve(2) will return EACCES even
for an user with PRIV_VFS_EXEC privilege.

Add the check also to vaccess(9), vaccess_acl_nfs4(9) and
vaccess_acl_posix1e(9). This makes access(2) to better agree with
execve(2). Because ZFS doesn't use vaccess(9) for VEXEC, add the check
to zfs_freebsd_access() too. There may be other file systems which are
not using vaccess*() functions and need to be handled separately.

PR: kern/125009
Reviewed by: bde, trasz
Approved by: pjd (ZFS part)


# 211930 28-Aug-2010 pjd

There is a bug in vfs_allocate_syncvnode() failure handling in mount code.
Actually it is hard to properly handle such a failure, especially in MNT_UPDATE
case. The only reason for the vfs_allocate_syncvnode() function to fail is
getnewvnode() failure. Fortunately it is impossible for current implementation
of getnewvnode() to fail, so we can assert this and make
vfs_allocate_syncvnode() void. This in turn free us from handling its failures
in the mount code.

Reviewed by: kib
MFC after: 1 month


# 211213 12-Aug-2010 kib

The buffers b_vflags field is not always properly protected by
bufobj lock. If b_bufobj is not NULL, then bufobj lock should be
held when manipulating the flags. Not doing this sometimes leaves
BV_BKGRDINPROG to be erronously set, causing softdep' getdirtybuf() to
stuck indefinitely in "getbuf" sleep, waiting for background write to
finish which is not actually performed.

Add BO_LOCK() in the cases where it was missed.

In collaboration with: pho
Tested by: bz
Reviewed by: jeff
MFC after: 1 month


# 210837 04-Aug-2010 alc

In order for MAXVNODES_MAX to be an "int" on powerpc and sparc, we must
cast PAGE_SIZE to an "int". (Powerpc and sparc, unlike the other
architectures, define PAGE_SIZE as a "long".)

Submitted by: Andreas Tobler


# 210782 02-Aug-2010 alc

Update the "desiredvnodes" calculation. In particular, make the part of
the calculation that is based on the kernel's heap size more conservative.
Hopefully, this will eliminate the need for MAXVNODES_MAX, but for the
time being set MAXVNODES_MAX to a large value.

Reviewed by: jhb@
MFC after: 6 weeks


# 209390 21-Jun-2010 ed

Use ISO C99 integer types in sys/kern where possible.

There are only about 100 occurences of the BSD-specific u_int*_t
datatypes in sys/kern. The ISO C99 integer types are used here more
often.


# 209260 17-Jun-2010 pjd

Backout r207970 for now, it can lead to deadlocks.

Reported by: kan
MFC after: 3 days


# 208773 03-Jun-2010 kib

Sometimes vnodes share the lock despite being different vnodes on
different mount points, e.g. the nullfs vnode and the covered vnode
from the lower filesystem. In this case, existing assertion in
vop_rename_pre() may be triggered.

Check for vnode locks equiality instead of the vnodes itself to
not trip over the situation.

Submitted by: Mikolaj Golub <to.my.trociny@gmail.com>
Tested by: pho
MFC after: 2 weeks


# 208003 12-May-2010 zml

Add VOP_ADVLOCKPURGE so that the file system is called when purging
locks (in the case where the VFS impl isn't using lf_*)

Submitted by: Matthew Fleming <matthew.fleming@isilon.com>
Reviewed by: zml, dfr


# 207970 12-May-2010 pjd

When there is no memory or KVA, try to help by reclaiming some vnodes.
This helps with 'kmem_map too small' panics.

No objections from: kib
Tested by: Alexander V. Ribchansky <shurik@zk.informjust.ua>
MFC after: 1 week


# 207937 11-May-2010 pjd

I added vfs_lowvnodes event, but it was only used for a short while and now
it is totally unused. Remove it.

MFC after: 3 days


# 207141 24-Apr-2010 jeff

- Merge soft-updates journaling from projects/suj/head into head. This
brings in support for an optional intent log which eliminates the need
for background fsck on unclean shutdown.

Sponsored by: iXsystems, Yahoo!, and Juniper.
With help from: McKusick and Peter Holm


# 206160 04-Apr-2010 jh

Add missing MNT_NFS4ACLS.


# 206135 03-Apr-2010 pjd

Fix some whitespace nits.


# 206134 03-Apr-2010 pjd

Add missing mnt_kern_flag flags in 'show mount' output.


# 206093 02-Apr-2010 kib

Add function vop_rename_fail(9) that performs needed cleanup for locks
and references of the VOP_RENAME(9) arguments. Use vop_rename_fail()
in deadfs_rename().

Tested by: Mikolaj Golub
MFC after: 1 week


# 202528 17-Jan-2010 kib

Add new function vunref(9) that decrements vnode use count (and hold
count) while vnode is exclusively locked.

The code for vput(9), vrele(9) and vunref(9) is merged.

In collaboration with: pho
Reviewed by: alc
MFC after: 3 weeks


# 201134 28-Dec-2009 kib

Add a knob to allow reclaim of the directory vnodes that are source of
the namecache records. The reclamation is not enabled by default because
for typical workload it would make namecache unusable, but large nested
directory tree easily puts any process that accesses filesystem into 1
second wait for vlru.

Reported by: yar (long time ago)
MFC after: 3 days


# 201019 26-Dec-2009 trasz

Now that all the callers seem to be fixed, add KASSERTs to make sure VAPPEND
is not being used improperly.


# 200770 21-Dec-2009 kib

VI_OBJDIRTY vnode flag mirrors the state of OBJ_MIGHTBEDIRTY vm object
flag. Besides providing the redundand information, need to update both
vnode and object flags causes more acquisition of vnode interlock.
OBJ_MIGHTBEDIRTY is only checked for vnode-backed vm objects.

Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for
vnode-backed vm objects.

Suggested and reviewed by: alc
Tested by: pho
MFC after: 3 weeks


# 199529 19-Nov-2009 jh

Extend ddb(4) "show mount" command to print active string mount options.
Note that only option names are printed, not values.

Reviewed by: pjd
Approved by: trasz (mentor)
MFC after: 2 weeks


# 197680 01-Oct-2009 trasz

Provide default implementation for VOP_ACCESS(9), so that filesystems which
want to provide VOP_ACCESSX(9) don't have to implement both. Note that
this commit makes implementation of either of these two mandatory.

Reviewed by: kib


# 197134 12-Sep-2009 rwatson

Use C99 initialization for struct filterops.

Obtained from: Mac OS X
Sponsored by: Apple Inc.
MFC after: 3 weeks


# 197030 09-Sep-2009 kib

In vfs_mark_atime(9), be resistent against reclaimed vnodes.
Assert that neccessary locks are taken, since vop might not be called.

Tested by: pho
MFC after: 3 days


# 195285 02-Jul-2009 jamie

Call prison_check from vfs_suser rather than re-implementing it.

Approved by: re (kib), bz (mentor)


# 193951 10-Jun-2009 kib

Adapt vfs kqfilter to the shared vnode lock used by zfs write vop. Use
vnode interlock to protect the knote fields [1]. The locking assumes
that shared vnode lock is held, thus we get exclusive access to knote
either by exclusive vnode lock protection, or by shared vnode lock +
vnode interlock.

Do not use kl_locked() method to assert either lock ownership or the
fact that curthread does not own the lock. For shared locks, ownership
is not recorded, e.g. VOP_ISLOCKED can return LK_SHARED for the shared
lock not owned by curthread, causing false positives in kqueue subsystem
assertions about knlist lock.

Remove kl_locked method from knlist lock vector, and add two separate
assertion methods kl_assert_locked and kl_assert_unlocked, that are
supposed to use proper asserts. Change knlist_init accordingly.

Add convenience function knlist_init_mtx to reduce number of arguments
for typical knlist initialization.

Submitted by: jhb [1]
Noted by: jhb [2]
Reviewed by: jhb
Tested by: rnoland


# 193511 05-Jun-2009 rwatson

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


# 193138 31-May-2009 attilio

Remove the now invalid (and possibly unused) debug.mpsafevfs
sysctl/tunable.

Reviewed by: emaste
Sponsored by: Sandvine Incorporated


# 193092 30-May-2009 trasz

Add VOP_ACCESSX, which can be used to query for newly added V*
permissions, such as VWRITE_ACL. For a filsystems that don't
implement it, there is a default implementation, which works
as a wrapper around VOP_ACCESS.

Reviewed by: rwatson@


# 192895 27-May-2009 jamie

Add hierarchical jails. A jail may further virtualize its environment
by creating a child jail, which is visible to that jail and to any
parent jails. Child jails may be restricted more than their parents,
but never less. Jail names reflect this hierarchy, being MIB-style
dot-separated strings.

Every thread now points to a jail, the default being prison0, which
contains information about the physical system. Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().

Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings. The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.

Approved by: bz (mentor)


# 191990 11-May-2009 attilio

Remove the thread argument from the FSD (File-System Dependent) parts of
the VFS. Now all the VFS_* functions and relating parts don't want the
context as long as it always refers to curthread.

In some points, in particular when dealing with VOPs and functions living
in the same namespace (eg. vflush) which still need to be converted,
pass curthread explicitly in order to retain the old behaviour.
Such loose ends will be fixed ASAP.

While here fix a bug: now, UFS_EXTATTR can be compiled alone without the
UFS_EXTATTR_AUTOSTART option.

VFS KPI is heavilly changed by this commit so thirdy parts modules needs
to be recompiled. Bump __FreeBSD_version in order to signal such
situation.


# 190533 29-Mar-2009 kan

Replace v_dd vnode pointer with v_cache_dd pointer to struct namecache
in directory vnodes. Allow namecache dotdot entry to be created pointing
from child vnode to parent vnode if no existing links in opposite
direction exist. Use direct link from parent to child for dotdot lookups
otherwise.

This restores more efficient dotdot caching in NFS filesystems which
was lost when vnodes stoppped being type stable.

Reviewed by: kib


# 189287 02-Mar-2009 kan

Change vfs_busy to wait until an outcome of pending unmount
operation is known and to retry or fail accordingly to that
outcome. This fixes the problem with namespace traversing
programs failing with random ENOENT errors if someone just
happened to try to unmount that same filesystem at the same
time.

Reported by: dhw
Reviewed by: kib, attilio
Sponsored by: Juniper Networks, Inc.


# 188244 06-Feb-2009 jhb

Tweak the output of VOP_PRINT/vn_printf() some.
- Align the fifo output in fifo_print() with other vn_printf() output.
- Remove the leading space from lockmgr_printinfo() so its output lines up
in vn_printf().
- lockmgr_printinfo() now ends with a newline, so remove an extra newline
from vn_printf().


# 188243 06-Feb-2009 trasz

Add KASSERTs to make it easier to debug problems like the one fixed
in r188141.

Reviewed by: kib,attilio
Approved by: rwatson (mentor)
Tested by: pho
Sponsored by: FreeBSD Foundation


# 188150 05-Feb-2009 attilio

Add more KTR_VFS logging point in order to have a more effective tracing.

Reviewed by: brueffer, kib
Tested by: Gianni Trematerra <giovanni D trematerra A gmail D com>


# 187654 23-Jan-2009 jhb

Tweak the wording for vfs_mark_atime() since the I/O it is avoiding by not
updating va_atime via VOP_SETATTR() isn't always synchronous. For some
filesystems it is asynchronous.

Suggested by: bde


# 187653 23-Jan-2009 jhb

Push down Giant in the vlnru kproc main loop so that it is only acquired
around calls to vlrureclaim() on non-MPSAFE filesystems. Specifically,
vnlru no longer needs Giant for the common case of waking up and deciding
there is nothing for it to do.

MFC after: 2 weeks


# 187564 21-Jan-2009 jhb

Fix a few style bogons.

Submitted by: bde


# 187526 21-Jan-2009 jhb

Move the VA_MARKATIME flag for VOP_SETATTR() out into its own VOP:
VOP_MARKATIME() since unlike the rest of VOP_SETATTR(), VA_MARKATIME
can be performed while holding a shared vnode lock (the same functionality
is done internally by VOP_READ which can run with a shared vnode lock).
Add missing locking of the vnode interlock to the ufs implementation and
remove a special note and test from the NFS client about not supporting the
feature.

Inspired by: ups
Tested by: pho


# 187467 20-Jan-2009 kib

FFS puts the extended attributes blocks at the negative blocks for the
vnode, from -1 down. When vinvalbuf(vp, V_ALT) is done for the vnode, it
incorrectly does vm_object_page_remove(0, 0), removing all pages from
the underlying vm object, not only the pages that back the extended
attributes data.

Change vinvalbuf() to not remove any pages from the object when
V_NORMAL or V_ALT are specified. Instead, the only in-tree caller
in ffs_inode.c:ffs_truncate() that specifies V_ALT explicitely
removes the corresponding page range. The V_NORMAL caller
does vnode_pager_setsize(vp, 0) immediately after the call to
vinvalbuf(V_NORMAL) already.

Reported by: csjp
Reviewed by: ups
MFC after: 3 weeks


# 186197 16-Dec-2008 attilio

1) Fix a deadlock in the VFS:
- threadA runs vfs_rel(mp1)
- threadB does unmount the mp1 fs, sets MNTK_UNMOUNT and drop MNT_ILOCK()
- threadA runs vfs_busy(mp1) and, as long as, MNTK_UNMOUNT is set, sleeps
waiting for threadB to complete the unmount
- threadB, in vfs_mount_destroy(), finds mnt_lock > 0 and sleeps waiting
for the refcount to expire.

Fix the deadlock by adding a flag called MNTK_REFEXPIRE which signals the
unmounter is waiting for mnt_ref to expire.
The vfs_busy contenders got awake, fails, and if they retry the
MNTK_REFEXPIRE won't allow them to sleep again.

2) Simplify significantly the code of vfs_mount_destroy() trimming
unnecessary codes:
- as long as any reference exited, it is no-more possible to have
write-op (primarty and secondary) in progress.
- it is no needed to drop and reacquire the mount lock.
- filling the structures with dummy values is unuseful as long as
it is going to be freed.

Tested by: pho, Andrea Barberio <insomniac at slackware dot it>
Discussed with: kib


# 185432 29-Nov-2008 kib

In the nfsrv_fhtovp(), after the vfs_getvfs() function found the pointer
to the fs, but before a vnode on the fs is locked, unmount may free fs
structures, causing access to destroyed data and freed memory.

Introduce a vfs_busymp() function that looks up and busies found
fs while mountlist_mtx is held. Use it in nfsrv_fhtovp() and in the
implementation of the handle syscalls.

Two other uses of the vfs_getvfs() in the vfs_subr.c, namely in
sysctl_vfs_ctl and vfs_getnewfsid seems to be ok. In particular,
sysctl_vfs_ctl is protected by Giant by being a non-sleeping sysctl
handler, that prevents Giant-locked unmount code to interfere with it.

Noted by: tegge
Reviewed by: dfr
Tested by: pho
MFC after: 1 month


# 185029 17-Nov-2008 pjd

Update ZFS from version 6 to 13 and bring some FreeBSD-specific changes.

This bring huge amount of changes, I'll enumerate only user-visible changes:

- Delegated Administration

Allows regular users to perform ZFS operations, like file system
creation, snapshot creation, etc.

- L2ARC

Level 2 cache for ZFS - allows to use additional disks for cache.
Huge performance improvements mostly for random read of mostly
static content.

- slog

Allow to use additional disks for ZFS Intent Log to speed up
operations like fsync(2).

- vfs.zfs.super_owner

Allows regular users to perform privileged operations on files stored
on ZFS file systems owned by him. Very careful with this one.

- chflags(2)

Not all the flags are supported. This still needs work.

- ZFSBoot

Support to boot off of ZFS pool. Not finished, AFAIK.

Submitted by: dfr

- Snapshot properties

- New failure modes

Before if write requested failed, system paniced. Now one
can select from one of three failure modes:
- panic - panic on write error
- wait - wait for disk to reappear
- continue - serve read requests if possible, block write requests

- Refquota, refreservation properties

Just quota and reservation properties, but don't count space consumed
by children file systems, clones and snapshots.

- Sparse volumes

ZVOLs that don't reserve space in the pool.

- External attributes

Compatible with extattr(2).

- NFSv4-ACLs

Not sure about the status, might not be complete yet.

Submitted by: trasz

- Creation-time properties

- Regression tests for zpool(8) command.

Obtained from: OpenSolaris


# 184599 03-Nov-2008 attilio

Remove the mnt_holdcnt and mnt_holdcntwaiters because they are useless.
Really, the concept of holdcnt in the struct mount is rappresented by
the mnt_ref (which prevents the type-stable structure from being
"recycled) handled through vfs_ref() and vfs_rel().
On this optic, switch the holdcnt acquisition into an emulated vfs_ref()
(and subsequent release into vfs_rel()).

Discussed with: kib
Tested by: pho


# 184554 02-Nov-2008 attilio

Improve VFS locking:
- Implement real draining for vfs consumers by not relying on the
mnt_lock and using instead a refcount in order to keep track of lock
requesters.
- Due to the change above, remove the mnt_lock lockmgr because it is now
useless.
- Due to the change above, vfs_busy() is no more linked to a lockmgr.
Change so its KPI by removing the interlock argument and defining 2 new
flags for it: MBF_NOWAIT which basically replaces the LK_NOWAIT of the
old version (which was unlinked from the lockmgr alredy) and
MBF_MNTLSTLOCK which provides the ability to drop the mountlist_mtx
once the mnt interlock is held (ability still desired by most consumers).
- The stub used into vfs_mount_destroy(), that allows to override the
mnt_ref if running for more than 3 seconds, make it totally useless.
Remove it as it was thought to work into older versions.
If a problem of "refcount held never going away" should appear, we will
need to fix properly instead than trust on such hackish solution.
- Fix a bug where returning (with an error) from dounmount() was still
leaving the MNTK_MWAIT flag on even if it the waiters were actually
woken up. Just a place in vfs_mount_destroy() is left because it is
going to recycle the structure in any case, so it doesn't matter.
- Remove the markercnt refcount as it is useless.

This patch modifies VFS ABI and breaks KPI for vfs_busy() so manpages and
__FreeBSD_version will be modified accordingly.

Discussed with: kib
Tested by: pho


# 184413 28-Oct-2008 trasz

Introduce accmode_t. This is required for NFSv4 ACLs - it will be neccessary
to add more V* constants, and the variables changed by this patch were often
being assigned to mode_t variables, which is 16 bit.

Approved by: rwatson (mentor)


# 184411 28-Oct-2008 kib

Style return statements in vn_pollrecord().


# 184409 28-Oct-2008 kib

Protect check for v_pollinfo == NULL and assignment of the newly allocated
vpollinfo with vnode interlock. Fully initialize vpollinfo before putting
pointer to it into vp->v_pollinfo.

Discussed with: dwhite
Tested by: pho
MFC after: 1 week


# 184073 20-Oct-2008 kib

In vfs_busy(), lockmgr() cannot legitimately sleep, because code checked
MNTK_UNMOUNT before, and mnt_mtx is used as interlock. vfs_busy() always
tries to obtain a shared lock on mnt_lock, the other user is unmount who
tries to drain it, setting MNTK_UNMOUNT before.

Reviewed by: tegge, attilio
Tested by: pho
MFC after: 2 weeks


# 183754 10-Oct-2008 attilio

Remove the struct thread unuseful argument from bufobj interface.
In particular following functions KPI results modified:
- bufobj_invalbuf()
- bufsync()

and BO_SYNC() "virtual method" of the buffer objects set.
Main consumers of bufobj functions are affected by this change too and,
in particular, functions which changed their KPI are:
- vinvalbuf()
- g_vfs_close()

Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.

As a side note, please consider just temporary the 'curthread' argument
passing to VOP_SYNC() (in bufsync()) as it will be axed out ASAP

Reviewed by: kib
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


# 182542 31-Aug-2008 attilio

Decontextualize vfs_busy(), vfs_unbusy() and vfs_mount_alloc() functions.

Manpages are updated accordingly.

Tested by: Diego Sardina <siarodx at gmail dot com>


# 182371 28-Aug-2008 attilio

Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread
was always curthread and totally unuseful.

Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


# 182364 28-Aug-2008 kib

Introduce the VV_FORCEINSMQ vnode flag. It instructs the insmnque() function
to ignore the unmounting and forces insertion of the vnode into the mount
vnode list.

Change insmntque() to fail when forced unmount is in progress and
VV_FORCEINSMQ is not specified.

Add an assertion to the insmntque(), requiring the vnode to be
exclusively locked for mp-safe filesystems.

Use the VV_FORCEINSMQ for the creation of the syncvnode.

Tested by: pho
Reviewed by: tegge
MFC after: 1 month


# 182120 24-Aug-2008 csjp

Remove worrying printf warning on bootup when processing vnodes which
have NULL mount-points. This is the case for special vnodes, such as the
one used in nameiinit() which is used for crossing mount points in lookup()
to avoid lock ordering issues.

MFC after: 2 weeks
Discussed with: rwatson, kib


# 180995 30-Jul-2008 ed

Remove the use of lbolt from the VFS syncer.

It seems we only use `lbolt' inside the VFS syncer and the TTY layer
now. Because I'm planning to replace the TTY layer next month, there's
no reason to keep `lbolt' if it's only used in a single thread inside
the kernel.

Because the syncer code wanted to wake up the syncer thread before the
timeout, it called sleepq_remove(). Because we now just use a condvar(9)
with a timeout value of `hz', we can wake it up using cv_broadcast()
without waking up any unrelated threads.

Reviewed by: phk


# 180844 27-Jul-2008 pjd

Assert for exclusive vnode lock in vinactive(), vrecycle() and vgonel()
functions.

Reviewed by: kib


# 180843 27-Jul-2008 pjd

- Move vp test for beeing NULL under IGNORE_LOCK().
- Check if panicstr isn't set, if it is ignore the lock. This helps to avoid
confusion, because lockmgr is a no-op when panicstr isn't NULL, so
asserting anything at this point doesn't make sense and can just race with
other panic.

Discussed with: kib


# 180682 21-Jul-2008 attilio

- Disallow XFS mounting in write mode. The write support never worked really
and there is no need to maintain it.
- Fix vn_get() in order to let it call vget(9) with a valid locking
request. vget(9) returns the vnode locked in order to prevent recycling,
but in this case internal XFS locks alredy prevent it from happening, so
it is safe to drop the vnode lock before to return by vn_get().
- Add a VNASSERT() in vget(9) in order to catch malformed locking requests.

Discussed with: kan, kib
Tested by: Lothar Braun <lothar at lobraun dot de>


# 179093 18-May-2008 pjd

Be more friendly for DDB pager.

Educated by: jhb's BSDCan presentation


# 178761 04-May-2008 attilio

sync_vnode() has some messy code about locking in order to deal with
mount fs needing Giant to be held when processing bufobjs.
Use a different subqueue for pending workitems on filesystems requiring
Giant. This simplifies the code notably and also reduces the number of
Giant acquisitions (and the whole processing cost).

Suggested by: jeff
Reviewed by: kib
Tested by: pho


# 178585 26-Apr-2008 pjd

Implement 'show mount' command in DDB. Without argument, it prints short
info about all currently mounted file systems. When an address is given
as an argument, prints detailed info about the given mount point.

MFC after: 2 weeks


# 178458 24-Apr-2008 kib

Allow the vnode zone to return the unused memory. The vnode reference
count is/shall be properly maintained for the long time, and VFS
shall be safe against the vnode memory reclamation.

Proposed by: jeff
Tested by: pho


# 178243 16-Apr-2008 kib

Move the head of byte-level advisory lock list from the
filesystem-specific vnode data to the struct vnode. Provide the
default implementation for the vop_advlock and vop_advlockasync.
Purge the locks on the vnode reclaim by using the lf_purgelocks().
The default implementation is augmented for the nfs and smbfs.
In the nfs_advlock, push the Giant inside the nfs_dolock.

Before the change, the vop_advlock and vop_advlockasync have taken the
unlocked vnode and dereferenced the fs-private inode data, racing with
with the vnode reclamation due to forced unmount. Now, the vop_getattr
under the shared vnode lock is used to obtain the inode size, and
later, in the lf_advlockasync, after locking the vnode interlock, the
VI_DOOMED flag is checked to prevent an operation on the doomed vnode.

The implementation of the lf_purgelocks() is submitted by dfr.

Reported by: kris
Tested by: kris, pho
Discussed with: jeff, dfr
MFC after: 2 weeks


# 177857 02-Apr-2008 jeff

- Destroy the bo mtx when the vnode is destroyed.


# 177687 28-Mar-2008 attilio

b_waiters cannot be adequately protected by the interlock because it is
dropped after the call to lockmgr() so just revert this approach using
something similar to the precedent one:
BUF_LOCKWAITERS() just checks if there are waiters (not the actual number
of them) and it is based on newly introduced lockmgr_waiters() which
returns if the lockmgr has waiters or not. The name has been choosen
differently by old lockwaiters() in order to not confuse them.

KPI results enriched by this commit so __FreeBSD_version bumping and
manpage update will be happening soon.
'struct buf' also changes, so kernel ABI is disturbed.

Bug found by: jeff
Approved by: jeff, kib


# 177539 24-Mar-2008 jeff

- Greatly simplify vget() by removing the guarantee that any new
references to a vnode with VI_OWEINACT set will force the vinactive()
call. The kernel makes no guarantees about which reference was the
last to close a file or when the actual inactive processing will
happen. The previous code was designed to preserve existing semantics
in the face of shared locks, however, this was unnecessary.

Discussed with: mckusick


# 177513 23-Mar-2008 jeff

- Only return 1 from sync_vnode() in cases where the vnode is still
at the head of the sync list. This prevents sched_sync() from
re-queueing a vnode which may have been freed already.

Discussed with: kib


# 177511 23-Mar-2008 jeff

- Pass BO_MTX(bo) to lockmgr in vtruncbuf, we don't own the vnode
interlock here anymore.

Reported by: kris


# 177493 22-Mar-2008 jeff

- Complete part of the unfinished bufobj work by consistently using
BO_LOCK/UNLOCK/MTX when manipulating the bufobj.
- Create a new lock in the bufobj to lock bufobj fields independently.
This leaves the vnode interlock as an 'identity' lock while the bufobj
is an io lock. The bufobj lock is ordered before the vnode interlock
and also before the mnt ilock.
- Exploit this new lock order to simplify softdep_check_suspend().
- A few sync related functions are marked with a new XXX to note that
we may not properly interlock against a non-zero bv_cnt when
attempting to sync all vnodes on a mountlist. I do not believe this
race is important. If I'm wrong this will make these locations easier
to find.

Reviewed by: kib (earlier diff)
Tested by: kris, pho (earlier diff)


# 177253 16-Mar-2008 rwatson

In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after: 1 month
Discussed with: imp, rink


# 176708 01-Mar-2008 attilio

- Handle buffer lock waiters count directly in the buffer cache instead
than rely on the lockmgr support [1]:
* bump the waiters only if the interlock is held
* let brelvp() return the waiters count
* rely on brelvp() instead than BUF_LOCKWAITERS() in order to check
for the waiters number
- Remove a namespace pollution introduced recently with lockmgr.h
including lock.h by including lock.h directly in the consumers and
making it mandatory for using lockmgr.
- Modify flags accepted by lockinit():
* introduce LK_NOPROFILE which disables lock profiling for the
specified lockmgr
* introduce LK_QUIET which disables ktr tracing for the specified
lockmgr [2]
* disallow LK_SLEEPFAIL and LK_NOWAIT to be passed there so that it
can only be used on a per-instance basis
- Remove BUF_LOCKWAITERS() and lockwaiters() as they are no longer
used

This patch breaks KPI so __FreBSD_version will be bumped and manpages
updated by further commits. Additively, 'struct buf' changes results in
a disturbed ABI also.

[2] Really, currently there is no ktr tracing in the lockmgr, but it
will be added soon.

[1] Submitted by: kib
Tested by: pho, Andrea Barberio <insomniac at slackware dot it>


# 176559 25-Feb-2008 attilio

Axe the 'thread' argument from VOP_ISLOCKED() and lockstatus() as it is
always curthread.

As KPI gets broken by this patch, manpages and __FreeBSD_version will be
updated by further commits.

Tested by: Andrea Barberio <insomniac at slackware dot it>


# 176116 08-Feb-2008 attilio

Conver all explicit instances to VOP_ISLOCKED(arg, NULL) into
VOP_ISLOCKED(arg, curthread). Now, VOP_ISLOCKED() and lockstatus() should
only acquire curthread as argument; this will lead in axing the additional
argument from both functions, making the code cleaner.

Reviewed by: jeff, kib


# 175635 24-Jan-2008 attilio

Cleanup lockmgr interface and exported KPI:
- Remove the "thread" argument from the lockmgr() function as it is
always curthread now
- Axe lockcount() function as it is no longer used
- Axe LOCKMGR_ASSERT() as it is bogus really and no currently used.
Hopefully this will be soonly replaced by something suitable for it.
- Remove the prototype for dumplockinfo() as the function is no longer
present

Addictionally:
- Introduce a KASSERT() in lockstatus() in order to let it accept only
curthread or NULL as they should only be passed
- Do a little bit of style(9) cleanup on lockmgr.h

KPI results heavilly broken by this change, so manpages and
FreeBSD_version will be modified accordingly by further commits.

Tested by: matteo


# 175486 19-Jan-2008 attilio

- Introduce the function lockmgr_recursed() which returns true if the
lockmgr lkp, when held in exclusive mode, is recursed
- Introduce the function BUF_RECURSED() which does the same for bufobj
locks based on the top of lockmgr_recursed()
- Introduce the function BUF_ISLOCKED() which works like the counterpart
VOP_ISLOCKED(9), showing the state of lockmgr linked with the bufobj

BUF_RECURSED() and BUF_ISLOCKED() entirely replace the usage of bogus
BUF_REFCNT() in a more explicative and SMP-compliant way.
This allows us to axe out BUF_REFCNT() and leaving the function
lockcount() totally unused in our stock kernel. Further commits will
axe lockcount() as well as part of lockmgr() cleanup.

KPI results, obviously, broken so further commits will update manpages
and freebsd version.

Tested by: kris (on UFS and NFS)


# 175294 13-Jan-2008 attilio

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


# 175202 10-Jan-2008 attilio

vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


# 174952 28-Dec-2007 rwatson

In "show lockedvnods" DDB command, use db_printf() rather than printf()
so that the results end up in the DDB output stream rather than the
console output stream.

This should likely also be done for the vprint() function it calls.

MFC after: 3 months


# 174943 27-Dec-2007 attilio

As LK_EXCLUPGRADE is used in conjuction with LK_NOWAIT, LK_UPGRADE becames
equivalent with this and so operate the switch.

That call is the only one remaining LK_EXCLUPGRADE consumer and removing
it will prepare the ground for LK_EXCLUPGRADE axing and further
lockmgr improvements.

Discussed with: jeff, ups


# 174898 25-Dec-2007 rwatson

Add a new 'why' argument to kdb_enter(), and a set of constants to use
for that argument. This will allow DDB to detect the broad category of
reason why the debugger has been entered, which it can use for the
purposes of deciding which DDB script to run.

Assign approximate why values to all current consumers of the
kdb_enter() interface.


# 174284 05-Dec-2007 kib

Use curthread instead of the FIRST_THREAD_IN_PROC for vnlru and syncer,
when applicable.

Aquire Giant slightly later for vnlru.

In the syncer, aquire the Giant only when a vnode belongs to the
non-MPsafe fs.

In both speedup_syncer() and syncer_shutdown(), remove the syncer thread from
the lbolt sleep queue after the syncer state is modified, not before.

Herded by: attilio
Tested by: Peter Holm
Reviewed by: ups
MFC after: 1 week


# 172930 24-Oct-2007 rwatson

Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

mac_<object>_<method/action>
mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer


# 172836 20-Oct-2007 julian

Rename the kthread_xxx (e.g. kthread_create()) calls
to kproc_xxx as they actually make whole processes.
Thos makes way for us to add REAL kthread_create() and friends
that actually make theads. it turns out that most of these
calls actually end up being moved back to the thread version
when it's added. but we need to make this cosmetic change first.

I'd LOVE to do this rename in 7.0 so that we can eventually MFC the
new kthread_xxx() calls.


# 172151 12-Sep-2007 kib

When restoring the mount after umount failed, the MNTK_UNMOUNT flag
prevents insmntque() from placing reallocated syncer vnode on mount
list, that causes panic in vfs_allocate_syncvnode().

Introduce MNTK_NOINSMNTQ flag, that marks the period when instmntque is
not allowed to success, instead of MNTK_UNMOUNT. The MNTK_NOINSMNTQ is
set and cleared simultaneously with MNTK_UNMOUNT, except on umount error
path, where it is cleaned just before the syncer vnode is going to be
allocated.

Reported by: Peter Jeremy <peterjeremy optushome com au>
Suggested by: tegge
Approved by: re (rwatson)


# 171823 13-Aug-2007 pjd

Improve vn_printf() by:
- adding missing vnode flags,
- printing unknown flags as numbers,
- using strlcat() instead of strcat().

Approved by: re (bmah)


# 170587 12-Jun-2007 rwatson

Eliminate now-unused SUSER_ALLOWJAIL arguments to priv_check_cred(); in
some cases, move to priv_check() if it was an operation on a thread and
no other flags were present.

Eliminate caller-side jail exception checking (also now-unused); jail
privilege exception code now goes solely in kern_jail.c.

We can't yet eliminate suser() due to some cases in the KAME code where
a privilege check is performed and then used in many different deferred
paths. Do, however, move those prototypes to priv.h.

Reviewed by: csjp
Obtained from: TrustedBSD Project


# 170170 31-May-2007 attilio

Revert VMCNT_* operations introduction.
Probabilly, a general approach is not the better solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)


# 170035 27-May-2007 rwatson

Universally adopt most conventional spelling of acquire.


# 169671 18-May-2007 kib

Since renaming of vop_lock to _vop_lock, pre- and post-condition
function calls are no more generated for vop_lock.
Rename _vop_lock to vop_lock1 to satisfy tools/vnode_if.awk assumption
about vop naming conventions. This restores pre/post-condition calls.


# 169667 18-May-2007 jeff

- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts. This can be used to abstract away pcpu details but also changes
to use atomics for all counters now. This means sched lock is no longer
responsible for protecting counts in the switch routines.

Contributed by: Attilio Rao <attilio@FreeBSD.org>


# 168699 13-Apr-2007 pjd

Fix jails and jail-friendly file systems handling:
- We need to allow for PRIV_VFS_MOUNT_OWNER inside a jail.
- Move security checks to vfs_suser() and deny unmounting and updating
for jailed root from different jails, etc.

OK'ed by: rwatson


# 168682 13-Apr-2007 pjd

When we are running low on vnodes, there is currently no way to ask other
subsystems to release some vnodes. Implement backpressure based on
vfs_lowvnodes event (similar to vm_lowmem for memory).


# 168587 10-Apr-2007 pjd

Minor style cleanups (mostly removal of trailing whitespaces).


# 168585 10-Apr-2007 pjd

Correct typos.


# 168204 01-Apr-2007 pjd

Now that the vdropl() function is public, assert that the vnode interlock
is held.


# 168192 31-Mar-2007 des

Make vdropl() public; zfs needs it. There is also plenty of existing
file system code (mostly *_reclaim()) which look like this:

VOP_LOCK(vp);
/* examine vp */
VOP_UNLOCK(vp);
vdrop(vp);

This can now be rewritten to:

VOP_LOCK(vp);
/* examine vp */
vdropl(vp); /* will unlock vp */

MFC after: 1 week


# 167933 27-Mar-2007 marcel

PowerPC is the only architecture with mpsafe_vfs=0. This is now
broken. Rudimentary tests show that PowerPC can run with
mpsafe_vfs=1. Make it so...


# 167497 13-Mar-2007 tegge

Make insmntque() externally visibile and allow it to fail (e.g. during
late stages of unmount). On failure, the vnode is recycled.

Add insmntque1(), to allow for file system specific cleanup when
recycling vnode on failure.

Change getnewvnode() to no longer call insmntque(). Previously,
embryonic vnodes were put onto the list of vnode belonging to a file
system, which is unsafe for a file system marked MPSAFE.

Change vfs_hash_insert() to no longer lock the vnode. The caller now
has that responsibility.

Change most file systems to lock the vnode and call insmntque() or
insmntque1() after a new vnode has been sufficiently setup. Handle
failed insmntque*() calls by propagating errors to callers, possibly
after some file system specific cleanup.

Approved by: re (kensmith)
Reviewed by: kib
In collaboration with: kib


# 164248 13-Nov-2006 kmacy

change vop_lock handling to allowing tracking of callers' file and line for
acquisition of lockmgr locks

Approved by: scottl (standing in for mentor rwatson)


# 164073 07-Nov-2006 jhb

Simplify operations with sync_mtx in sched_sync():
- Don't drop the lock just to reacquire it again to check rushjob, this
only wastes time.
- Use msleep() to drop the mutex while sleeping instead of explicitly
unlocking around tsleep.

Reviewed by: pjd


# 164070 07-Nov-2006 jhb

Fix comment typo and function declaration.


# 164033 06-Nov-2006 rwatson

Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges. These may
require some future tweaking.

Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>


# 163988 04-Nov-2006 pjd

Typo, 'from' vnode is locked here, not 'to' vnode.


# 163841 31-Oct-2006 pjd

Add gjournal specific code to the UFS file system:
- Add FS_GJOURNAL flag which enables gjournal support on a file system.
- Add cg_unrefs field to the cylinder group structure which holds
number of unreferenced (orphaned) inodes in the given cylinder group.
- Add fs_unrefs field to the super block structure which holds
total number of unreferenced (orphaned) inodes.
- When file or a directory is orphaned (last reference is removed, but
object is still open), increase fs_unrefs and cg_unrefs fields,
which is a hint for fsck in which cylinder groups looks for such
(orphaned) objects.
- When file is last closed, decrease {fs,cg}_unrefs fields.
- Add VV_DELETED vnode flag which points at orphaned objects.

Sponsored by: home.pl


# 163606 22-Oct-2006 rwatson

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


# 162945 02-Oct-2006 kib

Correct the comment: numvnodes is decreased on vdestroying the vnode.

OKed by: tegge
Approved by: pjd (mentor)
MFC after: 1 week


# 162649 26-Sep-2006 tegge

Add mnt_noasync counter to better handle interleaved calls to nmount(),
sync() and sync_fsync() without losing MNT_ASYNC. Add MNTK_ASYNC flag
which is set only when MNT_ASYNC is set and mnt_noasync is zero, and
check that flag instead of MNT_ASYNC before initiating async io.


# 162647 26-Sep-2006 tegge

Use mount interlock to protect all changes to mnt_flag and mnt_kern_flag.
This eliminates a race where MNT_UPDATE flag could be lost when nmount()
raced against sync(), sync_fsync() or quotactl().


# 162024 04-Sep-2006 pjd

Add 'show vnode <addr>' DDB command.


# 161160 10-Aug-2006 pjd

getnewvnode() can be called with NULL mp.

Found by: Coverity Prevent (tm)
Coverity ID: 1521
Confirmed by: phk


# 161122 09-Aug-2006 pjd

Add a bandaid to avoid a deadlock in a situation, when we are trying to suspend
a file system, but need to obtain a vnode. We may not be able to do it, because
all vnodes could be already in use and other processes cannot release them,
because they are waiting in "suspfs" state.

In such situation, we allow to allocate a vnode anyway.

This is a temporary fix - there is no backpressure to free vnodes allocated in
those circumstances.

MFC after: 1 week
Reviewed by: tegge


# 161020 06-Aug-2006 rwatson

Improve commenting of vaccess(), making sure to be clear that the ifdef
capabilities code is there for reference and never actually used. Slight
style tweak.


# 160378 15-Jul-2006 alc

Enable debug.mpsafevfs by default on arm. Since every architecture except
powerpc has debug.mpsafevfs enabled by default, it is shorter to enumerate
the architectures on which debug.mpsafevfs is off.

Tested by: cognet@


# 160112 05-Jul-2006 kib

Back out my rev. 1.674. The better fix (rev. 1.637) is already in tree.

Approved by: kan (mentor)


# 159964 26-Jun-2006 babkin

Backed out the change by request from rwatson.

PR: kern/14584


# 159927 25-Jun-2006 babkin

The common UID/GID space implementation. It has been discussed on -arch
in 1999, and there are changes to the sysctl names compared to PR,
according to that discussion. The description is in sys/conf/NOTES.
Lines in the GENERIC files are added in commented-out form.
I'll attach the test script I've used to PR.

PR: kern/14584
Submitted by: babkin


# 159392 08-Jun-2006 kib

Fix the LOR that occurs when the MAC compiled into the kernel
and vnode is destroyed.

Reviewed by: rwatson
LOR: 189
MFC after: 2 weeks
Approved by: kan (mentor)


# 158906 25-May-2006 ups

Do not set B_NOCACHE on buffers when releasing them in flushbuflist().
If B_NOCACHE is set the pages of vm backed buffers will be invalidated.
However clean buffers can be backed by dirty VM pages so invalidating them
can lead to data loss.
Add support for flush dirty page in the data invalidation function
of some network file systems.

This fixes data losses during vnode recycling (and other code paths
using invalbuf(*,V_SAVE,*,*)) for data written using an mmaped file.

Collaborative effort by: jhb@,mohans@,peter@,ps@,ups@
Reviewed by: tegge@
MFC after: 7 days


# 158471 12-May-2006 jhb

Remove various bits of conditional Alpha code and fixup a few comments.


# 158151 29-Apr-2006 pjd

vn_start_write()/vn_finished_write() is not needed here, because
vn_start_write() is always called earlier in the code path and calling
the function recursively may lead to a deadlock.

Confirmed by: tegge
MFC after: 2 weeks


# 158095 28-Apr-2006 jeff

- Add a BO_NEEDSGIANT flag to the bufobj. This flag forces all child
buffers to go on the buf daemon's DIRTYGIANT queue.
- Set BO_NEEDSGIANT on ffs's devvp since the ffs_copyonwrite handler
runs in the context of the buf daemon and may require Giant.


# 157470 04-Apr-2006 jeff

- VFS_LOCK_GIANT when recycling a vnode via getnewvnode. We may be
recycling for an unrelated filesystem. I really don't like potentially
acquiring giant in the context of a giantless filesystem but there
are reasonable objections to removing the recycling from this path.

Sponsored by: Isilon Systems, Inc.


# 157345 31-Mar-2006 jeff

- Add an assert to vgone. It is illegal to call vgone without a reference
to the vnode. Without a reference the vnode will never be vdestroy'd
and the memory will never be reclaimed.

Sponsored by: Isilon Systems, Inc.


# 157324 31-Mar-2006 jeff

- Hold a reference from the time vfs_busy starts until vfs_unbusy is
called.
- vfs_getvfs has to return a reference to prevent the returned mountpoint
from changing identities.
- Release references acquired via vfs_getvfs.

Discussed with: tegge
Tested by: kris
Sponsored by: Isilon Systems, Inc.


# 157319 31-Mar-2006 jeff

- Add the B_NEEDSGIANT flag which is only set if the vnode that owns a buf
requires Giant. It is set in bgetvp and cleared in brelvp.
- Create QUEUE_DIRTY_GIANT for dirty buffers that require giant.
- In the buf daemon, only grab giant when processing QUEUE_DIRTY_GIANT and
only if we think there are buffers in that queue.

Sponsored by: Isilon Systems, Inc.


# 156892 19-Mar-2006 jeff

- Correct an assert in vop_rename_pre. fdvp may be locked if it is either
the target directory or file. This case should fail in the filesystem
anyway and perhaps kern_rename() should catch it.

Sponsored by: Isilon Systems, Inc.


# 156451 08-Mar-2006 tegge

Use vn_start_secondary_write() and vn_finished_secondary_write() as a
replacement for vn_write_suspend_wait() to better account for secondary write
processing.

Close race where secondary writes could be started after ffs_sync() returned
but before the file system was marked as suspended.

Detect if secondary writes or softdep processing occurred during vnode sync
loop in ffs_sync() and retry the loop if needed.


# 156225 02-Mar-2006 tegge

Eliminate a deadlock when creating snapshots. Blocking vn_start_write() must
be called without any vnode locks held. Remove calls to vn_start_write() and
vn_finished_write() in vnode_pager_putpages() and add these calls before the
vnode lock is obtained to most of the callers that don't already have them.


# 156223 02-Mar-2006 tegge

Don't try to show marker nodes.


# 156203 02-Mar-2006 jeff

- Move softdep from using a global worklist to per-mount worklists. This
has many positive effects including improved smp locking, reducing
interdependencies between mounts that can lead to deadlocks, etc.
- Add the softdep worklist and various counters to the ufsmnt structure.
- Add a mount pointer to the workitem and remove mount pointers from the
various structures derived from the workitem as they are now redundant.
- Remove the poor-man's semaphore protecting softdep_process_worklist and
softdep_flushworklist. Several threads may now process the list
simultaneously.
- Add softdep_waitidle() to block the thread until all pending
dependencies being operated on by other threads have been flushed.
- Use softdep_waitidle() in unmount and snapshots to block either
operation until the fs is stable.
- Remove softdep worklist processing from the syncer and move it into the
softdep_flush() thread. This thread processes all softdep mounts
once each second and when it is called via the new softdep_speedup()
when there is a resource shortage. This removes the softdep hook
from the kernel and various hacks in header files to support it.

Reviewed by/Discussed with: tegge, truckman, mckusick
Tested by: kris


# 155938 23-Feb-2006 jeff

- Release the mount ref once the vnode has been recycled rather than once
the last reference is dropped. I forgot that vnodes can stick around
for a very long time until processes discover that they are dead. This
means that a vnode reference is not sufficient to keep the mount
referenced and even more code will be required to ref mount points.

Discovered by: kris


# 155901 22-Feb-2006 jeff

- Grab a mnt ref in vfs_busy() before dropping the interlock. This will
prevent the mount point from going away while we're waiting on the lock.
The ref does not need to persist once we have the lock because the
lock prevents the mount point from being unmounted.

MFC After: 1 week


# 155386 06-Feb-2006 jeff

- Add a ref count to the mount structure. Sleep for up to 3 seconds in
vfs_mount_destroy waiting for this ref to hit 0. We don't print an
error if we are rebooting as the root mount always retains some refernces
by init proc.
- Acquire a mnt ref for every vnode allocated to a mount point. Drop this
ref only once vdestroy() has been called and the mount has been freed.
- No longer NULL the v_mount pointer in delmntque() so that we may release
the ref after vgone() has been called. This allows us to guarantee
that the mount point structure will be valid until the last vnode has
lost its last ref.
- Fix a few places that rely on checking v_mount to detect recycling.

Sponsored by: Isilon Systems, Inc.
MFC After: 1 week


# 155161 01-Feb-2006 jeff

- Solve a race where we could lose a call to VOP_INACTIVE. If vget() waiting
on a lock held the last usecount ref on a vnode and the lock failed we
would not call INACTIVE. Solve this by only holding a holdcnt to prevent
the vnode from disappearing while we wait on vn_lock. Other callers
may now VOP_INACTIVE while we are waiting on the lock, however this race
is acceptable, while losing INACTIVE is not.

Discussed with: kan, pjd
Tested by: kkenn
Sponsored by: Isilon Systems, Inc.
MFC After: 1 week


# 154946 28-Jan-2006 kris

Back out r1.653; it turns out that the race (or at least the printf) is
actually not hard to trigger, and it can cause a lot of console spam.

Approved by: kan


# 154646 21-Jan-2006 rwatson

Convert remaining functions in vfs_subr.c from K&R prototypes to ANSI C
prototypes, as the majority of new functions added have been in this
style. Changing prototype style now results in gcc noticing that the
implementation of vn_pollrecord() has a 'short' argument instead of
'int' as prototyped in vnode.h, so correct that definition. In practice
this didn't matter as only poll flags in the lower 16 bits are used.

MFC after: 1 week


# 154152 09-Jan-2006 tegge

Add marker vnodes to ensure that all vnodes associated with the mount point are
iterated over when using MNT_VNODE_FOREACH.

Reviewed by: truckman


# 153859 29-Dec-2005 pjd

Print a warning when we miss vinactive() call, because of race in vget().
The race is very real, but conditions needed for triggering it are rather
hard to meet now.
When gjournal will be committed (where it is quite easy to trigger) we need
to fix it.

For now, verify if it is really hard to trigger.

Discussed with: kan


# 152254 09-Nov-2005 dwhite

This is a workaround for a complicated issue involving VFS cookies and devfs.
The PR and patch have the details. The ultimate fix requires architectural
changes and clarifications to the VFS API, but this will prevent the system
from panicking when someone does "ls /dev" while running in a shell under the
linuxulator.

This issue affects HEAD and RELENG_6 only.

PR: 88249
Submitted by: "Devon H. O'Dell" <dodell@ixsystems.com>
MFC after: 3 days


# 151897 31-Oct-2005 rwatson

Normalize a significant number of kernel malloc type names:

- Prefer '_' to ' ', as it results in more easily parsed results in
memory monitoring tools such as vmstat.

- Remove punctuation that is incompatible with using memory type names
as file names, such as '/' characters.

- Disambiguate some collisions by adding subsystem prefixes to some
memory types.

- Generally prefer lower case to upper case.

- If the same type is defined in multiple architecture directories,
attempt to use the same name in additional cases.

Not all instances were caught in this change, so more work is required to
finish this conversion. Similar changes are required for UMA zone names.


# 151352 14-Oct-2005 kris

mpsafevm has been stable and defaulted to 1 on sparc64 for over 6 months,
so we are ready for mpsafevfs=1 by default on sparc64 too. I have been
running this on all my sparc64 machines for over 6 months, and have not
encountered MD problems.

MFC after: 1 week


# 151252 12-Oct-2005 dds

Move execve's access time update functionality into a new
vfs_mark_atime() function, and use the new function for
performing efficient atime updates in mmap().

Reviewed by: bde
MFC after: 2 weeks


# 150741 30-Sep-2005 truckman

Un-staticize runningbufwakeup() and staticize updateproc.

Add a new private thread flag to indicate that the thread should
not sleep if runningbufspace is too large.

Set this flag on the bufdaemon and syncer threads so that they skip
the waitrunningbufspace() call in bufwrite() rather than than
checking the proc pointer vs. the known proc pointers for these two
threads. A way of preventing these threads from being starved for
I/O but still placing limits on their outstanding I/O would be
desirable.

Set this flag in ffs_copyonwrite() to prevent bufwrite() calls from
blocking on the runningbufspace check while holding snaplk. This
prevents snaplk from being held for an arbitrarily long period of
time if runningbufspace is high and greatly reduces the contention
for snaplk. The disadvantage is that ffs_copyonwrite() can start
a large amount of I/O if there are a large number of snapshots,
which could cause a deadlock in other parts of the code.

Call runningbufwakeup() in ffs_copyonwrite() to decrement runningbufspace
before attempting to grab snaplk so that I/O requests waiting on
snaplk are not counted in runningbufspace as being in-progress.
Increment runningbufspace again before actually launching the
original I/O request.

Prior to the above two changes, the system could deadlock if enough
I/O requests were blocked by snaplk to prevent runningbufspace from
falling below lorunningspace and one of the bawrite() calls in
ffs_copyonwrite() blocked in waitrunningbufspace() while holding
snaplk.

See <http://www.holm.cc/stress/log/cons143.html>


# 150231 16-Sep-2005 tegge

Break out of loop if next buffer pointer has become invalid while flushing
current buffer.

Reviewed by: kan


# 150062 12-Sep-2005 rwatson

In vfs_kqfilter(), return EINVAL instead of 1 (EPERM) when an unsupported
kqueue filter type is requested on a vnode.

MFC after: 3 days


# 150047 12-Sep-2005 jkim

use monotonic `time_uptime' instead of `time_second'

Approved by: anholt (mentor)
Discussed on: arch


# 150020 12-Sep-2005 phk

Introduce vfs_read_dirent() which can help VOP_READDIR() implementations
by handling all the cookie stuff.


# 149557 28-Aug-2005 ssouhlal

Fix a typo in vop_rename_pre() where we ended up using vholdl()
instead of vhold(), even though the vnode interlock is unlocked.

MFC after: 3 days


# 149385 23-Aug-2005 truckman

Back out the removal of LK_NOWAIT from the VOP_LOCK() call in
vlrureclaim() in vfs_subr.c 1.636 because waiting for the vnode
lock aggravates an existing race condition. It is also undesirable
according to the commit log for 1.631.

Fix the tiny race condition that remains by rechecking the vnode
state after grabbing the vnode lock and grabbing the vnode interlock.

Fix the problem of other threads being starved (which 1.636 attempted
to fix by removing LK_NOWAIT) by calling uio_yield() periodically
in vlrureclaim(). This should be more deterministic than hoping
that VOP_LOCK() without LK_NOWAIT will block, which may not happen
in this loop.

Reviewed by: kan
MFC after: 5 days


# 149340 20-Aug-2005 rwatson

Silence "busy" warnings when unmounting devfs at system shutdown. This
is a workaround for non-symetric teardown of the file systems at
shutdown with respect to the mount order at boot. The proper long term
fix is to properly detach devfs from the root mount before unmounting
each, and should be implemented, but since the problem is non-harmful,
this temporary band-aid will prevent false positive bug reports and
unnecessary error output for 6.0-RELEASE.

MFC after: 3 days
Tested by: pav, pjd


# 149034 13-Aug-2005 marcel

Make mpsafe_vfs=1 the default on ia64.


# 148922 10-Aug-2005 kan

Do not drop the vnode interlock if vdropl is called on already doomed vnode.
vdropl callers expect it to return with interlock still being held.

MFC after: 2 days


# 148768 06-Aug-2005 ssouhlal

Holding a vnode doesn't prevent v_mount from disappearing (when the
vnode is inactivated), possibly leading to a NULL dereference when
checking if the mount wants knotes to be activated in the VOP hooks.
So, we add a new vnode flag VV_NOKNOTE that is only set in getnewvnode(),
if necessary, and check it when activating knotes.
Since the flags are not erased when a vnode is being held, we can safely
read them.

Reviewed by: kris@
MFC after: 3 days


# 148671 03-Aug-2005 jeff

- Unlock before we call mac_destroy_vnode to prevent a lock order reversal.

Found by: trhodes


# 148167 20-Jul-2005 jeff

- Allow vnlru to drop giant if the filesystem does not require it. The
vnlru proc is extremely inefficient, potentially iteration over tens of
thousands of vnodes without blocking. Droping Giant allows other threads
to preempt us although we should revisit the algorithm to fix the runtime
problems especially since this may hold up all vnode allocations.
- Remove the LK_NOWAIT from the VOP_LOCK in vlrureclaim. This provides
a natural blocking point to help alleviate the situation described above
although it may not technically be desirable.
- yield after we make a pass on all mount points to prevent us from
blocking other threads which require Giant.

MFC after: 2 weeks


# 147772 05-Jul-2005 pjd

Fix one "wrong b_bufobj" panic in reassignbuf() by moving VI_UNLOCK(vp)
below KASSERT()s, which means there was no real problem here, we just
needed better locking for assertions.

OK'ed by: jeff
Approved by: re (scottl)


# 147730 01-Jul-2005 ssouhlal

Fix the recent panics/LORs/hangs created by my kqueue commit by:

- Introducing the possibility of using locks different than mutexes
for the knlist locking. In order to do this, we add three arguments to
knlist_init() to specify the functions to use to lock, unlock and
check if the lock is owned. If these arguments are NULL, we assume
mtx_lock, mtx_unlock and mtx_owned, respectively.

- Using the vnode lock for the knlist locking, when doing kqueue operations
on a vnode. This way, we don't have to lock the vnode while holding a
mutex, in filt_vfsread.

Reviewed by: jmg
Approved by: re (scottl), scottl (mentor override)
Pointyhat to: ssouhlal
Will be happy: everyone


# 147478 18-Jun-2005 jeff

- Try to catch the wrong bufobj panics a little earlier. I believe they
are actually caused by a buf with both VNCLEAN and VNDIRTY set. In
the traces it is clear that the buf is removed from the dirty queue while
it is actually on the clean queue which leaves the tail pointer set.
Assert that both flags are not set in buf_vlist_add and buf_vlist_remove.

Sponsored by: Isilon Systems, Inc.
Approved by: re (blanket vfs)


# 147407 16-Jun-2005 jeff

- Change holdcnt use around vnode recycling. We now always keep a holdcnt
ref while we're calling vgone(). This prevents transient refs from
re-adding us to the free list. Previously, a vfree() triggered via
vinvalbuf() getting rid of all of a vnode's pages could place a partially
destructed vnode on the free list where vtryrecycle() could find it. The
first call to vtryrecycle would hang up on the vnode lock, but when it
failed it would place a now dead vnode onto the free list, and another
call to vtryrecycle() would free an already free vnode. There were many
complications of having a zero ref count while freeing which can now go
away.
- Change vdropl() to release the interlock before returning. All callers
now respect this, so vdropl() directly frees VI_DOOMED vnodes once the
last ref is dropped. This means that we'll never have VI_DOOMED vnodes
on the free list.
- Seperate v_incr_usecount() into v_incr_usecount(), v_decr_usecount() and
v_decr_useonly(). The incr/decr split is so that incr usecount can
return with the interlock still held while decr drops the interlock so
it can call vdropl() which will potentially free the vnode. The calling
function can't drop the lock of an already free'd node. v_decr_useonly()
drops a usecount without droping the hold count. This is done so the
usecount reaches zero in vput() before we recycle, however the holdcount
is still 1 which prevents any new references from placing the vnode
back on the free list.
- Fix vnlrureclaim() to vhold the vnode since it doesn't do a vget(). We
wouldn't want vnlrureclaim() to bump the usecount since this has
different semantics. Also change vnlrureclaim() to do a NOWAIT on the
vn_lock. When this function runs we're usually in a desperate situation
and we wouldn't want to wait for any specific vnode to be released.
- Fix a bunch of misc comments to reflect the new behavior.
- Add vhold() and vdrop() to vflush() for the same reasons that we do in
vlrureclaim(). Previously we held no reference and a vnode could have
been freed while we were waiting on the lock.
- Get rid of vlruvp() and vfreehead(). Neither are used. vlruvp() should
really be rethought before it's reintroduced.
- vgonel() always returns with the vnode locked now and never puts the
vnode back on a free list. The vnode will be freed as soon as the last
reference is released.

Sponsored by: Isilon Systems, Inc.
Debugging help from: Kris Kennaway, Peter Holm
Approved by: re (blanket vfs)


# 147387 14-Jun-2005 jeff

- In reassignbuf() add many asserts to validate the head and tail pointers
of the clean and dirty lists. This is in an attempt to catch the wrong
bufobj problem sooner.
- In vgonel() don't acquire an extra reference in the active case, the
vnode lock and VI_DOOMED protect us from recursively cleaning.
- Also in vgonel() clean up some stale comments.

Sponsored by: Isilon Systems, Inc.
Approved by: re (blanket vfs)


# 147332 13-Jun-2005 jeff

- Don't make vgonel() globally visible, we want to change its prototype
anyway and it's not used outside of vfs_subr.c.
- Change vgonel() to accept a parameter which determines whether or not
we'll put the vnode on the free list when we're done.
- Use the new vgonel() parameter rather than VI_DOOMED to signal our
intentions in vtryrecycle().
- In vgonel() return if VI_DOOMED is already set, this vnode has already
been reclaimed.

Sponsored by: Isilon Systems, Inc.


# 147327 13-Jun-2005 jeff

- Add KTR_VFS events to vdestroy, vtruncbuf, vinvalbuf, vfreehead.

Sponsored by: Isilon Systems, Inc.


# 147297 11-Jun-2005 jeff

- Assert that we're not in the name cache anymore in vdestroy().

Sponsored by: Isilon Systems, Inc.


# 147290 11-Jun-2005 jeff

- Add KTR_VFS tracing to track the life of vnodes. Eventually KTR_VFS
events could be added to cover other interesting details.
- Add some VNASSERTs to discover places where we access vnodes after
they have been uma_zfree'd before we try to free them again.
- Add a few more VNASSERTs to vdestroy() to be certain that the vnode is
really unused.

Sponsored by: Isilon Systems, Inc.


# 147198 09-Jun-2005 ssouhlal

Allow EVFILT_VNODE events to work on every filesystem type, not just
UFS by:
- Making the pre and post hooks for the VOP functions work even when
DEBUG_VFS_LOCKS is not defined.
- Moving the KNOTE activations into the corresponding VOP hooks.
- Creating a MNTK_NOKNOTE flag for the mnt_kern_flag field of struct
mount that permits filesystems to disable the new behavior.
- Creating a default VOP_KQFILTER function: vfs_kqfilter()

My benchmarks have not revealed any performance degradation.

Reviewed by: jeff, bde
Approved by: rwatson, jmg (kqueue changes), grehan (mentor)


# 147113 07-Jun-2005 jeff

- Clear OWEINACT prior to calling VOP_INACTIVE to remove the possibility
of a vget causing another call to INACTIVE before we're finished.


# 145953 06-May-2005 cperciva

If we are going to
1. Copy a NULL-terminated string into a fixed-length buffer, and
2. copyout that buffer to userland,
we really ought to
0. Zero the entire buffer
first.

Security: FreeBSD-SA-05:08.kmem


# 145822 03-May-2005 jeff

- A vnode may have made its way onto the free list while it was being
vgone'd. We must remove it from the freelist before returning in
vtryrecycle() or we may get a duplicate free.

Reported by: kkenn


# 145785 02-May-2005 csjp

Since it is not possible for curthread to be NULL in this context,
drop the check+initialization for a straight initialization. Also
assert that curthread will never be NULL just to be sure.

Discussed with: rwatson, peter
MFC after: 1 week


# 145767 01-May-2005 jeff

- All buffers should either be clean or dirty. If neither of these flags
are set when we attempt to remove a buffer from a queue we should panic.
Hopefully this will catch the source of the wrong bufobj panics.

Sponsored by: Isilon Systems, Inc.


# 145697 30-Apr-2005 jeff

- In vnlru_free() remove the vnode from the free list before we call
vtryrecycle(). We could sometimes get into situations where two threads
could try to recycle the same vnode before this.
- vtryrecycle() is now responsible for returning the vnode to the free list
if it fails and someone else hasn't done it.
- Make a new function vfreehead() which moves a vnode to the head of the
free list and use it in vgone() to clean up that code a bit.

Sponsored by: Isilon Systems, Inc.
Reported by: pho, kkenn


# 145590 27-Apr-2005 jeff

- Don't vgonel() via vgone() or vrecycle() if the vnode is already doomed.
This fixes forced unmounts via nullfs.

Reported by: kkenn
Sponsored by: Isilon Systems, Inc.


# 145588 27-Apr-2005 jeff

- Stop setting vxthread, we've asserted that it was useless for several
weeks now.


# 145385 22-Apr-2005 jeff

- Disable code which allows getnewvnode() to fail. Many ffs_vget() callers
do not correctly deal with failures. This presently risks deadlock
problems if dependency processing is held up by failures to allocate
a vnode, however, this is better than the situation with the failures.

Sponsored by: Isilon Systems, Inc.


# 145249 18-Apr-2005 phk

Initialize mountlist_mtx with an MTX_SYSINIT(), we need it to be ready
earlier.


# 145005 13-Apr-2005 jeff

- Change vop_lookup_post assertions to reflect recent vfs_lookup changes.

Sponsored by: Isilon Systems, Inc.


# 144908 11-Apr-2005 jeff

- Enable ASSERT_VOP_ELOCKED and assert_vop_elocked() now that vnode_if.awk
uses it.

Sponsored by: Isilon Systems, Inc.


# 144900 11-Apr-2005 jeff

- Change the VOP_LOCK UPGRADE in vput() to do a LK_NOWAIT to avoid a
potential lock order reversal. Also, don't unlock the vnode if this
fails, lockmgr has already unlocked it for us.
- Restructure vget() now that vn_lock() does all of VI_DOOMED checking
for us and also handles the case where there is no real lock type.
- If VI_OWEINACT is set, we need to upgrade the lock request to EXCLUSIVE
so that we can call inactive. It's not legal to vget a vnode that hasn't
had INACTIVE called yet.

Sponsored by: Isilon Systems, Inc.


# 144704 06-Apr-2005 jeff

- Assert that the bufobj matches in flushbuflists. I still haven't gotten
to root cause on exactly how this happens.
- If the assert is disabled, we presently try to handle this case, but the
BUF_UNLOCK was missing. Thus, if this condition ever hit we would leak
a buf lock.

Many thanks to Peter Holm for all his help in finding this bug. He really
put more effort into it than I did.


# 144661 05-Apr-2005 jeff

- Move NDFREE() from vfs_subr to vfs_lookup where namei() is.


# 144625 04-Apr-2005 jeff

- Add a missing unlock of the vnode_free_list_mtx.

Spotted by: Antoine Brodin


# 144624 04-Apr-2005 jeff

- Instead of waiting forever to get a vnode in getnewvnode() wait for
one to become available for one second and then return ENFILE. We
can run out of vnodes, and there must be a hard limit because without
one we can quickly run out of KVA on x86. Presently the system can
deadlock if there are maxvnodes directories in the namecache. The
original 4.x BSD behavior was to return ENFILE if we reached the max,
but 4.x BSD did not have the vnlru proc so it was less profitable to
wait.


# 144374 31-Mar-2005 jeff

- Disable vfs shared locks by default. They must be specifically enabled
on filesystems which safely support them. It appears that many
network filesystems specifically are not shared lock safe.

Sponsored by: Isilon Systems, Inc.


# 144367 31-Mar-2005 jeff

- LK_NOPAUSE is a nop now.

Sponsored by: Isilon Systems, Inc.


# 144319 30-Mar-2005 das

Eliminate v_id and v_ddid. The name cache now holds references to
vnodes whose names it caches, so we no longer need a `generation
number' to tell us if a referenced vnode is invalid. Replace the use
of the parent's v_id in the hash function with the address of the
parent vnode.

Tested by: Peter Holm
Glanced at by: jeff, phk


# 144284 29-Mar-2005 jeff

- Dont clear OWEINACT in vbusy(), we still owe an inactive call if someone
vhold()s us.
- Avoid an extra mutex acquire and release in the common case of vgonel()
by checking for OWEINACT at the start of the function.
- Fix the case where we set OWEINACT in vput(). LK_EXCLUPGRADE drops our
shared lock if it fails.

Sponsored by: Isilon Systems, Inc.


# 144283 29-Mar-2005 jeff

- Don't initial v_dd here, let cache_purge() do it for us.

Sponsored by: Isilon Systems, Inc.


# 144219 28-Mar-2005 jeff

- Move code that should probably be an assert above the main body of
vrele so that we can decrease the indentation of the real work and
make things slightly more clear.

Sponsored by: Isilon Systems, Inc.


# 144204 28-Mar-2005 jeff

- Adjust asserts in vop_lookup_post() to match the new post PDIRUNLOCK
vfs.

Sponsored by: Isilon Systems, Inc.


# 144173 27-Mar-2005 phk

Remove another ';' after if().

Also spotted by: bz


# 144172 27-Mar-2005 phk

Remove extra ; at end of if().

Found by: bz


# 144092 25-Mar-2005 jeff

- Don't recycle vnodes anymore. Free them once they are dead. getnewvnode
now always allocates a new vnode.
- Define a new function, vnlru_free, which frees vnodes from the free list.
It takes as a parameter the number of vnodes to free, which is
wantfreevnodes - freevnodes when called from vnlru_proc or 1 when
called from getnewvnode(). For now, getnewvnode() still tries to reclaim
a free vnode before creating a new one when we are near the limit.
- Define a function, vdestroy, which handles the actual release of memory
and teardown of locks, etc. This could become a uma_dtor() routine.
- Get rid of minvnodes. Now wantfreevnodes is 1/4th the max vnodes. This
keeps more unreferenced vnodes around so that files which have only
been stat'd are less likely to be kicked out of the system before we
have a chance to read them, etc. These vnodes may still be freed via
the normal vnlru_proc() routines which may some day become a real lru.


# 144055 24-Mar-2005 jeff

- Pass LK_EXCLUSIVE to VFS_ROOT() to satisfy the new flags argument. For
now, all calls to VFS_ROOT() should still acquire exclusive locks.

Sponsored by: Isilon Systems, Inc.


# 144051 24-Mar-2005 jeff

- If vput() is called with a shared lock it must upgrade to an exclusive
before it can call VOP_INACTIVE(). This must use the EXCLUPGRADE path
because we may violate some lock order with another locked vnode if
we drop and reacquire the lock. If EXCLUPGRADE fails, we mark the
vnode with VI_OWEINACT. This case should be very rare.
- Clear VI_OWEINACT in vinactive() and vbusy().
- If VI_OWEINACT is set in vgone() do the VOP_INACTIVE call here as well.

Sponsored by: Isilon Systems, Inc.


# 143652 15-Mar-2005 jeff

- Now that there are no external users of vfree() make it static.
- Move VSHOULDBUSY, VSHOULDFREE, and VTRYRECYCLE into vfs_subr.c so
no one else attempts to grow a dependency on them.
- Now that objects with pages hold the vnode we don't have to do unlocked
checks for the page count in the vm object in VSHOULDFREE. These three
macros could simply check for holdcnt state transitions to determine
whether the vnode is on the free list already, but the extra safety
the flag affords us is probably worth the minimal cost.
- The leafonly sysctl and code have been dead for several years now,
remove the sysctl and the code that employed it from vtryrecycle().
- vtryrecycle() also no longer has to check the object's page count as
the object holds the vnode until it reaches 0.

Sponsored by: Isilon Systems, Inc.


# 143640 15-Mar-2005 jeff

- Expose vholdl() so it may be used outside of vfs_subr.c


# 143560 14-Mar-2005 jeff

- Increment the holdcnt once for each usecount reference. This allows us
to use only the holdcnt to determine whether a vnode may be recycled,
simplifying the V* macros as well as vtryrecycle(), etc.

Sponsored by: Isilon Systems, Inc.


# 143557 14-Mar-2005 jeff

- We do not have to check the object's ref_count in VSHOULDFREE or
vtryrecycle(). All obj refs also ref the vnode.
- Consistently use v_incr_usecount() to increment the usecount. This will
be more important later.

Sponsored by: Isilon Systems, Inc.


# 143552 14-Mar-2005 jeff

- Slightly rearrange vrele() to move the common case in one indentation
level.

Sponsored by: Isilon Systems, Inc.


# 143551 14-Mar-2005 jeff

- Rework vget() so we drop the usecount in two failure cases that were
missed by my last commit.

Sponsored by: Isilon Systems, Inc.


# 143497 13-Mar-2005 jeff

- Remove vx_lock, vx_unlock, vx_wait, etc.
- Add a vn_start_write/vn_finished_write around vlrureclaim so we don't do
writing ops without suspending. This could suspend the vlruproc which
should not be a problem under normal circumstances.
- Manually implement VMIGHTFREE in vlrureclaim as this was the only instance
where it was used.
- Acquire a lock before calling vgone() as it now requires it.
- Move the acquisition of the vnode interlock from vtryrecycle() to
getnewvnode() so that if it fails we don't drop and reacquire the
vnode_free_list_mtx.
- Check for a usecount or holdcount at the end of vtryrecycle() in case
someone grabbed a ref while we were recycling. Abort the recycle, and
on the final ref drop this vnode will be placed on the head of the free
list.
- Move the redundant VOP_INACTIVE protection code into the local
vinactive() routine to avoid code bloat.
- Keep the vnode lock held across calls to vgone() in several places.
- vgonel() no longer uses XLOCK, instead callers must hold an exclusive
vnode lock. The VI_DOOMED flag is set to allow other threads to detect
a vnode which is no longer valid. This flag is set until the last
reference is gone, and there are no chances for a new ref. vgonel()
holds this lock across the entire function, which greatly simplifies
logic.
_ Only vfree() in one place in vgone() not three.
- Adjust vget() to check the VI_DOOMED flag prior to waiting on the lock
in the LK_NOWAIT case. In other cases, check after we have slept and
acquired an exlusive lock. This will simulate the old vx_wait()
behavior.

Sponsored by: Isilon Systems, Inc.


# 142296 23-Feb-2005 jeff

- Enable SMP VFS by default on current. More users are needed to turn up
any remaining bugs. Anyone inconvenienced by this can still disable it
in the loader.

Sponsored by: Isilon Systems, Inc.


# 142265 23-Feb-2005 jeff

- Only the xlock holder should be calling VOP_LOCK on a vp once VI_XLOCK
has been set. Assert that this is the case so that we catch filesystems
who are using naked VOP_LOCKs in illegal cases.

Sponsored by: Isilon Systems, Inc.


# 142264 22-Feb-2005 jeff

- Add a check for xlock in vop_lock_assert. Presently the xlock is
considered to be as good as an exclusive lock, although there is still a
possibility of someone acquiring a VOP LOCK while xlock is held.

Sponsored by: Isilon Systems, Inc.


# 142252 22-Feb-2005 phk

Zero the v_un container field to make sure everything is gone.


# 142242 22-Feb-2005 phk

Reap more benefits from DEVFS:

List devfs_dirents rather than vnodes off their shared struct cdev, this
saves a pointer field in the vnode at the expense of a field in the
devfs_dirent. There are often 100 times more vnodes so this is bargain.
In addition it makes it harder for people to try to do stypid things like
"finding the vnode from cdev".

Since DEVFS handles all VCHR nodes now, we can do the vnode related
cleanup in devfs_reclaim() instead of in dev_rel() and vgonel().
Similarly, we can do the struct cdev related cleanup in dev_rel()
instead of devfs_reclaim().

rename idestroy_dev() to destroy_devl() for consistency.

Add LIST_ENTRY de_alias to struct devfs_dirent.
Remove v_specnext from struct vnode.
Change si_hlist to si_alist in struct cdev.
String new devfs vnodes' devfs_dirent on si_alist when
we create them and take them off in devfs_reclaim().

Fix devfs_revoke() accordingly. Also don't clear fields
devfs_reclaim() will clear when called from vgone();

Let devfs_reclaim() call dev_rel() instead of vgonel().

Move the usecount tracking from dev_rel() to devfs_reclaim(),
and let dev_rel() take a struct cdev argument instead of vnode.

Destroy SI_CHEAPCLONE devices in dev_rel() (instead of
devfs_reclaim()) when they are no longer used. (This
should maybe happen in devfs_close() instead.)


# 142225 22-Feb-2005 phk

Remove vfinddev(), it is generally bogus when faced with jails and
chroot and has no legitimate use(r)s in the tree.


# 142079 19-Feb-2005 phk

Try to unbreak the vnode locking around vop_reclaim() (based mostly on
patch from kan@).

Pull bufobj_invalbuf() out of vinvalbuf() and make g_vfs call it on
close. This is not yet a generally safe function, but for this very
specific use it is safe. This solves the problem with buffers not
being flushed by unmount or after failed mount attempts.


# 142042 18-Feb-2005 phk

Make sure to drop the VI_LOCK in vgonel();

Spotted by: Taku YAMAMOTO <taku@tackymt.homeip.net>


# 142011 17-Feb-2005 phk

Introduce vx_wait{l}() and use it instead of home-rolled versions.


# 142010 17-Feb-2005 phk

Convert KASSERTS to VNASSERTS


# 141637 10-Feb-2005 phk

Make various vnode related functions static


# 141605 10-Feb-2005 phk

Don't pass NULL to vprint()


# 141547 08-Feb-2005 jeff

- Add a new assert in the getnewvnode(). Assert that the usecount is still
0 to detect getnewvnode() races.
- Add the vnode address to a few panics near by to help in debugging.

Sponsored by: Isilon Systems, Inc.


# 141450 07-Feb-2005 phk

Access vmobject via the bufobj instead of the vnode


# 141435 07-Feb-2005 phk

Don't call VOP_DESTROYVOBJECT(), trust that VOP_RECLAIM() did what
was necessary.


# 140936 28-Jan-2005 phk

Remove unused argument to vrecycle()


# 140935 28-Jan-2005 phk

Integrate vclean() into vgonel().

Various associated polishing.


# 140934 28-Jan-2005 phk

Remove register keyword


# 140782 25-Jan-2005 phk

Don't use VOP_GETVOBJECT, use vp->v_object directly.


# 140772 24-Jan-2005 phk

Eliminate the constant flags argument to vclean()


# 140739 24-Jan-2005 phk

Change vprint() to vn_printf() which takes varargs.
Add #define for vprint() to call vn_printf().


# 140734 24-Jan-2005 phk

Kill the VV_OBJBUF and test the v_object for NULL instead.


# 140720 24-Jan-2005 jeff

- Add the tunable and sysctl for the mpsafevfs. It currently defaults
to off.
- Protect access to mnt_kern_flag with the mointpoint mutex.
- Remove some KASSERTs which are not legal checks without the appropriate
locks held.
- Use VCANRECYCLE() rather than rolling several slightly different
checks together.
- Return from vtryrecycle() with a recycled vnode rather than a locked
vnode. This simplifies some locking.
- Remove several GIANT_REQUIRED lines.
- Add a few KASSERTs to help with INACT debugging.

Sponsored By: Isilon Systems, Inc.


# 140360 16-Jan-2005 phk

Fix a bug I introduced in 1.561 which has caused considerable filesystem
unhappiness lately.

As far as I can tell, no files that have made it safely to disk
have been endangered, but stuff in transit has been in peril.

Pointy hat: phk


# 140220 14-Jan-2005 phk

Eliminate unused and unnecessary "cred" argument from vinvalbuf()


# 140181 13-Jan-2005 phk

Ditch vfs_object_create() and make the callers call VOP_CREATEVOBJECT()
directly.


# 140056 11-Jan-2005 phk

Add BO_SYNC() and add a default which uses the secret vnode pointer
and VOP_FSYNC() for now.


# 140053 11-Jan-2005 phk

More vnode -> bufobj migration.


# 140052 11-Jan-2005 phk

Give flushbuflist() a struct bufv as first argument and avoid home-rolling
TAILQ_FOREACH_SAFE().

Loose the error pointer argument and return any errors the normal way.

Return EAGAIN for the case where more work needs to be done.


# 140048 11-Jan-2005 phk

Remove the unused credential argument from VOP_FSYNC() and VFS_SYNC().

I'm not sure why a credential was added to these in the first place, it is
not used anywhere and it doesn't make much sense:

The credentials for syncing a file (ability to write to the
file) should be checked at the system call level.

Credentials for syncing one or more filesystems ("none")
should be checked at the system call level as well.

If the filesystem implementation needs a particular credential
to carry out the syncing it would logically have to the
cached mount credential, or a credential cached along with
any delayed write data.

Discussed with: rwatson


# 139804 06-Jan-2005 imp

/* -> /*- for copyright notices, minor format tweaks as necessary


# 139665 04-Jan-2005 phk

Since we do not support forceful unmount of DEVFS we can do away with
the partially implemented vnode-readoption code in vgonechrl().


# 139085 20-Dec-2004 phk

We can only ever get to vgonechrl() from a devfs vnode, so we do not
need to reassign the vp->v_op to devfs_specops, we know that is the
value already.

Make devfs_specops private to devfs.


# 138509 07-Dec-2004 phk

The remaining part of nmount/omount/rootfs mount changes. I cannot sensibly
split the conversion of the remaining three filesystems out from the root
mounting changes, so in one go:

cd9660:
Convert to nmount.
Add omount compat shims.
Remove dedicated rootfs mounting code.
Use vfs_mountedfrom()
Rely on vfs_mount.c calling VFS_STATFS()

nfs(client):
Convert to nmount (the simple way, mount_nfs(8) is still necessary).
Add omount compat shims.
Drop COMPAT_PRELITE2 mount arg compatibility.

ffs:
Convert to nmount.
Add omount compat shims.
Remove dedicated rootfs mounting code.
Use vfs_mountedfrom()
Rely on vfs_mount.c calling VFS_STATFS()

Remove vfs_omount() method, all filesystems are now converted.

Remove MNTK_WANTRDWR, handling RO/RW conversions is a filesystem
task, and they all do it now.

Change rootmounting to use DEVFS trampoline:

vfs_mount.c:
Mount devfs on /. Devfs needs no 'from' so this is clean.
symlink /dev to /. This makes it possible to lookup /dev/foo.
Mount "real" root filesystem on /.
Surgically move the devfs mountpoint from under the real root
filesystem onto /dev in the real root filesystem.

Remove now unnecessary getdiskbyname().

kern_init.c:
Don't do devfs mounting and rootvnode assignment here, it was
already handled by vfs_mount.c.

Remove now unused bdevvp(), addaliasu() and addalias(). Put the
few necessary lines in devfs where they belong. This eliminates the
second-last source of bogo vnodes, leaving only the lemming-syncer.

Remove rootdev variable, it doesn't give meaning in a global context and
was not trustworth anyway. Correct information is provided by
statfs(/).


# 138344 03-Dec-2004 phk

Improve vprint() a little bit: break long lines, reduce indent and tell
if the VI_LOCK() is held.


# 138290 01-Dec-2004 phk

Back when VOP_* was introduced, we did not have new-style struct
initializations but we did have lofty goals and big ideals.

Adjust to more contemporary circumstances and gain type checking.

Replace the entire vop_t frobbing thing with properly typed
structures. The only casualty is that we can not add a new
VOP_ method with a loadable module. History has not given
us reason to belive this would ever be feasible in the the
first place.

Eliminate in toto VOCALL(), vop_t, VNODEOP_SET() etc.

Give coda correct prototypes and function definitions for
all vop_()s.

Generate a bit more data from the vnode_if.src file: a
struct vop_vector and protype typedefs for all vop methods.

Add a new vop_bypass() and make vop_default be a pointer
to another struct vop_vector.

Remove a lot of vfs_init since vop_vector is ready to use
from the compiler.

Cast various vop_mumble() to void * with uppercase name,
for instance VOP_PANIC, VOP_NULL etc.

Implement VCALL() by making vdesc_offset the offsetof() the
relevant function pointer in vop_vector. This is disgusting
but since the code is generated by a script comparatively
safe. The alternative for nullfs etc. would be much worse.

Fix up all vnode method vectors to remove casts so they
become typesafe. (The bulk of this is generated by scripts)


# 137721 15-Nov-2004 phk

Move pbgetvp() and pbrelvp() to vm_pager.c with the rest of the pbuf stuff.


# 137690 14-Nov-2004 phk

Move the bit of the syncer which deals with vnodes into a separate
function.


# 137680 13-Nov-2004 phk

Eliminate vop_revoke() function now that devfs_revoke() does the entire job.


# 137508 10-Nov-2004 phk

Slim vnodes by another four bytes by eliminating the (now) unused field
v_cachedid.


# 137506 10-Nov-2004 phk

Remove vn_todev()


# 137483 09-Nov-2004 phk

Remove vnode->v_cachedfs.

It was only used for the highly dangerous "export all vnodes with a sysctl"
function.


# 137186 04-Nov-2004 phk

Remove buf->b_dev field.


# 137170 03-Nov-2004 phk

Always initialize bo_private along with bo_ops in getnewvnode().

Spotted by: tegge


# 137050 29-Oct-2004 phk

Loose vfs_mountedon()


# 137033 29-Oct-2004 phk

Give the bufobj a private __bo_vnode for now to keep the syncer floating [1]
At some point later the syncer will unlearn about vnodes and the filesystems
method called by the syncer will know enough about what's in bo_private to
do the right thing.

[1] Ok, I know, but I couldn't resist the pun.


# 136992 27-Oct-2004 phk

Move the syncer linkage from vnode to bufobj.

This is not quite a perfect separation: the syncer still think it knows
that everything is a vnode.


# 136966 26-Oct-2004 phk

Put the I/O block size in bufobj->bo_bsize.

We keep si_bsize_phys around for now as that is the simplest way to pull
the number out of disk device drivers in devfs_open(). The correct solution
would be to do an ioctl(DIOCGSECTORSIZE), but the point is probably mooth
when filesystems sit on GEOM, so don't bother for now.


# 136943 25-Oct-2004 phk

Loose the v_dirty* and v_clean* alias macros.

Check the count field where we just want to know the full/empty state,
rather than using TAILQ_EMPTY() or TAILQ_FIRST().


# 136941 25-Oct-2004 phk

Remove vnode->v_bsize. This was a dead-end.


# 136938 25-Oct-2004 phk

Collapse vnode->v_object and buf->b_object into bufobj->bo_object.


# 136927 24-Oct-2004 phk

Move the buffer method vector (buf->b_op) to the bufobj.

Extend it with a strategy method.

Add bufstrategy() which do the usual VOP_SPECSTRATEGY/VOP_STRATEGY
song and dance.

Rename ibwrite to bufwrite().

Move the two NFS buf_ops to more sensible places, add bufstrategy
to them.

Add inlines for bwrite() and bstrategy() which calls through
buf->b_bufobj->b_ops->b_{write,strategy}().

Replace almost all VOP_STRATEGY()/VOP_SPECSTRATEGY() calls with bstrategy().


# 136771 22-Oct-2004 rwatson

When MAC is enabled, warn if getnewvnode() is asked to produce a vnode
without a mountpoint. In this scenario, there's no useful source for
a label on the vnode, since we can't query the mountpoint for the
labeling strategy or default label.


# 136770 22-Oct-2004 phk

Alas, poor SPECFS! -- I knew him, Horatio; A filesystem of infinite
jest, of most excellent fancy: he hath taught me lessons a thousand
times; and now, how abhorred in my imagination it is! my gorge rises
at it. Here were those hacks that I have curs'd I know not how
oft. Where be your kludges now? your workarounds? your layering
violations, that were wont to set the table on a roar?

Move the skeleton of specfs into devfs where it now belongs and
bury the rest.


# 136767 22-Oct-2004 phk

Add b_bufobj to struct buf which eventually will eliminate the need for b_vp.

Initialize b_bufobj for all buffers.

Make incore() and gbincore() take a bufobj instead of a vnode.

Make inmem() local to vfs_bio.c

Change a lot of VI_[UN]LOCK(bp->b_vp) to BO_[UN]LOCK(bp->b_bufobj)
also VI_MTX() to BO_MTX(),

Make buf_vlist_add() take a bufobj instead of a vnode.

Eliminate other uses of bp->b_vp where bp->b_bufobj will do.

Various minor polishing: remove "register", turn panic into KASSERT,
use new function declarations, TAILQ_FOREACH_SAFE() etc.


# 136751 21-Oct-2004 phk

Move the VI_BWAIT flag into no bo_flag element of bufobj and call it BO_WWAIT

Add bufobj_wref(), bufobj_wdrop() and bufobj_wwait() to handle the write
count on a bufobj. Bufobj_wdrop() replaces vwakeup().

Use these functions all relevant places except in ffs_softdep.c where
the use if interlocked_sleep() makes this impossible.

Rename b_vnbufs to b_bobufs now that we touch all the relevant files anyway.


# 136750 21-Oct-2004 phk

Add BO_* macros parallel to VI_* macros for manipulating the bo_mtx.

Initialize the bo_mtx when we allocate a vnode i getnewvnode() For
now we point to the vnodes interlock mutex, that retains the exact
same locking sematics.

Move v_numoutput from vnode to bufobj. Add renaming macro to
postpone code sweep.


# 136749 21-Oct-2004 phk

Polish vtruncbuf() to improve readability and style a bit.


# 136747 21-Oct-2004 phk

Simplify buf_vlist_remove().

Now that we have encapsulated the splaytree related information
into a structure we can eliminate the half of this function.


# 136180 06-Oct-2004 grog

vtryrecycle: Don't rely on type VBAD alone to mean that we don't need
to clean the vnode. If v_data is set, we still need to
clean it. This code change should catch all incidents of
the previous commit (INVARIANTS only).


# 136179 06-Oct-2004 grog

getnewvnode: Weaken the panic "cleaned vnode isn't" to a warning.

Discussion: this panic (or waning) only occurs when the kernel is
compiled with INVARIANTS. Otherwise the problem (which means that
the vp->v_data field isn't NULL, and represents a coding error and
possibly a memory leak) is silently ignored by setting it to NULL
later on.

Panicking here isn't very helpful: by this time, we can only find
the symptoms. The panic occurs long after the reason for "not
cleaning" has been forgotten; in the case in point, it was the
result of severe file system corruption which left the v_type field
set to VBAD. That issue will be addressed by a separate commit.


# 136014 01-Oct-2004 phk

Fix a LOR relating to freeing cdevs.


# 135708 24-Sep-2004 phk

Hold dev_lock and check for NULL devsw pointer when we determine
if a vnode is a disk.


# 135600 23-Sep-2004 phk

Do not refcount the cdevsw, but rather maintain a cdev->si_threadcount
of the number of threads which are inside whatever is behind the
cdevsw for this particular cdev.

Make the device mutex visible through dev_lock() and dev_unlock().
We may want finer granularity later.

Replace spechash_mtx use with dev_lock()/dev_unlock().


# 135280 15-Sep-2004 phk

Remove unused B_WRITEINPROG flag


# 134899 07-Sep-2004 phk

Create simple function init_va_filerev() for initializing a va_filerev
field.

Replace three instances of longhaired initialization va_filerev fields.

Added XXX comment wondering why we don't use random bits instead of
uptime of the system for this purpose.


# 134091 20-Aug-2004 truckman

Don't attempt to trigger the syncer thread final sync code in the
shutdown_pre_sync state if the RB_NOSYNC flag is set. This is the
likely cause of hangs after a system panic that are keeping crash
dumps from being done.

This is a MFC candidate for RELENG_5.

MFC after: 3 days


# 133826 16-Aug-2004 obrien

s/MAX_SAFE_MAXVNODES/MAXVNODES_MAX/g


# 133741 15-Aug-2004 jmg

Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers. Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.

Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks. Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by: green, rwatson (both earlier versions)


# 133462 11-Aug-2004 rwatson

In v_addpollinfo(), we allocate storage to back vp->v_pollinfo. However,
we may sleep when doing so; check that we didn't race with another thread
allocating storage for the vnode after allocation is made to a local
pointer, and only update the vnode pointer if it's still NULL. Otherwise,
accept that another thread got there first, and release the local storage.

Discussed with: jmg


# 133418 10-Aug-2004 njl

Skip the syncing disks loop if there are no dirty buffers. Remove a
variable used to flag the initial printf.

Submitted by: truckman (earlier version)


# 133038 02-Aug-2004 obrien

Put a cap on the auto-tuning of kern.maxvnodes.

Cap value chosen by: scottl


# 132866 30-Jul-2004 njl

Minor message cleanup.


# 132710 27-Jul-2004 phk

Convert the vfsconf list to a TAILQ.

Introduce vfs_byname() function to find things on it.

Staticize vfs_nmount() function under the name vfs_donmount().

Various cleanups.


# 132653 26-Jul-2004 cperciva

Rename suser_cred()'s PRISON_ROOT flag to SUSER_ALLOWJAIL. This is
somewhat clearer, but more importantly allows for a consistent naming
scheme for suser_cred flags.

The old name is still defined, but will be removed in a few days (unless I
hear any complaints...)

Discussed with: rwatson, scottl
Requested by: jhb


# 132640 25-Jul-2004 phk

Eliminate unused second argument to reassignbuf() and simplify it
accordingly.


# 132489 21-Jul-2004 alfred

put several of the options for DEBUG_VFS_LOCKS under control of sysctls.


# 132197 15-Jul-2004 alfred

Cleanup shutdown output.


# 132177 15-Jul-2004 alfred

Tidy up system shutdown.


# 132023 12-Jul-2004 alfred

Make VFS_ROOT() and vflush() take a thread argument.
This is to allow filesystems to decide based on the passed thread
which vnode to return.
Several filesystems used curthread, they now use the passed thread.


# 132007 12-Jul-2004 alfred

Dump the actual bad values when this assertion is tripped.


# 131934 10-Jul-2004 marcel

Update for the KDB framework:
o Call kdb_enter() instead of Debugger().


# 131784 08-Jul-2004 alfred

fixup sysctl by fsid node


# 131695 06-Jul-2004 alfred

Introduce vfs_suser(), used to test if a user should have special privs
for a mount.


# 131691 06-Jul-2004 alfred

NFS mobility PHASE I, II & III (phase VI, and V pending):

Rebind the client socket when we experience a timeout. This fixes
the case where our IP changes for some reason.

Signal a VFS event when NFS transitions from up to down and vice
versa.

Add a placeholder vfs_sysctl where we will put status reporting
shortly.

Also:
Make down NFS mounts return EIO instead of EINTR when there is a
soft timeout or force unmount in progress.


# 131650 05-Jul-2004 truckman

Unconditionally set last_work_seen while in the SYNCER_RUNNING state
so that last_work_seen has a reasonable value at the transition
to the SYNCER_SHUTTING_DOWN state, even if net_worklist_len happened
to be zero at the time.

Initialize last_work_seen to zero as a safety measure in case the
syncer never ran in the SYNCER_RUNNING state.

Tested by: phk


# 131602 05-Jul-2004 truckman

Rework syncer termination code:

Speed up the syncer when shutting down by sleeping for a shorter
period of time instead of cranking up rushjob and using the
normal one second sleep.

Skip empty worklist slots when shutting down to avoid lengthy
intervals of inactivity.

Give I/O more time to complete between steps by not speeding the
syncer quite as much.

Terminate the syncer after one full pass through the worklist
plus one second with the worklist containing nothing but syncer
vnodes.

Print an indication of shutdown progress to the console.

Add a sysctl, vfs.worklist_len, to allow the size of the syncer worklist
to be monitored.


# 131598 04-Jul-2004 phk

Give synthetic root filesystem device vnodes a v_bsize of DEV_BSIZE.


# 131593 04-Jul-2004 alfred

Pass the operation in with the fsidctl.
Remove some fsidctls that we will not be using.
Correct prototypes for fs sysctls.


# 131590 04-Jul-2004 phk

Make the last commit handle non-phk root devices better.


# 131565 04-Jul-2004 phk

Blocksize for I/O should be a property of the vnode and not found by groping
around in the vnodes surroundings when we allocate a block.

Assign a blocksize when we create a vnode, and yell a warning (and ignore it)
if we got the wrong size.

Please email all such warnings to me.


# 131562 04-Jul-2004 alfred

Introduce a new kevent filter. EVFILT_FS that will be used to signal
generic filesystem events to userspace. Currently only mount and unmount
of filesystems are signalled. Soon to be added, up/down status of NFS.

Introduce a sysctl node used to route requests to/from filesystems
based on filesystem ids.

Introduce a new vfsop, vfs_sysctl(mp, req) that is used as the callback/
entrypoint by the sysctl code to change individual filesystems.


# 131559 04-Jul-2004 alfred

Revision 1.496 would not boot on my system due to
ffs_mount -> bdevvp -> getnewvnode(..., mp = NULL, ...) ->
insmntqueue(vp, mp = NULL) -> KASSERT -> panic

Make getnewvnode() only call insmntqueue() if the mountpoint parameter
is not NULL.


# 131551 04-Jul-2004 phk

When we traverse the vnodes on a mountpoint we need to look out for
our cached 'next vnode' being removed from this mountpoint. If we
find that it was recycled, we restart our traversal from the start
of the list.

Code to do that is in all local disk filesystems (and a few other
places) and looks roughly like this:

MNT_ILOCK(mp);
loop:
for (vp = TAILQ_FIRST(&mp...);
(vp = nvp) != NULL;
nvp = TAILQ_NEXT(vp,...)) {
if (vp->v_mount != mp)
goto loop;
MNT_IUNLOCK(mp);
...
MNT_ILOCK(mp);
}
MNT_IUNLOCK(mp);

The code which takes vnodes off a mountpoint looks like this:

MNT_ILOCK(vp->v_mount);
...
TAILQ_REMOVE(&vp->v_mount->mnt_nvnodelist, vp, v_nmntvnodes);
...
MNT_IUNLOCK(vp->v_mount);
...
vp->v_mount = something;

(Take a moment and try to spot the locking error before you read on.)

On a SMP system, one CPU could have removed nvp from our mountlist
but not yet gotten to assign a new value to vp->v_mount while another
CPU simultaneously get to the top of the traversal loop where it
finds that (vp->v_mount != mp) is not true despite the fact that
the vnode has indeed been removed from our mountpoint.

Fix:

Introduce the macro MNT_VNODE_FOREACH() to traverse the list of
vnodes on a mountpoint while taking into account that vnodes may
be removed from the list as we go. This saves approx 65 lines of
duplicated code.

Split the insmntque() which potentially moves a vnode from one mount
point to another into delmntque() and insmntque() which does just
what the names say.

Fix delmntque() to set vp->v_mount to NULL while holding the
mountpoint lock.


# 131428 01-Jul-2004 truckman

When shutting down the syncer kernel thread, first tell it to run
faster and iterate to over its work list a few times in an attempt
to empty the work list before the syncer terminates. This leaves
fewer dirty blocks to be written at the "syncing disks" stage and
keeps the the "giving up on N buffers" problem from being triggered
by the presence of a large soft updates work list at system shutdown
time. The downside is that the syncer takes noticeably longer to
terminate.

Tested by: "Arjan van Leeuwen" <avleeuwen AT piwebs DOT com>
Approved by: mckusick


# 130640 17-Jun-2004 phk

Second half of the dev_t cleanup.

The big lines are:
NODEV -> NULL
NOUDEV -> NODEV
udev_t -> dev_t
udev2dev() -> findcdev()

Various minor adjustments including handling of userland access to kernel
space struct cdev etc.


# 130585 16-Jun-2004 phk

Do the dreaded s/dev_t/struct cdev */
Bump __FreeBSD_version accordingly.


# 130470 14-Jun-2004 phk

Remove a left over from userland buffer-cache access to disks.


# 129898 31-May-2004 rwatson

Assert Giant in vrele().


# 128141 11-Apr-2004 mux

Put deprecated sysctl code inside BURN_BRIDGES.


# 127911 05-Apr-2004 imp

Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core


# 127593 29-Mar-2004 peter

Kill some XXXKSE's. vnlru/syncer are single threaded.


# 126853 11-Mar-2004 phk

Properly vector all bwrite() and BUF_WRITE() calls through the same path
and s/BUF_WRITE()/bwrite()/ since it now does the same as bwrite().


# 126851 11-Mar-2004 phk

Remove unused second arg to vfinddev().
Don't call addaliasu() on VBLK nodes.


# 126683 06-Mar-2004 kan

Always call vn_finished_write after vn_start_write was called. All
occurences of 'goto done' after vn_start_write invocation were cleaning
up incompletely.


# 126326 27-Feb-2004 jhb

Switch the sleep/wakeup and condition variable implementations to use the
sleep queue interface:
- Sleep queues attempt to merge some of the benefits of both sleep queues
and condition variables. Having sleep qeueus in a hash table avoids
having to allocate a queue head for each wait channel. Thus, struct cv
has shrunk down to just a single char * pointer now. However, the
hash table does not hold threads directly, but queue heads. This means
that once you have located a queue in the hash bucket, you no longer have
to walk the rest of the hash chain looking for threads. Instead, you have
a list of all the threads sleeping on that wait channel.
- Outside of the sleepq code and the sleep/cv code the kernel no longer
differentiates between cv's and sleep/wakeup. For example, calls to
abortsleep() and cv_abort() are replaced with a call to sleepq_abort().
Thus, the TDF_CVWAITQ flag is removed. Also, calls to unsleep() and
cv_waitq_remove() have been replaced with calls to sleepq_remove().
- The sched_sleep() function no longer accepts a priority argument as
sleep's no longer inherently bump the priority. Instead, this is soley
a propery of msleep() which explicitly calls sched_prio() before
blocking.
- The TDF_ONSLEEPQ flag has been dropped as it was never used. The
associated TDF_SET_ONSLEEPQ and TDF_CLR_ON_SLEEPQ macros have also been
dropped and replaced with a single explicit clearing of td_wchan.
TD_SET_ONSLEEPQ() would really have only made sense if it had taken
the wait channel and message as arguments anyway. Now that that only
happens in one place, a macro would be overkill.


# 126253 26-Feb-2004 truckman

Split the mlock() kernel code into two parts, mlock(), which unpacks
the syscall arguments and does the suser() permission check, and
kern_mlock(), which does the resource limit checking and calls
vm_map_wire(). Split munlock() in a similar way.

Enable the RLIMIT_MEMLOCK checking code in kern_mlock().

Replace calls to vslock() and vsunlock() in the sysctl code with
calls to kern_mlock() and kern_munlock() so that the sysctl code
will obey the wired memory limits.

Nuke the vslock() and vsunlock() implementations, which are no
longer used.

Add a member to struct sysctl_req to track the amount of memory
that is wired to handle the request.

Modify sysctl_wire_old_buffer() to return an error if its call to
kern_mlock() fails. Only wire the minimum of the length specified
in the sysctl request and the length specified in its argument list.
It is recommended that sysctl handlers that use sysctl_wire_old_buffer()
should specify reasonable estimates for the amount of data they
want to return so that only the minimum amount of memory is wired
no matter what length has been specified by the request.

Modify the callers of sysctl_wire_old_buffer() to look for the
error return.

Modify sysctl_old_user to obey the wired buffer length and clean up
its implementation.

Reviewed by: bms


# 126094 21-Feb-2004 phk

Check for NODEV return from udev2dev()


# 126082 21-Feb-2004 phk

Device megapatch 6/6:

This is what we came here for: Hang dev_t's from their cdevsw,
refcount cdevsw and dev_t and generally keep track of things a lot
better than we used to:

Hold a cdevsw reference around all entrances into the device driver,
this will be necessary to safely determine when we can unload driver
code.

Hold a dev_t reference while the device is open.

KASSERT that we do not enter the driver on a non-referenced dev_t.

Remove old D_NAG code, anonymous dev_t's are not a problem now.

When destroy_dev() is called on a referenced dev_t, move it to
dead_cdevsw's list. When the refcount drops, free it.

Check that cdevsw->d_version is correct. If not, set all methods
to the dead_*() methods to prevent entrance into driver. Print
warning on console to this effect. The device driver may still
explode if it is also incompatible with newbus, but in that case
we probably didn't get this far in the first place.


# 126081 21-Feb-2004 phk

Device megapatch 5/6:

Remove the unused second argument from udev2dev().

Convert all remaining users of makedev() to use udev2dev(). The
semantic difference is that udev2dev() will only locate a pre-existing
dev_t, it will not line makedev() create a new one.

Apart from the tiny well controlled windown in D_PSEUDO drivers,
there should no longer be any "anonymous" dev_t's in the system
now, only dev_t's created with make_dev() and make_dev_alias()


# 124162 05-Jan-2004 kan

More style fixes.

Obtained from: bde


# 124148 05-Jan-2004 kan

style(9):

Add empty line before first code line in functions with no local
variables.
Properly terminate comment sentences.
Indent lines which are longer that 80 characters.
Move v_addpollinfo closer to the rest of poll-related functions.
Move DEBUG_VFS_LOCKS ifdefed block to the end of file.

Obtained from: bde (partly)


# 124118 04-Jan-2004 kan

Cosmetics: strip '\n' from a string passed to Debugger().


# 123932 28-Dec-2003 bde

v_vxproc was a bogus name for a thread (pointer).


# 123569 16-Dec-2003 jeff

- In vget() if LK_NOWAIT is specified we should return EBUSY and not ENOENT.

Submitted by: Stephan Uphoff <ups@stups.com>


# 123568 16-Dec-2003 jeff

- When doing a forced unmount, VFS attempts to keep VCHR vnodes valid by
reassigning their v_ops field to specfs, detaching from the mountpoint, etc.
However, this is not sufficient. If we vclean() the vnode the pages owned
by the vnode are lost, potentially while buffers reference them. Implement
parts of vclean() seperately in vgonechrl() so that the pages and bufs
associated with a device vnode are not destroyed while in use.


# 123072 30-Nov-2003 jeff

- Don't forget to unlock the vnode interlock in the LK_NOWAIT case.

Submitted by: Stephan Uphoff <ups@stups.com>
Approved by: re (rwatson)


# 122352 09-Nov-2003 tanimura

- Implement selwakeuppri() which allows raising the priority of a
thread being waken up. The thread waken up can run at a priority as
high as after tsleep().

- Replace selwakeup()s with selwakeuppri()s and pass appropriate
priorities.

- Add cv_broadcastpri() which raises the priority of the broadcast
threads. Used by selwakeuppri() if collision occurs.

Not objected in: -arch, -current


# 122091 05-Nov-2003 kan

Remove mntvnode_mtx and replace it with per-mountpoint mutex.
Introduce two new macros MNT_ILOCK(mp)/MNT_IUNLOCK(mp) to
operate on this mutex transparently.

Eventually new mutex will be protecting more fields in
struct mount, not only vnode list.

Discussed with: jeff


# 121441 23-Oct-2003 wollman

Add appropriate const poisoning to the assert_*locked() family so that I can
call ASSERT_VOP_LOCKED(vp, __func__) without a diagnostic.

Inspired by: the evil and rude OpenAFS cache manager code


# 121287 20-Oct-2003 alc

Initialize the buf's b_object in pbgetvp(). Clear it in pbrelvp(). (This
facilitates synchronization of the vm page's valid field using the
vm object's lock.)

Suggested by: tegge


# 121157 17-Oct-2003 phk

Simplify count_dev()


# 121038 12-Oct-2003 phk

Simplify vn_isdisk() a bit.


# 121012 11-Oct-2003 jeff

- Fix a typo, I meant & and not |. This was causing lockups from the syncer
looping forever due to list corruption.

Solved by: tegge


# 120791 05-Oct-2003 jeff

- Fix an XXX. Check the error of vn_lock() in vflush(). Don't specify
LK_RETRY either, we don't want this vnode if it turns into another.
- Remove the code that checks the mount point after acquiring the lock
we are guaranteed to either fail or get the vnode that we wanted.


# 120780 05-Oct-2003 jeff

- Rename vcanrecycle() to vtryrecycle() to reflect its new role.
- In vtryrecycle() try to vgonel the vnode if all of the previous checks
passed. We won't vgonel if someone has either acquired a hold or usecount
or started the vgone process elsewhere. This is because we may have been
removed from the free list while we were inspecting the vnode for
recycling.
- The VI_TRYLOCK stops two threads from entering getnewvnode() and recycling
the same vnode. To further reduce the likelyhood of this event, requeue
the vnode on the tail of the list prior to calling vtryrecycle(). We can
not actually remove the vnode from the list until we know that it's
going to be recycled because other interlock holders may see the VI_FREE
flag and try to remove it from the free list.
- Kill a bogus XXX comment. If XLOCK is set we shouldn't wait for it
regardless of MNT_WAIT because the vnode does not actually belong to
this filesystem.


# 120779 05-Oct-2003 jeff

- Don't cache_purge() in getnewvnode. It's done in vclean(). With this
purge, the purge in vclean, and the filesystems purge, we had 3 purges
per vnode.
- Move the insmntque(vp, 0) to vclean() so that we may remove it from the
two vgone() functions and reduce the number of lock operations required.


# 120773 05-Oct-2003 jeff

- Solve a LOR with the sync_mtx by using the VI_ONWORKLST flag to determine
whether or not the sync failed. This could potentially get set between
the time that we VOP_UNLOCK and VI_LOCK() but the race would harmelssly
lead to the sync being delayed by an extra 30 seconds. If we do not move
the vnode it could cause an endless loop if it continues to fail to sync.
- Use vhold and vdrop to stop the vnode from changing identities while we
have it unlocked. Other internal vfs lists are likely to follow this
scheme.


# 120771 05-Oct-2003 jeff

- Move the xlock 'locking' code into vx_lock() and vx_unlock().
- Create a new function, vgonechrl(), which performs vgone for an in-use
character device. Move the code from vflush() that did this into
vgonechrl().
- Hold the xlock across the entirety of vgonel() and vgonechrl() so that
at no point will an invalid vnode exist on any list without XLOCK set.
- Move the xlock code out of vclean() now that it is in the vgone*()
functions.


# 120756 04-Oct-2003 jeff

- In sched_sync() test our preconditions prior to dropping the sync_mtx.
This is so that we may grab the interlock while still holding the
sync_mtx. We have to VI_TRYLOCK() because in all other cases the lock
order runs the other way.
- If we don't meet any of the preconditions, reinsert the vp into the
list for the next second.
- We don't need to panic if we fail to sync here because each FSYNC
function handles this case. Removing this redundant code also
simplifies locking.


# 120746 04-Oct-2003 jeff

- In a Giantless world, the vn_lock() in vcanrecycle() could legitimately
fail. Remove the panic from that case and document why it might fail.
- Document the reason for calling cache_purge() on a newly created vnode.
- In insmntque() order the operations so that we can call mtx_unlock()
one fewer times. This makes the code somewhat clearer as well.
- Add XXX comments in sched_sync() and vflush().
- In vget(), do not sleep while waiting for XLOCK to clear if LK_NOWAIT is
set.
- In vclean() we don't need to acquire a lock around a single TAILQ_FIRST
call. It's ok if we race here, the vinvalbuf will just do nothing.
- Increase the scope of the lock in vgonel() to reduce the number of lock
operations that are performed.


# 120271 20-Sep-2003 jeff

- In reassignbuf() don't unlock vp and lock newvp if they are the same.
Doing so creates a race where the buf is on neither list.
- Only vfree() in an error case in vclean() if VSHOULDFREE() thinks we
should.
- Convert the error case in vclean() to INVARIANTS from DIAGNOSTIC as this
really should not happen and is fast to check.


# 120265 19-Sep-2003 jeff

- Remove spls(). The locking that has replaced them is in place and they
no longer serve as guidelines for future work.


# 120241 19-Sep-2003 kan

Eliminate one case of VI_UNLOCK followed by an immediate
VI_LOCK.


# 118607 07-Aug-2003 jhb

Consistently use the BSD u_int and u_short instead of the SYSV uint and
ushort. In most of these files, there was a mixture of both styles and
this change just makes them self-consistent.

Requested by: bde (kern_ktrace.c)


# 117879 22-Jul-2003 phk

Revert stuff which accidentally ended up in the previous commit.


# 117878 22-Jul-2003 phk

Don't attempt to inline large functions mb_alloc() and mb_free(),
it more than doubles the text size of this file.

GCC has wisely ignored us on this previously


# 116182 11-Jun-2003 obrien

Use __FBSDID().


# 115536 31-May-2003 phk

Remove unused variable and now unbalanced call to splbio();

Found by: FlexeLint


# 115266 23-May-2003 alc

Make the maximum number of vnodes a function of both the physical memory
size and the kernel's heap size, specifically, vm_kmem_size. This
function allows a maximum of 40% of the vm_kmem_size to be used for
vnodes and vm objects. This is a conservative bound based upon recent
problem reports. (In other words, a slight increase in this percentage
may be safe.)

Finally, machines with less than ~3GB of RAM should be unaffected
by this change, i.e., the maximum number of vnodes should remain
the same. If necessary, machines with 3GB or more of RAM can increase
the maximum number of vnodes by increasing vm_kmem_size.

Desired by: scottl
Tested by: jake
Approved by: re (rwatson,scottl)


# 115076 16-May-2003 truckman

Detect that a vnode has been reclaimed while vflush() was waiting to lock
the vnode and restart the loop. Vflush() is vulnerable since it does not
hold a reference to the vnode and it holds no other locks while waiting
for the vnode lock. The vnode will no longer be on the list when the
loop is restarted.

Approved by: re (rwatson)


# 114968 13-May-2003 alc

Optimize the use of splay in gbincore(). During a "make buildworld" the
desired buffer is found at one of the roots more than 60% of the time.
Thus, checking both roots before performing either splay eliminates
unnecessary splays on the first tree splayed.

Approved by: re (jhb)


# 114945 12-May-2003 rwatson

Remove bogus locking from DDB's "show lockedvnods" command: using
synchronization primitives from inside DDB is generally a bad idea,
and in this case it frequently results in panics due to DDB commands
being executed from the sio fast interrupt context on a serial
console. Replace the locking with a note that a lack of locking
means that DDB may get see inconsistent views of the mount and vnode
lists, which could also result in a panic. More frequently,
though, this avoids a panic than causes it.

Discussed with ages ago: bde
Approved by: re (scottl)


# 114570 03-May-2003 alc

- Revert kern/vfs_subr.c revision 1.444. The vm_object's size isn't
trustworthy for vnode-backed objects.
- Restore the old behavior of vm_object_page_remove() when the end
of the given range is zero. Add a comment to vm_object_page_remove()
regarding this behavior.

Reported by: iedowse


# 114371 01-May-2003 alc

Lock accesses to the vm_object's ref_count and resident_page_count.


# 114091 26-Apr-2003 alc

Various changes to vm_object_page_remove():
- Eliminate an odd, special-case feature:
if start == end == 0 then all pages are removed. Only one caller
used this feature and that caller can trivially pass the object's
size.
- Assert that the vm_object is locked on entry; don't bother testing
for a NULL vm_object.
- Style: Fix lines that are longer than 80 characters.


# 114074 26-Apr-2003 alc

- Convert vm_object_pip_wait() from using tsleep() to msleep().
- Make vm_object_pip_sleep() static.
- Lock the vm_object when performing vm_object_pip_wait().


# 113955 24-Apr-2003 alc

- Acquire the vm_object's lock when performing vm_object_page_clean().
- Add a parameter to vm_pageout_flush() that tells vm_pageout_flush()
whether its caller has locked the vm_object. (This is a temporary
measure to bootstrap vm_object locking.)


# 113671 18-Apr-2003 alc

Update locking around vm_object_page_remove() to use the new macros.


# 113426 13-Apr-2003 alc

Use vm_object_pip_wait() rather than reimplementing it.


# 112693 26-Mar-2003 tegge

Adjust the number of vnodes scanned by vlrureclaim() according to the
size of the vnode list.


# 112495 22-Mar-2003 yar

We shouldn't assert that a vode is locked in vop_lock_post()
if VOP_LOCK() has failed.

Reviewed by: jeff


# 112182 13-Mar-2003 jeff

- Remove a dead check for bp->b_vp == vp in vtruncbuf(). This has not been
possible for some time.
- Lock the buf before accessing fields. This should very rarely be locked.
- Assert that B_DELWRI is set after we acquire the buf. This should always
be the case now.


# 112181 13-Mar-2003 jeff

- Remove a race between fsync like functions and flushbufqueues() by
requiring locked bufs in vfs_bio_awrite(). Previously the buf could
have been written out by fsync before we acquired the buf lock if it
weren't for giant. The cluster_wbuild() handles this race properly but
the single write at the end of vfs_bio_awrite() would not.
- Modify flushbufqueues() so there is only one copy of the loop. Pass a
parameter in that says whether or not we should sync bufs with deps.
- Call flushbufqueues() a second time and then break if we couldn't find
any bufs without deps.


# 111937 06-Mar-2003 alc

Remove ENABLE_VFS_IOOPT. It is a long unfinished work-in-progress.

Discussed on: arch@


# 111841 03-Mar-2003 njl

Finish cleanup of vprint() which was begun with changing v_tag to a string.
Remove extraneous uses of vop_null, instead defering to the default op.
Rename vnode type "vfs" to the more descriptive "syncer".
Fix formatting for various filesystems that use vop_print.


# 111723 02-Mar-2003 jeff

- Hold the vnode interlock across calls to bgetvp instead of acquiring it
internally. This is required to stop multiple bufs from being associated
with a single lblkno.


# 111694 01-Mar-2003 jeff

- gc USE_BUFHASH. The smp locking of the buf cache renders this useless.


# 111466 25-Feb-2003 mckusick

Prevent large files from monopolizing the system buffers. Keep
track of the number of dirty buffers held by a vnode. When a
bdwrite is done on a buffer, check the existing number of dirty
buffers associated with its vnode. If the number rises above
vfs.dirtybufthresh (currently 90% of vfs.hidirtybuffers), one
of the other (hopefully older) dirty buffers associated with
the vnode is written (using bawrite). In the event that this
approach fails to curb the growth in it the vnode's number of
dirty buffers (due to soft updates rollback dependencies),
the more drastic approach of doing a VOP_FSYNC on the vnode
is used. This code primarily affects very large and actively
written files such as snapshots. This change should eliminate
hanging when taking snapshots or doing background fsck on
very large filesystems.

Hopefully, one day it will be possible to cache filesystem
metadata in the VM cache as is done with file data. As it
stands, only the buffer cache can be used which limits total
metadata storage to about 20Mb no matter how much memory is
available on the system. This rather small memory gets badly
thrashed causing a lot of extra I/O. For example, taking a
snapshot of a 1Tb filesystem minimally requires about 35,000
write operations, but because of the cache thrashing (we only
have about 350 buffers at our disposal) ends up doing about
237,540 I/O's thus taking twenty-five minutes instead of four
if it could run entirely in the cache.

Reported by: Attila Nagy <bra@fsn.hu>
Sponsored by: DARPA & NAI Labs.


# 111463 25-Feb-2003 jeff

- Add an interlock argument to BUF_LOCK and BUF_TIMELOCK.
- Remove the buftimelock mutex and acquire the buf's interlock to protect
these fields instead.
- Hold the vnode interlock while locking bufs on the clean/dirty queues.
This reduces some cases from one BUF_LOCK with a LK_NOWAIT and another
BUF_LOCK with a LK_TIMEFAIL to a single lock.

Reviewed by: arch, mckusick


# 111333 23-Feb-2003 phk

Bracket the kern.vnode sysctl in #ifdef notyet because it results
in massive locking issues on diskless systems.

It is also not clear that this sysctl is non-dangerous in its
requirements for locked down memory on large RAM systems.


# 111119 19-Feb-2003 imp

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 109623 21-Jan-2003 alfred

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 108399 29-Dec-2002 iedowse

Add a new vnode flag VI_DOINGINACT to indicate that a VOP_INACTIVE
call is in progress on the vnode. When vput() or vrele() sees a
1->0 reference count transition, it now return without any further
action if this flag is set. This flag is necessary to avoid recursion
into VOP_INACTIVE if the filesystem inactive routine causes the
reference count to increase and then drop back to zero. It is also
used to guarantee that an unlocked vnode will not be recycled while
blocked in VOP_INACTIVE().

There are at least two cases where the recursion can occur: one is
that the softupdates code called by ufs_inactive() via ffs_truncate()
can call vput() on the vnode. This has been reported by many people
as "lockmgr: draining against myself" panics. The other case is
that nfs_inactive() can call vget() and then vrele() on the vnode
to clean up a sillyrename file.

Reviewed by: mckusick (an older version of the patch)


# 108390 29-Dec-2002 phk

Use a timeout of one second while we wait for the vnode washer,
this prevents a potential race and makes the system a little bit
less jerky under extreme loads.


# 108388 29-Dec-2002 phk

Vnodes pull in 800-900 bytes these days, all things counted, so we need
to treat desiredvnodes much more like a limit than as a vague concept.

On a 2GB RAM machine where desired vnodes is 130k, we run out of
kmem_map space when we hit about 190k vnodes.

If we wake up the vnode washer in getnewvnode(), sleep until it is done,
so that it has a chance to offer us a washed vnode. If we don't sleep
here we'll just race ahead and allocate yet a vnode which will never
get freed.

In the vnodewasher, instead of doing 10 vnodes per mountpoint per
rotation, do 10% of the vnodes distributed evenly across the
mountpoints.


# 108367 28-Dec-2002 phk

KASSERT that vop_revoke() gets a VCHR.


# 107891 15-Dec-2002 alc

Perform vm_object_lock() and vm_object_unlock() around
vm_object_page_remove().


# 107676 08-Dec-2002 alc

To avoid lock order reversals in getnewvnode(), the call to uma_zfree()
must be delayed until the vnode interlock is released.

Reported by: kris@
Approved by: re (jhb)


# 107319 27-Nov-2002 robert

Do not set a variable (vp->p_pollinfo) to NULL if we know
it already has that value.

Approved by: re


# 105988 26-Oct-2002 rwatson

Slightly change the semantics of vnode labels for MAC: rather than
"refreshing" the label on the vnode before use, just get the label
right from inception. For single-label file systems, set the label
in the generic VFS getnewvnode() code; for multi-label file systems,
leave the labeling up to the file system. With UFS1/2, this means
reading the extended attribute during vfs_vget() as the inode is
pulled off disk, rather than hitting the extended attributes
frequently during operations later, improving performance. This
also corrects sematics for shared vnode locks, which were not
previously present in the system. This chances the cache
coherrency properties WRT out-of-band access to label data, but in
an acceptable form. With UFS1, there is a small race condition
during automatic extended attribute start -- this is not present
with UFS2, and occurs because EAs aren't available at vnode
inception. We'll introduce a work around for this shortly.

Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 105914 25-Oct-2002 phk

In vrele() we can actually have a VCHR with v_rdev == NULL if we
came from the bottom of addaliasu(). Don't panic.


# 105902 25-Oct-2002 mckusick

Within ufs, the ffs_sync and ffs_fsync functions did not always
check for and/or report I/O errors. The result is that a VFS_SYNC
or VOP_FSYNC called with MNT_WAIT could loop infinitely on ufs in
the presence of a hard error writing a disk sector or in a filesystem
full condition. This patch ensures that I/O errors will always be
checked and returned. This patch also ensures that every call to
VFS_SYNC or VOP_FSYNC with MNT_WAIT set checks for and takes
appropriate action when an error is returned.

Sponsored by: DARPA & NAI Labs.


# 105894 24-Oct-2002 phk

Fix the spechash lock order reversal by keeping an updated sum
of v_usecount in the dev_t which vcount() can return without
locking any vnodes.

Seen by: jhb


# 105121 14-Oct-2002 mckusick

When scanning the freelist looking for candidate vnodes to recycle,
be sure to exit the loop with vp == NULL if no candidates are found.
Formerly, this bug would cause the last vnode inspected to be used,
even if it was not available. The result was a panic "vn_finished_write:
neg cnt".

Sponsored by: DARPA & NAI Labs.


# 105119 14-Oct-2002 mckusick

Unconditionally reset vp->v_vnlock back to the default in the
vclean() function (e.g., vp->v_vnlock = &vp->v_lock) rather
than requiring filesystems that use alternate locks to do so
in their vop_reclaim functions. This change is a further cleanup
of the vop_stdlock interface.

Submitted by: Poul-Henning Kamp <phk@critter.freebsd.dk>
Sponsored by: DARPA & NAI Labs.


# 105077 14-Oct-2002 mckusick

Regularize the vop_stdlock'ing protocol across all the filesystems
that use it. Specifically, vop_stdlock uses the lock pointed to by
vp->v_vnlock. By default, getnewvnode sets up vp->v_vnlock to
reference vp->v_lock. Filesystems that wish to use the default
do not need to allocate a lock at the front of their node structure
(as some still did) or do a lockinit. They can simply start using
vn_lock/VOP_UNLOCK. Filesystems that wish to manage their own locks,
but still use the vop_stdlock functions (such as nullfs) can simply
replace vp->v_vnlock with a pointer to the lock that they wish to
have used for the vnode. Such filesystems are responsible for
setting the vp->v_vnlock back to the default in their vop_reclaim
routine (e.g., vp->v_vnlock = &vp->v_lock).

In theory, this set of changes cleans up the existing filesystem
lock interface and should have no function change to the existing
locking scheme.

Sponsored by: DARPA & NAI Labs.


# 104829 11-Oct-2002 mckusick

When considering a vnode for reuse in getnewvnode, we call
vcanrecycle to check a free vnode's availability. If it is
available, vcanrecycle returns an error code of zero and the
vnode in question locked. The getnewvnode routine then used
to call vn_start_write with the V_NOWAIT flag. If the filesystem
was suspended while taking a snapshot, the vn_start_write would
fail but getnewvnode would fail to unlock the vnode, instead
leaving it locked on the freelist. The result would be that the
vnode would be locked forever and would eventually hang the
system with a race to the root when it was attempted to recycle
it. This fix moves the vn_start_write check into vcanrecycle
where it will properly unlock the vnode if it is unavailable
for recycling due to filesystem suspension.

Sponsored by: DARPA & NAI Labs.


# 104509 05-Oct-2002 sobomax

Fix problem introduced in rev.1.406, which can cause already unlocked
mutex being unlocked again causing system panic.


# 104302 01-Oct-2002 phk

Fix some harmless mis-indents.

Spotted by: FlexeLint


# 104237 30-Sep-2002 rwatson

Move vnode MAC label initialization to after the release of the vnode
interlock in getnewvnode() to avoid possible sleeps while holding
the mutex. Note that the warning from Witness is a slight false
positive since we know there will be no contention on the interlock
since we haven't made the vnode available for use yet, but the theory
is not a bad one.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 104094 28-Sep-2002 phk

Be consistent about "static" functions: if the function is marked
static in its prototype, mark it static at the definition too.

Inspired by: FlexeLint warning #512


# 103986 26-Sep-2002 jeff

- Move ASSERT_VOP_*LOCK* functionality into functions in vfs_subr.c
- Make the VI asserts more orthogonal to the rest of the asserts by using a
new, common vfs_badlock() function and adding a 'str' arg.
- Adjust generated ASSERTS to match the new prototype.
- Adjust explicit ASSERTS to match the new prototype.


# 103933 25-Sep-2002 jeff

- Lock down the syncer with sync_mtx.
- Enable vfs_badlock_mutex by default.
- Assert that the vp is locked in VOP_UNLOCK.
- Use standard interlock macros in remaining code.
- Correct a race in getnewvnode().
- Lock access to v_numoutput with interlock.
- Lock access to buf lists and splay tree with interlock.
- Add VOP and VI asserts.
- Lock b_vnbufs with the vnode interlock.
- Add vrefcnt() for callers who want to retreive the vnode ref without
holding a lock. Add a comment that describes when this is safe.
- Add vholdl() and vdropl() so that callers who already own the interlock
can avoid race conditions and unnecessary unlocking.
- Move the VOP_GETATTR() in vflush() into the WRITECLOSE conditional case.
- Hold the interlock before droping the mntlist_mtx in vflush() to avoid
a race.
- Fix locking in vfs_msync().


# 103559 18-Sep-2002 njl

Remove any VOP_PRINT that redundantly prints the tag.
Move lockmgr_printinfo() into vprint() for everyone's benefit.

Suggested by: bde


# 103314 14-Sep-2002 njl

Remove all use of vnode->v_tag, replacing with appropriate substitutes.
v_tag is now const char * and should only be used for debugging.

Additionally:
1. All users of VT_NTS now check vfsconf->vf_type VFCF_NETWORK
2. The user of VT_PROCFS now checks for the new flag VV_PROCDEP, which
is propagated by pseudofs to all child vnodes if the fs sets PFS_PROCDEP.

Suggested by: phk
Reviewed by: bde, rwatson (earlier version)


# 103228 11-Sep-2002 julian

Indentation does not make a block.. need curly braces too.
Submitted by: Eagle-eyes evans <bde@freebsd.org>


# 103216 11-Sep-2002 julian

Completely redo thread states.

Reviewed by: davidxu@freebsd.org


# 102989 05-Sep-2002 phk

Fix an inherited style bug: compare with NOCRED instead of NULL.

Sponsored by: DARPA & NAI Labs.


# 102987 05-Sep-2002 phk

Introduce new extattr_check_cred() function which implements the canonical
crential washing for extended attributes.

Sponsored by: DARPA & NAI Labs.


# 102412 25-Aug-2002 charnier

Replace various spelling with FALLTHROUGH which is lint()able


# 102297 23-Aug-2002 jeff

- Fix a mistake in my last few commits. The PDROP flag stops msleep from
re-acquiring the mutex.

Pointy hat to: me
Noticed by: tegge


# 102255 22-Aug-2002 jeff

- Make vn_lock() vget() and VOP_LOCK() all behave the same way WRT
LK_INTERLOCK. The interlock will never be held on return from these
functions even when there is an error. Errors typically only occur when
the XLOCK is held which means this isn't the vnode we want anyway. Almost
all users of these interfaces expected this behavior even though it was
not provided before.


# 102251 22-Aug-2002 jeff

- Fix interlock handling in vn_lock(). Previously, vn_lock() could return
with interlock held in error conditions when the caller did not specify
LK_INTERLOCK.
- Add several comments to vn_lock() describing the rational behind the code
flow since it was not immediately obvious.


# 102214 21-Aug-2002 jeff

- Document two cases, one in vget and the other in vn_lock, where the state
of interlock on exit is not consistent. There are probably several bugs
relating to this.


# 102211 21-Aug-2002 jeff

- If vn_lock fails with the LK_INTERLOCK flag set, interlock will not be
released. vcanrecycle() failed to unlock interlock under this condition.
- Remove an extra VOP_UNLOCK from a failure case in vcanrecycle().

Pointed out by: rwatson


# 102210 21-Aug-2002 jeff

- Add two new debugging macros: ASSERT_VI_LOCKED and ASSERT_VI_UNLOCKED
- Use the new VI asserts in place of the old mtx_assert checks.
- Add the VI asserts to the automated lock checking in the VOP calls. The
interlock should not be held across vops with a few exceptions.
- Add the vop_(un)lock_{pre,post} functions to assert that interlock is held
when LK_INTERLOCK is set.


# 101769 13-Aug-2002 jeff

- Extend the vnode_free_list_mtx to cover numvnodes and freevnodes. This
was done only some of the time before, and now it is uniformly applied.


# 101651 10-Aug-2002 mux

- Introduce a new struct xvfsconf, the userland version of struct vfsconf.
- Make getvfsbyname() take a struct xvfsconf *.
- Convert several consumers of getvfsbyname() to use struct xvfsconf.
- Correct the getvfsbyname.3 manpage.
- Create a new vfs.conflist sysctl to dump all the struct xvfsconf in the
kernel, and rewrite getvfsbyname() to use this instead of the weird
existing API.
- Convert some {set,get,end}vfsent() consumers to use the new vfs.conflist
sysctl.
- Convert a vfsload() call in nfsiod.c to kldload() and remove the useless
vfsisloadable() and endvfsent() calls.
- Add a warning printf() in vfs_sysctl() to tell people they are using
an old userland.

After these changes, it's possible to modify struct vfsconf without
breaking the binary compatibility. Please note that these changes don't
break this compatibility either.

When bp will have updated mount_smbfs(8) with the patch I sent him, there
will be no more consumers of the {set,get,end}vfsent(), vfsisloadable()
and vfsload() API, and I will promptly delete it.


# 101367 05-Aug-2002 jeff

- Move some logic from getnewvnode() to a new function vcanrecycle()
- Unlock the free list mutex around vcanrecycle to prevent a lock order
reversal.


# 101308 04-Aug-2002 jeff

- Replace v_flag with v_iflag and v_vflag
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
clear.
- Many functions in vfs_subr.c were restructured to provide for stronger
locking.

Idea stolen from: BSD/OS


# 101173 01-Aug-2002 rwatson

Include file cleanup; mac.h and malloc.h at one point had ordering
relationship requirements, and no longer do.

Reminded by: bde


# 101041 31-Jul-2002 des

Nit in previous commit: the correct sysctl type is "S,xvnode"


# 101040 31-Jul-2002 des

Initialize v_cachedid to -1 in getnewvnode().
Reintroduce the kern.vnode sysctl and make it export xvnodes rather than
vnodes.

Sponsored by: DARPA, NAI Labs


# 101009 31-Jul-2002 rwatson

Note that the privilege indicating flag to vaccess() originally used
by the process accounting system is now deprecated.


# 101008 31-Jul-2002 rwatson

Introduce support for Mandatory Access Control and extensible
kernel access control.

Invoke the necessary MAC entry points to maintain labels on vnodes.
In particular, initialize the label when the vnode is allocated or
reused, and destroy the label when the vnode is going to be released,
or reused. Wow, an object where there really is exactly one place
where it's allocated, and one other where it's freed. Amazing.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 100863 29-Jul-2002 jeff

- Backout the patch made in revision 1.75 of vfs_mount.c. The vputs here
were hiding the real problem of the missing unlock in sync_inactive.
- Add the missing unlock in sync_inactive.

Submitted by: iedowse


# 100831 28-Jul-2002 truckman

Wire the sysctl output buffer before grabbing any locks to prevent
SYSCTL_OUT() from blocking while locks are held. This should
only be done when it would be inconvenient to make a temporary copy of
the data and defer calling SYSCTL_OUT() until after the locks are
released.


# 100481 22-Jul-2002 rwatson

Teach discretionary access control methods for files about VAPPEND
and VALLPERM.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 100344 19-Jul-2002 mckusick

Add support to UFS2 to provide storage for extended attributes.
As this code is not actually used by any of the existing
interfaces, it seems unlikely to break anything (famous
last words).

The internal kernel interface to manipulate these attributes
is invoked using two new IO_ flags: IO_NORMAL and IO_EXT.
These flags may be specified in the ioflags word of VOP_READ,
VOP_WRITE, and VOP_TRUNCATE. Specifying IO_NORMAL means that
you want to do I/O to the normal data part of the file and
IO_EXT means that you want to do I/O to the extended attributes
part of the file. IO_NORMAL and IO_EXT are mutually exclusive
for VOP_READ and VOP_WRITE, but may be specified individually
or together in the case of VOP_TRUNCATE. For example, when
removing a file, VOP_TRUNCATE is called with both IO_NORMAL
and IO_EXT set. For backward compatibility, if neither IO_NORMAL
nor IO_EXT is set, then IO_NORMAL is assumed.

Note that the BA_ and IO_ flags have been `merged' so that they
may both be used in the same flags word. This merger is possible
by assigning the IO_ flags to the low sixteen bits and the BA_
flags the high sixteen bits. This works because the high sixteen
bits of the IO_ word is reserved for read-ahead and help with
write clustering so will never be used for flags. This merge
lets us get away from code of the form:

if (ioflags & IO_SYNC)
flags |= BA_SYNC;

For the future, I have considered adding a new field to the
vattr structure, va_extsize. This addition could then be
exported through the stat structure to allow applications to
find out the size of the extended attribute storage and also
would provide a more standard interface for truncating them
(via VOP_SETATTR rather than VOP_TRUNCATE).

I am also contemplating adding a pathconf parameter (for
concreteness, lets call it _PC_MAX_EXTSIZE) which would
let an application determine the maximum size of the extended
atribute storage.

Sponsored by: DARPA & NAI Labs.


# 100207 17-Jul-2002 mckusick

Change utimes to set the file creation time (for filesystems that
support creation times such as UFS2) to the value of the
modification time if the value of the modification time is older
than the current creation time. See utimes(2) for further details.

Sponsored by: DARPA & NAI Labs.


# 99737 10-Jul-2002 dillon

Replace the global buffer hash table with per-vnode splay trees using a
methodology similar to the vm_map_entry splay and the VM splay that Alan
Cox is working on. Extensive testing has appeared to have shown no
increase in overhead.

Disadvantages
Dirties more cache lines during lookups.

Not as fast as a hash table lookup (but still N log N and optimal
when there is locality of reference).

Advantages
vnode->v_dirtyblkhd is now perfectly sorted, making fsync/sync/filesystem
syncer operate more efficiently.

I get to rip out all the old hacks (some of which were mine) that tried
to keep the v_dirtyblkhd tailq sorted.

The per-vnode splay tree should be easier to lock / SMPng pushdown on
vnodes will be easier.

This commit along with another that Alan is working on for the VM page
global hash table will allow me to implement ranged fsync(), optimize
server-side nfs commit rpcs, and implement partial syncs by the
filesystem syncer (aka filesystem syncer would detect that someone is
trying to get the vnode lock, remembers its place, and skip to the
next vnode).

Note that the buffer cache splay is somewhat more complex then other splays
due to special handling of background bitmap writes (multiple buffers with
the same lblkno in the same vnode), and B_INVAL discontinuities between the
old hash table and the existence of the buffer on the v_cleanblkhd list.

Suggested by: alc


# 99690 09-Jul-2002 jeff

- Use standard locking functions in syncer's opv
- vput instead of vrele syncer vnodes in vfs_mount
- Add vop_lookup_{pre,post} to verify locking in VOP_LOOKUP


# 99515 07-Jul-2002 jeff

- Don't hold the vn lock while calling VOP_CLOSE in vclean().


# 99513 07-Jul-2002 jeff

- BUF_REFCNT() seems to be the preferred method for verifying a locked buf.
Tell vop_strategy_pre() to use this instead.
- Ignore B_CLUSTER bufs. Their components are locked but they don't really
exist so they don't have to be. This isn't ideal but it is safe.


# 99508 06-Jul-2002 jeff

Fix a mistake in my last commit. Don't grab an extra reference to the object
in bp->b_object.


# 99489 06-Jul-2002 jeff

Fixup uses of GETVOBJECT.

- Cache a pointer to the vnode's object in the buf.
- Hold a reference to that object in addition to the vnode's reference just
to be consistent.
- Cleanup code that got the object indirectly through the vp and VOP calls.

This fixes at least one case where we were calling GETVOBJECT without a lock.
It also avoids an expensive layered call at the cost of another pointer in
struct buf.


# 99485 06-Jul-2002 jeff

- Add vop_strategy_pre to validate VOP_STRATEGY locking.
- Disable original vop_strategy lock specification.
- Switch to the new vop_strategy_pre for lock validation.

VOP_STRATEGY requires only that the buf is locked UNLESS the block numbers need
to be translated. There may be other reasons, but as long as the underlying
layer uses a VOP to perform the operations they will be caught later.


# 99483 06-Jul-2002 jeff

Add "vop_rename_pre" to do pre rename lock verification. This is enabled only
with DEBUG_VFS_LOCKS.


# 99338 03-Jul-2002 mux

Move vfs_rootmountalloc() in vfs_mount.c and remove lite2_vfs_mountroot()
which was #if 0'd and is not likely to be used now.


# 99264 02-Jul-2002 mux

Move every code related to mount(2) in a new file, vfs_mount.c.
The file vfs_conf.c which was dealing with root mounting has
been repo-copied into vfs_mount.c to preserve history.
This makes nmount related development easier, and help reducing
the size of vfs_syscalls.c, which is still an enormous file.

Reviewed by: rwatson
Repo-copy by: peter


# 99220 01-Jul-2002 iedowse

Use indirect function pointer hooks instead of #ifdef SOFTUPDATES
direct calls for the two places where the kernel calls into soft
updates code. Set up the hooks in softdep_initialize() and NULL
them out in softdep_uninitialize(). This change allows soft updates
to function correctly when ufs is loaded as a module.

Reviewed by: mckusick


# 99021 29-Jun-2002 obrien

Rename the db command lockedvnodes to lockedvnods so that it fits on the
help screen and one doens't think we have a lockedvnodesmap command.


# 98994 28-Jun-2002 alfred

nuke caddr_t.


# 98985 28-Jun-2002 jeff

Improve the VOP locking asserts

- Add vfs_badlock_print to control whether or not we print lock violations
- Add vfs_badlock_panic to control whether we panic on lock violations

Both default to on to mimic the original behavior if DEBUG_VFS_LOCKS is on.


# 98979 28-Jun-2002 green

Fix a case where a vnode got explicitly unlocked after the pointer to it
got set to NULL.

Revision 1.355: in the box


# 98510 20-Jun-2002 mux

Change the way we internally store the mount options to
a linked list. This is to allow the merging of the mount
options in the MNT_UPDATE case, as the current data structure
is unsuitable for this.

There are no functional differences in this commit.

Reviewed by: phk


# 98233 14-Jun-2002 mux

Change vfs_copyopt() so that the length argument passed to it
must be the exact same size as the mount option. This makes
vfs_copyopt() much more useful.


# 97935 06-Jun-2002 des

Move some sysctls from the debug tree to the vfs tree.


# 97934 06-Jun-2002 des

Gratuitous whitespace cleanup.


# 96755 16-May-2002 trhodes

More s/file system/filesystem/g


# 96744 16-May-2002 mux

o Fix vfs_copyopt(), the first argument to bcopy() is the source,
not the destination.
o Remove some code from vfs_getopt() which was making the interface
more complicated to use for a very slight gain.


# 96145 07-May-2002 jeff

Switch from just holding the interlock to holding the standard lock throughout
getnewvnode(). This is safer. In the future, we should investigate requiring
only the interlock to get the vnode object.


# 96094 06-May-2002 jeff

Hold the currently selected vnode's lock across the call to VOP_GETVOBJECT.

Don't try to create a vm object before the file system has a chance to finish
initializing it. This is incorrect for a number of reasons. Firstly, that
VOP requires a lock which the file system may not have initialized yet. Also,
open and others will create a vm object if it is necessary later.


# 96073 05-May-2002 phk

Expand the one-line function pbreassignbuf() the only place it is or could
be used.


# 96034 04-May-2002 dillon

Remove obsolete code (that was already #if 0'd out).
Requested by: Hiten Pandya <hitmaster2k@yahoo.com>


# 93818 04-Apr-2002 jhb

Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on: i386, alpha, sparc64


# 93593 01-Apr-2002 jhb

Change the suser() API to take advantage of td_ucred as well as do a
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.

Discussed on: smp@


# 93228 26-Mar-2002 mux

As discussed in -arch, add the new nmount(2) system call and the
new vfs_getopt()/vfs_copyopt() API. This is intended to be used
later, when there will be filesystems implementing the VFS_NMOUNT
operation. The mount(2) system call will disappear when all
filesystems will be converted to the new API. Documentation will
be committed in a while.

Reviewed by: phk


# 92751 20-Mar-2002 jeff

Remove references to vm_zone.h and switch over to the new uma API.

Also, remove maxsockets. If you look carefully you'll notice that the old
zone allocator never honored this anyway.


# 92723 19-Mar-2002 alfred

Remove __P.


# 91709 05-Mar-2002 rwatson

Three p_ucred -> td_ucred's missed in jhb's earlier pass; all appear to
be safe.


# 91406 27-Feb-2002 jhb

Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.


# 90860 18-Feb-2002 phk

Make v_addpollinfo() visible and non-inline.
Have callers only call it as needed.
Add necessary call in ufs_kqfilter().

Test-case found by: Andrew Gallatin <gallatin@cs.duke.edu>


# 90835 18-Feb-2002 phk

Remove yet a redundant VN_KNOTE() macro.


# 90791 17-Feb-2002 phk

Move the stuff related to select and poll out of struct vnode.
The use of the zone allocator may or may not be overkill.
There is an XXX: over in ufs/ufs/ufs_vnops.c that jlemon may need
to revisit.

This shaves about 60 bytes of struct vnode which on my laptop means
600k less RAM used for vnodes.


# 90375 07-Feb-2002 peter

Fix a couple of style bugs introduced (or touched by) previous commit.


# 90361 07-Feb-2002 julian

Pre-KSE/M3 commit.
this is a low-functionality change that changes the kernel to access the main
thread of a process via the linked list of threads rather than
assuming that it is embedded in the process. It IS still embeded there
but remove all teh code that assumes that in preparation for the next commit
which will actually move it out.

Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,


# 90100 02-Feb-2002 mckusick

In the routines vrele() and vput(), we must lock the vnode and
call VOP_INACTIVE before placing the vnode back on the free list.
Otherwise there is a race condition on SMP machines between
getnewvnode() locking the vnode to reclaim it and vrele()
locking the vnode to inactivate it. This window of vulnerability
becomes exaggerated in the presence of filesystems that have
been suspended as the inactive routine may need to temporarily
release the lock on the vnode to avoid deadlock with the syncer
process.


# 89526 19-Jan-2002 dillon

Remove 'VXLOCK: interlock avoided' warnings. This can now occur in normal
operation. The vgonel() code has always called vclean() but until we
started proactively freeing vnodes it would never actually be called with
a dirty vnode, so this situation did not occur prior to the vnlru() code.
Now that we proactively free vnodes when kern.maxvnodes is hit, however,
vclean() winds up with work to do and improperly generates the warnings.

Reviewed by: peter
Approved by: re (for MFC)
MFC after: 1 day


# 89384 15-Jan-2002 mckusick

When downgrading a filesystem from read-write to read-only, operations
involving file removal or file update were not always being fully
committed to disk. The result was lost files or corrupted file data.
This change ensures that the filesystem is properly synced to disk
before the filesystem is down-graded.

This delta also fixes a long standing bug in which a file open for
reading has been unlinked. When the last open reference to the file
is closed, the inode is reclaimed by the filesystem. Previously,
if the filesystem had been down-graded to read-only, the inode could
not be reclaimed, and thus was lost and had to be later recovered
by fsck. With this change, such files are found at the time of the
down-grade. Normally they will result in the filesystem down-grade
failing with `device busy'. If a forcible down-grade is done, then
the affected files will be revoked causing the inode to be released
and the open file descriptors to begin failing on attempts to read.

Submitted by: "Sam Leffler" <sam@errno.com>


# 89238 10-Jan-2002 dillon

Add vlruvp() routine - implements LRU operation for vnode recycling.

We calculate a trigger point that both guarentees we will find a
sufficient number of vnodes to recycle and prevents us from recycling
vnodes with lots of resident pages. This particular section of
code is designed to recycle vnodes, not do unnecessary frees of
cached VM pages.


# 88465 25-Dec-2001 dillon

Fix type-o in previous commit (tsleep was using wrong rendezvous point)


# 88318 20-Dec-2001 dillon

Fix a BUF_TIMELOCK race against BUF_LOCK and fix a deadlock in vget()
against VM_WAIT in the pageout code. Both fixes involve adjusting
the lockmgr's timeout capability so locks obtained with timeouts do not
interfere with locks obtained without a timeout.

Hopefully MFC: before the 4.5 release


# 88161 19-Dec-2001 peter

Do not initialize static/global variables to 0. Use bss instead of
taking up space in the data section.


# 88160 19-Dec-2001 peter

Use a different mechanism to get the vnlru process to wake up and notice
the shutdown request at reboot/halt time.
Disable the printf 'vnlru process getting nowhere, pausing...' and instead
export the count to the debug.vnlru_nowhere sysctl.


# 88149 18-Dec-2001 dillon

This is a forward port of Peter's vlrureclaim() fix, with some minor mods
by me to make it more efficient. The original code had serious balancing
problems and could also deadlock easily. This code relegates the vnode
reclamation to its own kproc and relaxes the vnode reclamation requirements
to better maintain kern.maxvnodes. This code still doesn't balance as well
as it could, but it does a much better job then the original code.

Approved by: re@freebsd.org
Obtained from: ps, peter, dillon
MFS Assuming: Assuming no problems crop up in Yahoo testing
MFC after: 7 days


# 87847 14-Dec-2001 dillon

A slightly different version of the vlrureclaim fix.

Reported by: peter, ps


# 87823 13-Dec-2001 peter

If we were called to allocate a vnode that is not associated with a
mount point, do not dereference the NULL mp argument.


# 86037 04-Nov-2001 dillon

Add mnt_reservedvnlist so we can MFC to 4.x, in order to make all mount
structure changes now rather then piecemeal later on. mnt_nvnodelist
currently holds all the vnodes under the mount point. This will eventually
be split into a 'dirty' and 'clean' list. This way we only break kld's once
rather then twice. nvnodelist will eventually turn into the dirty list
and should remain compatible with the klds.


# 85876 02-Nov-2001 rwatson

Merge from POSIX.1e Capabilities development tree:

o POSIX.1e capabilities authorize overriding of VEXEC for VDIR based
on CAP_DAC_READ_SEARCH, but of !VDIR based on CAP_DAC_EXECUTE. Add
appropriate conditionals to vaccess() to take that into account.
o Synchronization cap_check_xxx() -> cap_check() change.

Obtained from: TrustedBSD Project


# 85606 27-Oct-2001 dillon

syncdelay, filedelay, dirdelay, metadelay are ints, not time_t's,
and can also be made static.


# 85517 26-Oct-2001 dillon

Implement kern.maxvnodes. adjusting kern.maxvnodes now actually has a
real effect.

Optimize vfs_msync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. Improves looping case by 500%.

Optimize ffs_sync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. This makes a couple of assumptions,
which I believe are ok, in regards to vnode stability when the mount list
mutex is held. Improves looping case by 500%.

(more optimization work is needed on top of these fixes)

MFC after: 1 week


# 85515 25-Oct-2001 dillon

Add missing TAILQ_INSERT_TAIL's which somehow didn't get comitted with
the recent vnode cleanup.


# 85339 23-Oct-2001 dillon

Change the vnode list under the mount point from a LIST to a TAILQ
in preparation for an implementation of limiting code for kern.maxvnodes.

MFC after: 3 days


# 85036 16-Oct-2001 dillon

fix minor bug in kern.minvnodes sysctl. Use OID_AUTO.


# 84675 08-Oct-2001 dillon

WS Cleanup


# 84561 05-Oct-2001 dillon

vinvalbuf() was only waiting for write-I/O to complete. It really has to
wait for both read AND write I/O to complete. Only NFS calls vinvalbuf()
on an active vnode (when the server indicates that the file is stale), so
this bug fix only effects NFS clients.

MFC after: 3 days


# 84249 01-Oct-2001 dillon

After extensive testing it has been determined that adding complexity
to avoid removing higher level directory vnodes from the namecache has
no perceivable effect and will be removed. This is especially true
when vmiodirenable is turned on, which it is by default now. ( vmiodirenable
makes a huge difference in directory caching ). The vfs.vmiodirenable and
vfs.nameileafonly sysctls have been left in to allow further testing, but
I expect to rip out vfs.nameileafonly soon too.

I have also determined through testing that the real problem with numvnodes
getting too large is due to the VM Page cache preventing the vnode from
being reclaimed. The directory stuff made only a tiny dent relative
to Poul's original code, enough so that some tests succeeded. But tests
with several million small files show that the bigger problem is the VM Page
cache. This will have to be addressed by a future commit.

MFC after: 3 days


# 83366 12-Sep-2001 julian

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 82395 27-Aug-2001 peter

If a file has been completely unlinked, stop automatically syncing the
file. ffs will discard any pending dirty pages when it is closed,
so we may as well not waste time trying to clean them. This doesn't
stop other things from writing it out, eg: pageout, fsync(2) etc.


# 80448 27-Jul-2001 peter

Revert previous accidental commit. FWIW, it was part of enabling
VM caching of disks through mmap() and stopping syncing of open files
that had their last reference in the fs removed (ie: their unsync'ed
pages get discarded on close already, so I made it stop syncing too).


# 80447 27-Jul-2001 peter

Fix cut/paste blunder. Serves me right for doing a last minute tweak
to what I had for some time.

Submitted by: bde


# 79224 04-Jul-2001 dillon

With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.


# 78909 28-Jun-2001 jhb

- Fix a mntvnode and vnode interlock reversal.
- Protect the mnt_vnode list with the mntvnode lock.


# 76827 19-May-2001 alfred

Introduce a global lock for the vm subsystem (vm_mtx).

vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb


# 76688 16-May-2001 iedowse

Change the second argument of vflush() to an integer that specifies
the number of references on the filesystem root vnode to be both
expected and released. Many filesystems hold an extra reference on
the filesystem root vnode, which must be accounted for when
determining if the filesystem is busy and then released if it isn't
busy. The old `skipvp' approach required individual filesystem
xxx_unmount functions to re-implement much of vflush()'s logic to
deal with the root vnode.

All 9 filesystems that hold an extra reference on the root vnode
got the logic wrong in the case of forced unmounts, so `umount -f'
would always fail if there were any extra root vnode references.
Fix this issue centrally in vflush(), now that we can.

This commit also fixes a vnode reference leak in devfs, which could
result in idle devfs filesystems that refuse to unmount.

Reviewed by: phk, bp


# 76486 11-May-2001 iedowse

In vrele() and vput(), avoid triggering the confusing "missed vn_close"
KASSERT when vp->v_usecount is zero or negative. In this case, the
"v*: negative ref cnt" panic that follows is much more appropriate.

Reviewed by: mckusick


# 76051 26-Apr-2001 phk

vfs_subr.c is getting rather fat. The underlying repocopy and this
commit moves the filesystem export handling code to vfs_export.c


# 75934 25-Apr-2001 phk

Move the netexport structure from the fs-specific mountstructure
to struct mount.

This makes the "struct netexport *" paramter to the vfs_export
and vfs_checkexport interface unneeded.

Consequently that all non-stacking filesystems can use
vfs_stdcheckexp().

At the same time, make it a pointer to a struct netexport
in struct mount, so that we can remove the bogus AF_MAX
and #include <net/radix.h> from <sys/mount.h>


# 75858 23-Apr-2001 grog

Correct #includes to work with fixed sys/mount.h.


# 75654 18-Apr-2001 tanimura

Reclaim directory vnodes held in namecache if few free vnodes are
available.

Only directory vnodes holding no child directory vnodes held in
v_cache_src are recycled, so that directory vnodes near the root of
the filesystem hierarchy remain in namecache and directory vnodes are
not reclaimed in cascade.

The period of vnode reclaiming attempt and the number of vnodes
attempted to reclaim can be tuned via sysctl(2).

Suggested by: tegge
Approved by: phk


# 75580 17-Apr-2001 phk

This patch removes the VOP_BWRITE() vector.

VOP_BWRITE() was a hack which made it possible for NFS client
side to use struct buf with non-bio backing.

This patch takes a more general approach and adds a bp->b_op
vector where more methods can be added.

The success of this patch depends on bp->b_op being initialized
all relevant places for some value of "relevant" which is not
easy to determine. For now the buffers have grown a b_magic
element which will make such issues a tiny bit easier to debug.


# 72956 23-Feb-2001 jlemon

Add a NOTE_REVOKE flag for vnodes, which is triggered from within vclean().
Use this to tell a filter attached to a vnode that the underlying vnode is
no longer valid, by returning EV_EOF.

PR: kern/25309, kern/25206


# 72650 18-Feb-2001 green

Switch to using a struct xucred instead of a struct xucred when not
actually in the kernel. This structure is a different size than
what is currently in -CURRENT, but should hopefully be the last time
any application breakage is caused there. As soon as any major
inconveniences are removed, the definition of the in-kernel struct
ucred should be conditionalized upon defined(_KERNEL).

This also changes struct export_args to remove dependency on the
constantly-changing struct ucred, as well as limiting the bounds
of the size fields to the correct size. This means: a) mountd and
friends won't break all the time, b) mountd and friends won't crash
the kernel all the time if they don't know what they're doing wrt
actual struct export_args layout.

Reviewed by: bde


# 72200 09-Feb-2001 bmilekic

Change and clean the mutex lock interface.

mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarily, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)


# 71999 04-Feb-2001 phk

Mechanical change to use <sys/queue.h> macro API instead of
fondling implementation details.

Created with: sed(1)
Reviewed by: md5(1)


# 71860 31-Jan-2001 bp

Properly lock new vnode.

Reminded by: tegge


# 71576 24-Jan-2001 jasone

Convert all simplelocks to mutexes and remove the simplelock implementations.


# 71411 23-Jan-2001 rwatson

o The move to using VADMIN under vaccess() resulted in some system
calls returning EACCES instead of EPERM. This patch modifies vaccess()
to return EPERM instead of EACCES if VADMIN is among the requested
rights. This affects functions normally limited to the owners of
a file, such as chmod(), as EPERM is the error indicating that
privilege would allow the operation, rather than a chance in mandatory
or discretionary rights.

Reported by: bde


# 70063 15-Dec-2000 jhb

Stick the kthread API in a kthread_* namespace, and the specialized kproc
functions in a kproc_* namespace.

Reviewed by: -arch


# 69950 13-Dec-2000 mckusick

Use proper mutex locking when calling setrunnable from speedup_syncer().

Submitted by: Tor.Egge@fast.no


# 69781 08-Dec-2000 dwmalone

Convert more malloc+bzero to malloc+M_ZERO.

Submitted by: josh@zipperup.org
Submitted by: Robert Drehmel <robd@gmx.net>


# 69664 06-Dec-2000 peter

Untangle vfsinit() a bit. Use seperate sysinit functions rather than
having a super-function calling bits all over the place.


# 69529 02-Dec-2000 gallatin

Correct int/long type mismatch in the proper place this time. freevnodes
and numvnodes are longs in the kernel. They should remain longs in systat,
what really needs to change is that they should be using SYSCTL_LONG rather
than SYSCTL_INT. I also changed wantfreevnodes to SYSCTL_LONG because I
happened to notice it.

I wish there was a way to find all of these automatically..

Pointed out by: bde


# 69436 01-Dec-2000 jhb

Use msleep() instead of mtx_exit()/tsleep() so that we release the lock and
go to sleep as an "atomic" operation.


# 69400 30-Nov-2000 mckusick

Get rid of a bogus mtx_exit (it was attempting to release an
already released mutex).

Submitted by: "Chris Knight" <chris@aims.com.au>


# 68885 18-Nov-2000 dillon

Implement a low-memory deadlock solution.

Removed most of the hacks that were trying to deal with low-memory
situations prior to now.

The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.

Code has been added to stall in a low-memory situation prior to a vnode
being locked.

Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.

Implement a number of VFS/BIO fixes

(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.

In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.

Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.

In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.

There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.

Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>


# 68262 02-Nov-2000 tegge

Clear the VFREE flag when the vnode is removed from the free list in
getnewvnode(). Otherwise routines called from VOP_INACTIVE() might
attempt to remove the vnode from a free list the vnode isn't on,
causing corruption.
PR: 18012


# 68259 02-Nov-2000 phk

Take VBLK devices further out of their missery.

This should fix the panic I introduced in my previous commit on this topic.


# 67365 20-Oct-2000 jhb

Catch up to moving headers:
- machine/ipl.h -> sys/ipl.h
- machine/mutex.h -> sys/mutex.h


# 67309 19-Oct-2000 rwatson

o Introduce new VOP_ACCESS() flag VADMIN, allowing file systems to perform
"administrative" authorization checks. In most cases, the VADMIN test
checks to make sure the credential effective uid is the same as the file
owner.
o Modify vaccess() to set VADMIN as an available right if the uid is
appropriate.
o Modify references to uid-based access control operations such that they
now always invoke VOP_ACCESS() instead of using hard-coded policy checks.
o This allows alternative UFS policies to be implemented by replacing only
ufs_access() (such as mandatory system policies).
o VOP_ACCESS() requires the caller to hold an exclusive vnode lock on the
vnode: I believe that new invocations of VOP_ACCESS() are always called
with the lock held.
o Some direct checks of the uid remain, largely associated with the QUOTA
and SUIDDIR code.

Reviewed by: eivind
Obtained from: TrustedBSD Project


# 66886 09-Oct-2000 eivind

Blow away the v_specmountpoint define, replacing it with what it was
defined as (rdev->si_mountpoint)


# 66720 06-Oct-2000 jasone

Do not call lockdestroy() for v_vnlock, which may point to a lock in a
deeper vfs stacking layer.

Submitted by: bp


# 66686 05-Oct-2000 eivind

Style fixes based on comments by bde


# 66615 04-Oct-2000 jasone

Convert lockmgr locks from using simple locks to using mutexes.

Add lockdestroy() and appropriate invocations, which corresponds to
lockinit() and must be called to clean up after a lockmgr lock is no
longer needed.


# 66541 02-Oct-2000 bp

Move KASSERTs which checks value of v_usecount after vnode locking, so
it will not produce wrong alarms.


# 66411 27-Sep-2000 mckusick

Do the right thing if bdevvp is called twice for the same device.

Obtained from: Poul-Henning Kamp <phk@freebsd.org>


# 66355 25-Sep-2000 bp

Add a lock structure to vnode structure. Previously it was either allocated
separately (nfs, cd9660 etc) or keept as a first element of structure
referenced by v_data pointer(ffs). Such organization leads to known problems
with stacked filesystems.

From this point vop_no*lock*() functions maintain only interlock lock.
vop_std*lock*() functions maintain built-in v_lock structure using lockmgr().
vop_sharedlock() is compatible with vop_stdunlock(), but maintains a shared
lock on vnode.

If filesystem wishes to export lockmgr compatible lock, it can put an address
of this lock to v_vnlock field. This indicates that the upper filesystem
can take advantage of it and use single lock structure for entire (or part)
of stack of vnodes. This field shouldn't be examined or modified by VFS code
except for initialization purposes.

Reviewed in general by: mckusick


# 66244 22-Sep-2000 eivind

Style fixes:
* Add lots of comments
* Convert a couple of assertions to KASSERT()
* Minimal whitespace & misapplied {} fixes
* Convert #if 0 to #if COMPILING_LINT for code we presently do not
support, but want to keep available.

Reviewed by: adrian, markm


# 66242 22-Sep-2000 eivind

Staticize addalias()


# 66168 21-Sep-2000 alfred

comment vfs_export functions, requested by: eivind


# 66130 20-Sep-2000 rwatson

o Add additional comment describing vaccess() behavior.

Requested by: eivind
Reviewed by: eivind, adrian


# 66067 19-Sep-2000 phk

Rename lminor() to dev2unit(). This function gives a linear unit number
which hides the 'hole' in the minor bits.

Introduce unit2minor() to do the reverse operation.

Fix some some make_dev() calls which didn't use UID_* or GID_* macros.

Kill the v_hashchain alias macro, it hides the real relationship.

Introduce experimental SI_CHEAPCLONE flag set it on cloned bpfs.


# 65770 12-Sep-2000 bp

Add three new VOPs: VOP_CREATEVOBJECT, VOP_DESTROYVOBJECT and VOP_GETVOBJECT.
They will be used by nullfs and other stacked filesystems to support full
cache coherency.

Reviewed in general by: mckusick, dillon


# 65557 07-Sep-2000 jasone

Major update to the way synchronization is done in the kernel. Highlights
include:

* Mutual exclusion is used instead of spl*(). See mutex(9). (Note: The
alpha port is still in transition and currently uses both.)

* Per-CPU idle processes.

* Interrupts are run in their own separate kernel threads and can be
preempted (i386 only).

Partially contributed by: BSDi (BSD/OS)
Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh


# 65516 06-Sep-2000 rwatson

o Synchronize vaccess() capability access control checks with TrustedBSD
tree.

Obtained from: TrustedBSD Project


# 65492 05-Sep-2000 phk

Move extern declaration of dead_vnodeop_p to a .h file.

Remove race condition in vn_isdisk().


# 65200 29-Aug-2000 rwatson

o Restructure vaccess() so as to check for DAC permission to modify the
object before falling back on privilege. Make vaccess() accept an
additional optional argument, privused, to determine whether
privilege was required for vaccess() to return 0. Add commented
out capability checks for reference. Rename some variables to make
it more clear which modes/uids/etc are associated with the object,
and which with the access mode.
o Update file system use of vaccess() to pass NULL as the optional
privused argument. Once additional patches are applied, suser()
will no longer set ASU, so privused will permit passing of
privilege information up the stack to the caller.

Reviewed by: bde, green, phk, -security, others
Obtained from: TrustedBSD Project


# 64875 20-Aug-2000 phk

Fix typo in last commit.


# 64865 20-Aug-2000 phk

Centralize the canonical vop_access user/group/other check in vaccess().

Discussed with: bde


# 63788 24-Jul-2000 mckusick

This patch corrects the first round of panics and hangs reported
with the new snapshot code.

Update addaliasu to correctly implement the semantics of the old
checkalias function. When a device vnode first comes into existence,
check to see if an anonymous vnode for the same device was created
at boot time by bdevvp(). If so, adopt the bdevvp vnode rather than
creating a new vnode for the device. This corrects a problem which
caused the kernel to panic when taking a snapshot of the root
filesystem.

Change the calling convention of vn_write_suspend_wait() to be the
same as vn_start_write().

Split out softdep_flushworklist() from softdep_flushfiles() so that
it can be used to clear the work queue when suspending filesystem
operations.

Access to buffers becomes recursive so that snapshots can recursively
traverse their indirect blocks using ffs_copyonwrite() when checking
for the need for copy on write when flushing one of their own indirect
blocks. This eliminates a deadlock between the syncer daemon and a
process taking a snapshot.

Ensure that softdep_process_worklist() can never block because of a
snapshot being taken. This eliminates a problem with buffer starvation.

Cleanup change in ffs_sync() which did not synchronously wait when
MNT_WAIT was specified. The result was an unclean filesystem panic
when doing forcible unmount with heavy filesystem I/O in progress.

Return a zero'ed block when reading a block that was not in use at
the time that a snapshot was taken. Normally, these blocks should
never be read. However, the readahead code will occationally read
them which can cause unexpected behavior.

Clean up the debugging code that ensures that no blocks be written
on a filesystem while it is suspended. Snapshots must explicitly
label the blocks that they are writing during the suspension so that
they do not cause a `write on suspended filesystem' panic.

Reorganize ffs_copyonwrite() to eliminate a deadlock and also to
prevent a race condition that would permit the same block to be
copied twice. This change eliminates an unexpected soft updates
inconsistency in fsck caused by the double allocation.

Use bqrelse rather than brelse for buffers that will be needed
soon again by the snapshot code. This improves snapshot performance.


# 62976 11-Jul-2000 mckusick

Add snapshots to the fast filesystem. Most of the changes support
the gating of system calls that cause modifications to the underlying
filesystem. The gating can be enabled by any filesystem that needs
to consistently suspend operations by adding the vop_stdgetwritemount
to their set of vnops. Once gating is enabled, the function
vfs_write_suspend stops all new write operations to a filesystem,
allows any filesystem modifying system calls already in progress
to complete, then sync's the filesystem to disk and returns. The
function vfs_write_resume allows the suspended write operations to
begin again. Gating is not added by default for all filesystems as
for SMP systems it adds two extra locks to such critical kernel
paths as the write system call. Thus, gating should only be added
as needed.

Details on the use and current status of snapshots in FFS can be
found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness
is not included here. Unless and until you create a snapshot file,
these changes should have no effect on your system (famous last words).


# 62776 07-Jul-2000 bp

Fix support for more than 256 simultaneous mounts. Theoretical limit
is 2^16 mounts per fs type.

Reported by: Troy Arie Cobb <tcobb@staff.circle.net> via phk
Reviewed by: bde


# 62573 04-Jul-2000 phk

Previous commit changing SYSCTL_HANDLER_ARGS violated KNF.

Pointed out by: bde


# 62552 04-Jul-2000 mckusick

Simplify and rationalise the management of the vnode free list
(preparing the code to add snapshots).


# 62549 04-Jul-2000 mckusick

If a buffer flush fails when trying to reclaim a vnode, it is too
late to save the vnode, so just toss any remaining unwritten buffers
rather than leaving them lying around to make trouble in the future.


# 62469 03-Jul-2000 phk

Make the two calls from kern/* into softupdates #ifdef SOFTUPDATES,
that is way cleaner than using the softupdates_stub stunt, which
should be killed when convenient.

Discussed with: mckusick


# 62454 03-Jul-2000 phk

Style police catches up with rev 1.26 of src/sys/sys/sysctl.h:

Sanitize SYSCTL_HANDLER_ARGS so that simplistic tools can grog our
sources:

-sysctl_vm_zone SYSCTL_HANDLER_ARGS
+sysctl_vm_zone (SYSCTL_HANDLER_ARGS)


# 62148 27-Jun-2000 phk

Move prtactive to vfs from ufs. It is used all over the place.


# 61724 16-Jun-2000 phk

Virtualizes & untangles the bioops operations vector.

Ref: Message-ID: <18317.961014572@critter.freebsd.dk> To: current@


# 60938 26-May-2000 jake

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


# 60833 23-May-2000 jake

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


# 60539 14-May-2000 asmodai

Fix the rootmount code for now.
This function will probably rewritten/renamed to devpp.

Submitted by: Assar Westerlund <assar@sics.se> on -current
Confirmed to work: Steinar Haug <sthaug@nethelp.no>,
Manfred Antar <mantar@pacbell.net>
Reviewed by: phk


# 60041 05-May-2000 phk

Separate the struct bio related stuff out of <sys/buf.h> into
<sys/bio.h>.

<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bdes teachings on the
subject of nested includes.

Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.> unless they need caching of data.

Still a few bogus uses of struct buf to track down.

Repocopy by: peter


# 58349 20-Mar-2000 phk

Rename the existing BUF_STRATEGY() to DEV_STRATEGY()

substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo)

substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo)

This patch is machine generated except for the ccd.c and buf.h parts.


# 58185 18-Mar-2000 chris

In vn_isdisk(), check whether vp->v_rdev is NULL. If it is, then
return ENXIO (Device not configured). Without this, vn_isdisk()
could (and did in the case of lstat() under fdesc) pass a NULL pointer
to devsw(), which caused a page fault.

Reviewed by: alfred


# 58132 16-Mar-2000 phk

Eliminate the undocumented, experimental, non-delivering and highly
dangerous MAX_PERF option.


# 58059 14-Mar-2000 bde

Don't try so hard to make the lower 16 bits of fsids unique. It tended
to recycle full fsids after only 16 mount/unmount's. This is probably
too often for exported fsids. Now we recycle the full fsids only
after 2^16 mount/ umount's and only ensure uniqueness in the lower 16
bits if there have been <= 256 calls to vfs_getnewfsid() since the
system started.


# 57931 12-Mar-2000 bde

Try harder to make the lower 16 bits of fsids unique. The vfs type
number was packed very wastefully, giving perfect non-uniqeness in
the lower 16 bits of fsids for filesystems with the same vfs type.
This made linux_stat() return perfectly non-unique (broken) 16-bit
st_dev's for nfs mount points, and effectively reduced mntid_base to
8 bits so that the vfs_getnewfsid() looped endlessly when there are
already 256 mounted filesystems with the required vfs type.

Approved by: jkh


# 57025 07-Feb-2000 sos

Do refcounting of open devices (more) correctly.

count_dev funtion by phk.


# 56949 02-Feb-2000 rwatson

Remove static qualifier from vgonel, as it is needed by the Arla folk
outside of vfs_subr.c.

Submitted by: Assar Westerlund <assar@sics.se>
Reviewed by: rwatson
Approved by: jkh


# 56837 29-Jan-2000 rwatson

This patch fixes a locking bug that can result in deadlock if
the codepath is followed.

From the PR:

vclean calls vrele leading to deadlock (if usecount > 0)

vclean() calls vrele() if v_usecount of the node was higher than one.
But before calling it, it sets the VXLOCK flag, which will make
vn_lock called from vrele dead-lock.

PR: kern/15117
Submitted by: Assar Westerlund <assar@stacken.kth.se>
Reviewed by: rwatson
Obtained from: NetBSD


# 55756 10-Jan-2000 phk

Give vn_isdisk() a second argument where it can return a suitable errno.

Suggested by: bde


# 55695 10-Jan-2000 mckusick

Remove the P_BUFEXHAUST flag from the syncer process (leaving
it only on the buf_daemon process). The problem is that when the
syncer process starts running the worklist, it wants to delete
lots of files. It does this by VFS_VGET'ing the vnodes, clearing
the blocks in them and bdwrite'ing the buffer. It can process close
to a thousand files per second which generates a large number of
dirty buffers. So, giving it special priviledge at the buffer trough
leads to trouble as the buf_daemon does occationally need a free
buffer to proceed and if the syncer has used every last one up,
we are toast.


# 55611 08-Jan-2000 eivind

Change NDFREE() from a macro to a function for the time being; the macro
version caused intolerable bloat (30k). I'm likely to revisit this with an
attempt at a smarter macro.

Bloat noticed by: bde


# 55539 07-Jan-2000 luoqi

Introduce a mechanism to suspend/resume system processes. Suspend syncer
and bufdaemon prior to disk sync during system shutdown.


# 55431 05-Jan-2000 dillon

Enhance reassignbuf(). When a buffer cannot be time-optimally inserted
into vnode dirtyblkhd we append it to the list instead of prepend it to
the list in order to maintain a 'forward' locality of reference, which
is arguably better then 'reverse'. The original algorithm did things this
way to but at a huge time cost.

Enhance the append interlock for NFS writes to handle intr/soft mounts
better.

Fix the hysteresis for NFS async daemon I/O requests to reduce the
number of unnecessary context switches.

Modify handling of NFS mount options. Any given user option that is
too high now defaults to the kernel maximum for that option rather then
the kernel default for that option.

Reviewed by: Alfred Perlstein <bright@wintelcom.net>


# 54989 22-Dec-1999 mckusick

Prettyness police: Identify flags in b_xflags with BX_ to distinguish
them from flags in b_flags which are prefixed with B_


# 54467 12-Dec-1999 dillon

Add MAP_NOSYNC feature to mmap(), and MADV_NOSYNC and MADV_AUTOSYNC to
madvise().

This feature prevents the update daemon from gratuitously flushing
dirty pages associated with a mapped file-backed region of memory. The
system pager will still page the memory as necessary and the VM system
will still be fully coherent with the filesystem. Modifications made
by other means to the same area of memory, for example by write(), are
unaffected. The feature works on a page-granularity basis.

MAP_NOSYNC allows one to use mmap() to share memory between processes
without incuring any significant filesystem overhead, putting it in
the same performance category as SysV Shared memory and anonymous memory.

Reviewed by: julian, alc, dg


# 54444 11-Dec-1999 eivind

Lock reporting and assertion changes.
* lockstatus() and VOP_ISLOCKED() gets a new process argument and a new
return value: LK_EXCLOTHER, when the lock is held exclusively by another
process.
* The ASSERT_VOP_(UN)LOCKED family is extended to use what this gives them
* Extend the vnode_if.src format to allow more exact specification than
locked/unlocked.

This commit should not do any semantic changes unless you are using
DEBUG_VFS_LOCKS.

Discussed with: grog, mch, peter, phk
Reviewed by: peter


# 53900 29-Nov-1999 dillon

Remove vfs_getrootfsid() function (a temporary hack added a few months
ago to make BOOTP work again). It is no longer required by BOOTP and
no longer used.


# 53577 22-Nov-1999 phk

Convert various pieces of code to use vn_isdisk() rather than checking
for vp->v_type == VBLK.

In ccd: we don't need to call VOP_GETATTR to find the type of a vnode.

Reviewed by: sos


# 53452 20-Nov-1999 phk

struct mountlist and struct mount.mnt_list have no business being
a CIRCLEQ. Change them to TAILQ_HEAD and TAILQ_ENTRY respectively.

This removes ugly mp != (void*)&mountlist comparisons.

Requested by: phk
Submitted by: Jake Burkholder jake@checker.org
PR: 14967


# 53225 16-Nov-1999 phk

Commit the remaining part of PR14914:

Alot of the code in sys/kern directly accesses the *Q_HEAD and *Q_ENTRY
structures for list operations. This patch makes all list operations
in sys/kern use the queue(3) macros, rather than directly accessing the
*Q_{HEAD,ENTRY} structures.

Reviewed by: phk
Submitted by: Jake Burkholder <jake@checker.org>
PR: 14914


# 53059 09-Nov-1999 phk

Next step in the device cleanup process.

Correctly lock vnodes when calling VOP_OPEN() from filesystem mount code.

Unify spec_open() for bdev and cdev cases.

Remove the disabled bdev specific read/write code.


# 52635 29-Oct-1999 phk

useracc() the prequel:

Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.


# 52128 11-Oct-1999 peter

Trim unused options (or #ifdef for undoc options).

Submitted by: phk


# 51926 04-Oct-1999 phk

Move the buffered read/write code out of spec_{read|write} and into
two new functions spec_buf{read|write}.

Add sysctl vfs.bdev_buffered which defaults to 1 == true. This
sysctl can be used to experimentally turn buffered behaviour for
bdevs off. I should not be changed while any blockdevices are
open. Remove the misplaced sysctl vfs.enable_userblk_io.

No other changes in behaviour.


# 51797 29-Sep-1999 phk

Remove v_maxio from struct vnode.

Replace it with mnt_iosize_max in struct mount.

Nits from: bde


# 51488 21-Sep-1999 dillon

Final commit to remove vnode->v_lastr. vm_fault now handles read
clustering issues (replacing code that used to be in
ufs/ufs/ufs_readwrite.c). vm_fault also now uses the new VM page counter
inlines.

This completes the changeover from vnode->v_lastr to vm_entry_t->v_lastr
for VM, and fp->f_nextread and fp->f_seqcount (which have been in the
tree for a while). Determination of the I/O strategy (sequential, random,
and so forth) is now handled on a descriptor-by-descriptor basis for
base I/O calls, and on a memory-region-by-memory-region and
process-by-process basis for VM faults.

Reviewed by: David Greenman <dg@root.com>, Alan Cox <alc@cs.rice.edu>


# 51478 20-Sep-1999 phk

Initialize vp->v_maxio to its default in getnetvnode() rather than
four different places in vfs_cluster.c


# 51388 19-Sep-1999 dillon

Fix BOOTP root FS mounts. Also cleanup vfs_getnewfsid() and collapse
addaliasu() into addalias() (no operational change) and clarify comments
relating to a trick that vclean() uses.

The fix to BOOTP is yet another hack. Actually, rootfsid handling
is already a major hack. The whole thing needs to be cleaned up.

Reviewed by: David Greenman <dg@root.com>, Alan Cox <alc@cs.rice.edu>


# 51345 17-Sep-1999 dillon

Add vfs.enable_userblk_io sysctl to control whether user reads and writes
to buffered block devices are allowed. The default is to be backwards
compatible, i.e. reads and writes are allowed.

The idea is for a larger crowd to start running with this disabled and
see what problems, if any, crop up, and then to change the default to
off and see if any problems crop up in the next 6 months prior to
potentially removing support entirely. There are still a few people,
Julian and myself included, who believe the buffered block device
access from usermode to be useful.

Remove use of vnode->v_lastr from buffered block device I/O in
preparation for removal of vnode->v_lastr field, replacing it with
the already existing seqcount metric to detect sequential operation.

Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>


# 50549 29-Aug-1999 phk

Add dev_t freeing code. Controlled by sysctl debug.free_devt, default
is off.


# 50521 28-Aug-1999 phk

remove unused variables.


# 50477 28-Aug-1999 peter

$Id$ -> $FreeBSD$


# 50405 26-Aug-1999 phk

Simplify the handling of VCHR and VBLK vnodes using the new dev_t:

Make the alias list a SLIST.

Drop the "fast recycling" optimization of vnodes (including
the returning of a prexisting but stale vnode from checkalias).
It doesn't buy us anything now that we don't hardlimit
vnodes anymore.

Rename checkalias2() and checkalias() to addalias() and
addaliasu() - which takes dev_t and udev_t arg respectively.

Make the revoke syscalls use vcount() instead of VALIASED.

Remove VALIASED flag, we don't need it now and it is faster
to traverse the much shorter lists than to maintain the
flag.

vfs_mountedon() can check the dev_t directly, all the vnodes
point to the same one.

Print the devicename in specfs/vprint().

Remove a couple of stale LFS vnode flags.

Remove unimplemented/unused LK_DRAINED;


# 50347 25-Aug-1999 phk

Introduce vn_isdisk(struct vnode *vp) function, and use it to test for diskness.


# 50334 25-Aug-1999 julian

Make DEVFS use PHK's specinfo struct as the source of dev_t and devsw.

In lookup() however it's the other way around as we need to supply the
dev_t for the vnode, so devfs still has a copy of it stashed away.

Sourcing it from the vnode in the vnops however is useful as it makes
a lot of the code almost the same as that in specfs.


# 50137 22-Aug-1999 jdp

Support full-precision file timestamps. Until now, only the seconds
have been maintained, and that is still the default. A new sysctl
variable "vfs.timestamp_precision" can be used to enable higher
levels of precision:

0 = seconds only; nanoseconds zeroed (default).
1 = seconds and nanoseconds, accurate within 1/HZ.
2 = seconds and nanoseconds, truncated to microseconds.
>=3 = seconds and nanoseconds, maximum precision.

Level 1 uses getnanotime(), which is fast but can be wrong by up
to 1/HZ. Level 2 uses microtime(). It might be desirable for
consistency with utimes() and friends, which take timeval structures
rather than timespecs. Level 3 uses nanotime() for the higest
precision.

I benchmarked levels 0, 1, and 3 by copying a 550 MB tree with
"cpio -pdu". There was almost negligible difference in the system
times -- much less than 1%, and less than the variation among
multiple runs at the same level. Bruce Evans dreamed up a torture
test involving 1-byte reads with intervening fstat() calls, but
the cpio test seems more realistic to me.

This feature is currently implemented only for the UFS (FFS and
MFS) filesystems. But I think it should be easy to support it in
the others as well.

An earlier version of this was reviewed by Bruce. He's not to
blame for any breakage I've introduced since then.

Reviewed by: bde (an earlier version of the code)


# 49679 13-Aug-1999 phk

The bdevsw() and cdevsw() are now identical, so kill the former.


# 49678 13-Aug-1999 phk

s/v_specinfo/v_rdev/


# 49535 08-Aug-1999 phk

Decommision miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>,
a few lines into <sys/vnode.h>.

Add a few fields to struct specinfo, paving the way for the fun part.


# 49101 26-Jul-1999 alc

Add sysctl and support code to allow directories to be VMIO'd. The default
setting for the sysctl is OFF, which is the historical operation.

Submitted by: dillon


# 48936 20-Jul-1999 phk

Now a dev_t is a pointer to struct specinfo which is shared by all specdev
vnodes referencing this device.

Details:
cdevsw->d_parms has been removed, the specinfo is available
now (== dev_t) and the driver should modify it directly
when applicable, and the only driver doing so, does so:
vn.c. I am not sure the logic in checking for "<" was right
before, and it looks even less so now.

An intial pool of 50 struct specinfo are depleted during
early boot, after that malloc had better work. It is
likely that fewer than 50 would do.

Hashing is done from udev_t to dev_t with a prime number
remainder hash, experiments show no better hash available
for decent cost (MD5 is only marginally better) The prime
number used should not be close to a power of two, we use
83 for now.

Add new checkalias2() to get around the loss of info from
dev2udev() in bdevvp();

The aliased vnodes are hung on a list straight of the dev_t,
and speclisth[SPECSZ] is unused. The sharing of struct
specinfo means that the v_specnext moves into the vnode
which grows by 4 bytes.

Don't use a VBLK dev_t which doesn't make sense in MFS, now
we hang a dummy cdevsw on B/Cmaj 253 so that things look sane.

Storage overhead from all of this is O(50k).

Bump __FreeBSD_version to 400009

The next step will add the stuff needed so device-drivers can start to
hang things from struct specinfo


# 48892 19-Jul-1999 phk

[click] Now all dev_t's in the kernel have their char device major.

Only know casualy of this is swapinfo/pstat which should be fixes
the right way: Store the actual pathname in the kernel like mount
does. [Volounteers sought for this task]

The road map from here is roughly: expand struct specinfo into struct
based dev_t. Add dev_t registration facilities for device drivers and
start to use them.


# 48884 18-Jul-1999 phk

Introduce the vn_todev(struct vnode*) function, which returns the dev_t
corresponding to a VBLK or VCHR node, or NODEV.


# 48863 17-Jul-1999 phk

Fix 2nd arg to udev2dev().


# 48859 17-Jul-1999 phk

I have not one single time remembered the name of this function correctly
so obviously I gave it the wrong name. s/umakedev/makeudev/g


# 48777 12-Jul-1999 kris

Correct a couple of spelling errors in comments.


# 48677 08-Jul-1999 mckusick

These changes appear to give us benefits with both small (32MB) and
large (1G) memory machine configurations. I was able to run 'dbench 32'
on a 32MB system without bring the machine to a grinding halt.

* buffer cache hash table now dynamically allocated. This will
have no effect on memory consumption for smaller systems and
will help scale the buffer cache for larger systems.

* minor enhancement to pmap_clearbit(). I noticed that
all the calls to it used constant arguments. Making
it an inline allows the constants to propogate to
deeper inlines and should produce better code.

* removal of inherent vfs_ioopt support through the emplacement
of appropriate #ifdef's, with John's permission. If we do not
find a use for it by the end of the year we will remove it entirely.

* removal of getnewbufloops* counters & sysctl's - no longer
necessary for debugging, getnewbuf() is now optimal.

* buffer hash table functions removed from sys/buf.h and localized
to vfs_bio.c

* VFS_BIO_NEED_DIRTYFLUSH flag and support code added
( bwillwrite() ), allowing processes to block when too many dirty
buffers are present in the system.

* removal of a softdep test in bdwrite() that is no longer necessary
now that bdwrite() no longer attempts to flush dirty buffers.

* slight optimization added to bqrelse() - there is no reason
to test for available buffer space on B_DELWRI buffers.

* addition of reverse-scanning code to vfs_bio_awrite().
vfs_bio_awrite() will attempt to locate clusterable areas
in both the forward and reverse direction relative to the
offset of the buffer passed to it. This will probably not
make much of a difference now, but I believe we will start
to rely on it heavily in the future if we decide to shift
some of the burden of the clustering closer to the actual
I/O initiation.

* Removal of the newbufcnt and lastnewbuf counters that Kirk
added. They do not fix any race conditions that haven't already
been fixed by the gbincore() test done after the only call
to getnewbuf(). getnewbuf() is a static, so there is no chance
of it being misused by other modules. ( Unless Kirk can think
of a specific thing that this code fixes. I went through it
very carefully and didn't see anything ).

* removal of VOP_ISLOCKED() check in flushbufqueues(). I do not
think this check is necessary, the buffer should flush properly
whether the vnode is locked or not. ( yes? ).

* removal of extra arguments passed to getnewbuf() that are not
necessary.

* missed cluster_wbuild() that had to be a cluster_wbuild_wb() in
vfs_cluster.c

* vn_write() now calls bwillwrite() *PRIOR* to locking the vnode,
which should greatly aid flushing operations in heavy load
situations - both the pageout and update daemons will be able
to operate more efficiently.

* removal of b_usecount. We may add it back in later but for now
it is useless. Prior implementations of the buffer cache never
had enough buffers for it to be useful, and current implementations
which make more buffers available might not benefit relative to
the amount of sophistication required to implement a b_usecount.
Straight LRU should work just as well, especially when most things
are VMIO backed. I expect that (even though John will not like
this assumption) directories will become VMIO backed some point soon.

Submitted by: Matthew Dillon <dillon@backplane.com>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>


# 48544 04-Jul-1999 mckusick

The buffer queue mechanism has been reformulated. Instead of having
QUEUE_AGE, QUEUE_LRU, and QUEUE_EMPTY we instead have QUEUE_CLEAN,
QUEUE_DIRTY, QUEUE_EMPTY, and QUEUE_EMPTYKVA. With this patch clean
and dirty buffers have been separated. Empty buffers with KVM
assignments have been separated from truely empty buffers. getnewbuf()
has been rewritten and now operates in a 100% optimal fashion. That is,
it is able to find precisely the right kind of buffer it needs to
allocate a new buffer, defragment KVM, or to free-up an existing buffer
when the buffer cache is full (which is a steady-state situation for
the buffer cache).

Buffer flushing has been reorganized. Previously buffers were flushed
in the context of whatever process hit the conditions forcing buffer
flushing to occur. This resulted in processes blocking on conditions
unrelated to what they were doing. This also resulted in inappropriate
VFS stacking chains due to multiple processes getting stuck trying to
flush dirty buffers or due to a single process getting into a situation
where it might attempt to flush buffers recursively - a situation that
was only partially fixed in prior commits. We have added a new daemon
called the buf_daemon which is responsible for flushing dirty buffers
when the number of dirty buffers exceeds the vfs.hidirtybuffers limit.
This daemon attempts to dynamically adjust the rate at which dirty buffers
are flushed such that getnewbuf() calls (almost) never block.

The number of nbufs and amount of buffer space is now scaled past the
8MB limit that was previously imposed for systems with over 64MB of
memory, and the vfs.{lo,hi}dirtybuffers limits have been relaxed
somewhat. The number of physical buffers has been increased with the
intention that we will manage physical I/O differently in the future.

reassignbuf previously attempted to keep the dirtyblkhd list sorted which
could result in non-deterministic operation under certain conditions,
such as when a large number of dirty buffers are being managed. This
algorithm has been changed. reassignbuf now keeps buffers locally sorted
if it can do so cheaply, and otherwise gives up and adds buffers to
the head of the dirtyblkhd list. The new algorithm is deterministic but
not perfect. The new algorithm greatly reduces problems that previously
occured when write_behind was turned off in the system.

The P_FLSINPROG proc->p_flag bit has been replaced by the more descriptive
P_BUFEXHAUST bit. This bit allows processes working with filesystem
buffers to use available emergency reserves. Normal processes do not set
this bit and are not allowed to dig into emergency reserves. The purpose
of this bit is to avoid low-memory deadlocks.

A small race condition was fixed in getpbuf() in vm/vm_pager.c.

Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>


# 48468 02-Jul-1999 phk

Make sure that stat(2) and friends always return a valid st_dev field.

Pseudo-FS need not fill in the va_fsid anymore, the syscall code
will use the first half of the fsid, which now looks like a udev_t
with major 255.


# 48391 01-Jul-1999 peter

Slight reorganization of kernel thread/process creation. Instead of using
SYSINIT_KT() etc (which is a static, compile-time procedure), use a
NetBSD-style kthread_create() interface. kproc_start is still available
as a SYSINIT() hook. This allowed simplification of chunks of the
sysinit code in the process. This kthread_create() is our old kproc_start
internals, with the SYSINIT_KT fork hooks grafted in and tweaked to work
the same as the NetBSD one.

One thing I'd like to do shortly is get rid of nfsiod as a user initiated
process. It makes sense for the nfs client code to create them on the
fly as needed up to a user settable limit. This means that nfsiod
doesn't need to be in /sbin and is always "available". This is a fair bit
easier to do outside of the SYSINIT_KT() framework.


# 48225 26-Jun-1999 mckusick

Convert buffer locking from using the B_BUSY and B_WANTED flags to using
lockmgr locks. This commit should be functionally equivalent to the old
semantics. That is, all buffer locking is done with LK_EXCLUSIVE
requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will
be done in future commits.


# 47964 16-Jun-1999 mckusick

Add a vnode argument to VOP_BWRITE to get rid of the last vnode
operator special case. Delete special case code from vnode_if.sh,
vnode_if.src, umap_vnops.c, and null_vnops.c.


# 47940 15-Jun-1999 mckusick

Get rid of the global variable rushjob and replace it with a function in
kern/vfs_subr.c named speedup_syncer() which handles the speedup request.
Change the various clients of rushjob to use the new function.


# 47640 31-May-1999 phk

Simplify cdevsw registration.

The cdevsw_add() function now finds the major number(s) in the
struct cdevsw passed to it. cdevsw_add_generic() is no longer
needed, cdevsw_add() does the same thing.

cdevsw_add() will print an message if the d_maj field looks bogus.

Remove nblkdev and nchrdev variables. Most places they were used
bogusly. Instead check a dev_t for validity by seeing if devsw()
or bdevsw() returns NULL.

Move bdevsw() and devsw() functions to kern/kern_conf.c

Bump __FreeBSD_version to 400006

This commit removes:
72 bogus makedev() calls
26 bogus SYSINIT functions

if_xe.c bogusly accessed cdevsw[], author/maintainer please fix.

I4b and vinum not changed. Patches emailed to authors. LINT
probably broken until they catch up.


# 47445 24-May-1999 jb

Remove the test for bdevsw(dev) == NULL from bdevvp() because it fails
if there is no character device associated with the block device. In this
case that doesn't matter because bdevvp() doesn't use the character
device structure.

I can use the pointy bit of the axe too.


# 47202 14-May-1999 luoqi

Legally acquire a major number for mfs.


# 47132 14-May-1999 mckusick

Previously directories were sync'ed every 10 seconds while bitmaps &
inodes were synced every 15 seconds. This is now reversed as during
directory create, we cannot commit the directory entry until its
inode has been written. With this switch, the inodes will be more
likely to be written by the time that the directory is written thus
reducing the number of directory rollbacks that are needed.


# 47075 12-May-1999 peter

Fix (?) SPECHASH dev_t/major/minor/etc args


# 47065 12-May-1999 phk

Don't peek into dev_t


# 47028 11-May-1999 phk

Divorce "dev_t" from the "major|minor" bitmap, which is now called
udev_t in the kernel but still called dev_t in userland.

Provide functions to manipulate both types:
major() umajor()
minor() uminor()
makedev() umakedev()
dev2udev() udev2dev()

For now they're functions, they will become in-line functions
after one of the next two steps in this process.

Return major/minor/makedev to macro-hood for userland.

Register a name in cdevsw[] for the "filedescriptor" driver.

In the kernel the udev_t appears in places where we have the
major/minor number combination, (ie: a potential device: we
may not have the driver nor the device), like in inodes, vattr,
cdevsw registration and so on, whereas the dev_t appears where
we carry around a reference to a actual device.

In the future the cdevsw and the aliased-from vnode will be hung
directly from the dev_t, along with up to two softc pointers for
the device driver and a few houskeeping bits. This will essentially
replace the current "alias" check code (same buck, bigger bang).

A little stunt has been provided to try to catch places where the
wrong type is being used (dev_t vs udev_t), if you see something
not working, #undef DEVT_FASCIST in kern/kern_conf.c and see if
it makes a difference. If it does, please try to track it down
(many hands make light work) or at least try to reproduce it
as simply as possible, and describe how to do that.

Without DEVT_FASCIST I belive this patch is a no-op.

Stylistic/posixoid comments about the userland view of the <sys/*.h>
files welcome now, from userland they now contain the end result.

Next planned step: make all dev_t's refer to the same devsw[] which
means convert BLK's to CHR's at the perimeter of the vnodes and
other places where they enter the game (bootdev, mknod, sysctl).


# 46679 08-May-1999 phk

Fix some of the places where too much inside knowledge about major/minor
layout and dev_t structure is being (ab)used.


# 46676 08-May-1999 phk

I got tired of seeing all the cdevsw[major(foo)] all over the place.

Made a new (inline) function devsw(dev_t dev) and substituted it.

Changed to the BDEV variant to this format as well: bdevsw(dev_t dev)

DEVFS will eventually benefit from this change too.


# 46635 07-May-1999 phk

Continue where Julian left off in July 1998:

Virtualize bdevsw[] from cdevsw. bdevsw() is now an (inline)
function.

Join CDEV_MODULE and BDEV_MODULE to DEV_MODULE (please pay attention
to the order of the cmaj/bmaj arguments!)

Join CDEV_DRIVER_MODULE and BDEV_DRIVER_MODULE to DEV_DRIVER_MODULE
(ditto!)

(Next step will be to convert all bdev dev_t's to cdev dev_t's
before they get to do any damage^H^H^H^H^H^Hwork in the kernel.)


# 46381 03-May-1999 billf

Add sysctl descriptions to many SYSCTL_XXXs

PR: kern/11197
Submitted by: Adrian Chadd <adrian@FreeBSD.org>
Reviewed by: billf(spelling/style/minor nits)
Looked at by: bde(style)


# 44679 12-Mar-1999 julian

Reviewed by: Many at differnt times in differnt parts,
including alan, john, me, luoqi, and kirk
Submitted by: Matt Dillon <dillon@frebsd.org>

This change implements a relatively sophisticated fix to getnewbuf().
There were two problems with getnewbuf(). First, the writerecursion
can lead to a system stack overflow when you have NFS and/or VN
devices in the system. Second, the free/dirty buffer accounting was
completely broken. Not only did the nfs routines blow it trying to
manually account for the buffer state, but the accounting that was
done did not work well with the purpose of their existance: figuring
out when getnewbuf() needs to sleep.

The meat of the change is to kern/vfs_bio.c. The remaining diffs are
all minor except for NFS, which includes both the fixes for bp
interaction AND fixes for a 'biodone(): buffer already done' lockup.
Sys/buf.h also contains a chaining structure which is not used by
this patchset but is used by other patches that are coming soon.
This patch deliniated by tags PRE_MAT_GETBUF and POST_MAT_GETBUF.
(sorry for the missing T matt)


# 44247 25-Feb-1999 dillon

Reviewed by: Julian Elischer <julian@whistle.com>

Add d_parms() to {c,b}devsw[]. If non-NULL this function points to
a device routine that will properly fill in the specinfo structure.
vfs_subr.c's checkalias() supplies appropriate defaults. This change
should be fully backwards compatible with existing devices.


# 44150 19-Feb-1999 dillon

Protect vn worklist and vn->v_{clean,dirty}blkhd at splbio().

Get rid of extra LIST_REMOVE()

Reviewed by: hsu@FreeBSD.ORG (Jeffrey Hsu), mckusick@McKusick.COM
Submitted by: hsu@FreeBSD.ORG (Jeffrey Hsu), dillon@backplane.com ( Matthew Dillon )


# 43618 04-Feb-1999 dillon

vp->v_object must be valid after normal flow of vfs_object_create()
completes, change if() to KASSERT(). This is not a bug, we are
simplify clarifying and optimizing the code.

In if/else in vfs_object_create(), the failure of both conditionals
will lead to a NULL object. Exit gracefully if this case occurs.
( this case does not normally occur, but needed to be handled ).

Obtained from: Eivind Eklund <eivind@FreeBSD.org>


# 43403 29-Jan-1999 dillon

More const fixes for -Wall, -Wcast-qual


# 43311 28-Jan-1999 dillon

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# 42957 21-Jan-1999 dillon

This is a rather large commit that encompasses the new swapper,
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough in regards to code
cleanup issues.

Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>


# 42453 10-Jan-1999 eivind

KNFize, by bde.


# 42408 08-Jan-1999 eivind

Split DIAGNOSTIC -> DIAGNOSTIC, INVARIANTS, and INVARIANT_SUPPORT as
discussed on -hackers.

Introduce 'KASSERT(assertion, ("panic message", args))' for simple
check + panic.

Reviewed by: msmith


# 42315 05-Jan-1999 eivind

Remove the 'waslocked' parameter to vfs_object_create().


# 42313 05-Jan-1999 eivind

Finish staticization.


# 42248 02-Jan-1999 bde

Ifdefed conditionally used simplock variables.


# 42043 24-Dec-1998 bde

Restored rev.1.31 which was clobbered by rev.1.69 (the big Lite2
merge). This fixes at least hanging in revoke(2) when a somewhat
active slave pty is revoked. The hang made the window for the
null pointer bug in ufsspec_{read,write} much larger.

There are many other bugs in this area (revoke of an active fifo
at best leaks memory...).


# 41995 22-Dec-1998 eivind

Check return value of tsleep(). I've checked of all call points -
there does not seem to be a problem with this.

PR: kern/8732
Analysis by: David G Andersen <danderse@cs.utah.edu>
Tested by: Alfred Perlstein <bright@hotjobs.com>


# 41994 21-Dec-1998 eivind

Staticize.


# 41514 04-Dec-1998 archie

Examine all occurrences of sprintf(), strcat(), and str[n]cpy()
for possible buffer overflow problems. Replaced most sprintf()'s
with snprintf(); for others cases, added terminating NUL bytes where
appropriate, replaced constants like "16" with sizeof(), etc.

These changes include several bug fixes, but most changes are for
maintainability's sake. Any instance where it wasn't "immediately
obvious" that a buffer overflow could not occur was made safer.

Reviewed by: Bruce Evans <bde@zeta.org.au>
Reviewed by: Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by: Mike Spengler <mks@networkcs.com>


# 40787 31-Oct-1998 peter

Convert lists for bufs attached to vnodes from a LIST to a TAILQ.
- Use TAILQ_* macros extensively instead of internal names
- use b_xflags instead of the NOLIST magic number hack in the next pointer
- clean bufs are inserted at the tail rather than the head.
- redo dirty buffer insert so that metadata (negative lbn) goes to the
tail directly rather than at the HEAD. This makes a difference when
inserting dirty data blocks in lbn sorted order since data block
insertion will not have to bypass all the metadata cruft. data is
lbn sorted since it makes sense for clustering and writeback ordering,
while metadata sorting doesn't help much since the lbn's are
meaningless when walking the list for writebacks.

Small systems will not notice much (if any) benefit from this, but really
busy systems with large dirty block lists should get a lot more.

I've tested this with softdep, and it doesn't seem to mind the change of
queueing of metadata.

Reviewed (in princible) by: dg
Obtained from: partly from John Dyson's work-in-progress patches in June.


# 40777 31-Oct-1998 peter

The last argument to vm_object_page_clean() are now bit flags, rather than
the old true/false.

While here, have vfs_msync() only call vm_object_page_clean() with
OBJPC_SYNC if called with MNT_WAIT flags. vfs_msync() is called at unmount
time (with MNT_WAIT) and from the syncer process (formerly update).
This should make dirty mmap writebacks a little less nasty.

I have tested this a little with SOFTUPDATES enabled, but I don't normally
use it since I've been badly burned too many times.


# 40728 29-Oct-1998 bde

Oops, rev.1.167 made the device number checking in bdevvp() too strict
for mfs root mounts. Don't require major 255 to be in bdevsw[].


# 40722 29-Oct-1998 peter

Remove the V_SAVEMETA flag, nothing uses it any more now that msdosfs and
ext2fs call vtruncbuf() directly. This simplifies and cleans up
vinvalbuf() a little.


# 40659 26-Oct-1998 bde

Updated the major number check in vfs_object_create(). It's not
clear if the check is necessary, but vfs_object_create() is called
for all vnodes and it was silly to create objects for VBLK vnodes
that don't even have a driver.


# 40648 25-Oct-1998 phk

Nitpicking and dusting performed on a train. Removes trivial warnings
about unused variables, labels and other lint.


# 40647 25-Oct-1998 bde

Fixed device number checking in bdevvp():
- dev != NODEV was checked for, but 0 was returned on failure. This was
fixed in Lite2 (except the return code was still slightly wrong (ENODEV
instead of ENXIO)) but the changes were not merged. This case probably
doesn't actually occur under FreeBSD.
- major(dev) was not checked to have a valid non-NULL bdevsw entry. This
caused panics when the driver for the root device didn't exist.

Fixed minor misformattings in bdevvp(). Rev.1.14 consisted mainly of
gratuitous reformattings that seem to have caused many Lite2 merge
errors.

PR: 8417


# 40349 14-Oct-1998 dt

Backed out rev. 1.164. It caused problems on SMP.

PR: 8309


# 40286 13-Oct-1998 dg

Fixed two potentially serious classes of bugs:

1) The vnode pager wasn't properly tracking the file size due to
"size" being page rounded in some cases and not in others.
This sometimes resulted in corrupted files. First noticed by
Terry Lambert.
Fixed by changing the "size" pager_alloc parameter to be a 64bit
byte value (as opposed to a 32bit page index) and changing the
pagers and their callers to deal with this properly.
2) Fixed a bogus type cast in round_page() and trunc_page() that
caused some 64bit offsets and sizes to be scrambled. Removing
the cast required adding casts at a few dozen callers.
There may be problems with other bogus casts in close-by
macros. A quick check seemed to indicate that those were okay,
however.


# 40267 12-Oct-1998 dt

UnVMIO vnodes of block devices when they are no longer in use. (Some
things, like msdosfs, do not work (panic) on devices with VMIO enabled.
FFS enable VMIO on mounted devices, and nothing previously disabled it, so,
after you mounted FFS floppy, you could not mount msdosfs floppy anymore...)

This is mostly a quick before-release fix.

Reviewed by: bde


# 39187 14-Sep-1998 sos

Remove the SLICE code.
This clearly needs alot more thought, and we dont need this to hunt
us down in 3.0-RELEASE.


# 38866 05-Sep-1998 bde

Instantiate `nfs_mount_type' in a standard file so that it is present
when nfs is an LKM. Declare it in a header file. Don't forget to use
it in non-Lite2 code. Initialize it to -1 instead of to 0, since 0
will soon be the mount type number for the first vfs loaded.

NetBSD uses strcmp() to avoid this ugly global.


# 38618 29-Aug-1998 bde

Oops, the previous revision unconfigured too much pre-Lite2 compatibilty
cruft. At least lsvfs(1) was broken.


# 38289 12-Aug-1998 bde

Don't configure compatibility code for pre-Lite2 mount() calls by
default. This code should go away soon.


# 37599 12-Jul-1998 dfr

Initialise all the fields separately in vattr_null since on the alpha
they are not all the same width.


# 37555 11-Jul-1998 bde

Fixed printf format errors.


# 37101 21-Jun-1998 bde

Removed unused includes.


# 36874 10-Jun-1998 julian

Replace 'sleep()' with 'tsleep()'
Accidentally imported from Kirk's codebase.

Pointed out by: various.


# 36862 10-Jun-1998 julian

Submitted by: Kirk McKusick <mckusick@McKusick.COM>

Fix for potential hang when trying to reboot the system or
to forcibly unmount a soft update enabled filesystem.
FreeBSD already handled the reboot case differently, this is however a better
fix.


# 36735 07-Jun-1998 dfr

This commit fixes various 64bit portability problems required for
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
inline with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.

The prototype FreeBSD/alpha machdep will follow in a couple of days
time.


# 36126 17-May-1998 tegge

Supply the correct process argument to dounmount when possible.


# 35319 19-Apr-1998 julian

Add changes and code to implement a functional DEVFS.
This code will be turned on with the TWO options
DEVFS and SLICE. (see LINT)
Two labels PRE_DEVFS_SLICE and POST_DEVFS_SLICE will deliniate these changes.

/dev will be automatically mounted by init (thanks phk)
on bootup. See /sys/dev/slice/slice.4 for more info.
All code should act the same without these options enabled.

Mike Smith, Poul Henning Kamp, Soeren, and a few dozen others

This code does not support the following:
bad144 handling.
Persistance. (My head is still hurting from the last time we discussed this)
ATAPI flopies are not handled by the SLICE code yet.

When this code is running, all major numbers are arbitrary and COULD
be dynamically assigned. (this is not done, for POLA only)
Minor numbers for disk slices ARE arbitray and dynamically assigned.


# 35264 18-Apr-1998 peter

In vfs_msync(), test to see if the vnode being examined is "interesting"
(ie: it has a vm_object attached and is marked as OBJ_MIGHTBEDIRTY) before
attempting to lock it. This should reduce the cpu hit that is incurred
when doing a sync(2) and when the syncer process is doing the 30-second
writeback of dirty mmap() data to disk. Skip this speedup if we are
doing an unmount() to be sure to get everything - we can afford to
occasionally miss a msync while the system is running, but not at unmount.

I'm not sure about the VXLOCK and MNT_WAIT case, it seems a bit odd to skip
doing a page_clean at unmount time just because a vnode is VXLOCKed, but
that's what was being done before...


# 35220 16-Apr-1998 peter

When the softdep conversion took place, the periodic vfs_msync() from
update got lost. This is responsible for ensuring that dirty mmap() pages
get periodically written to disk. Without it, long time mmap's might not
have their dirty pages written out at all of the system crashes or isn't
cleanly shut down. This could be nasty if you've got a long-running
writing via mmap(), dirty pages used to get written to disk within 30
seconds or so.


# 35214 15-Apr-1998 tegge

Unlock mountlist_slock if the mount point was busy (unmount in progress)
during the attempt at lazy fsync.


# 34961 30-Mar-1998 phk

Eradicate the variable "time" from the kernel, using various measures.
"time" wasn't a atomic variable, so splfoo() protection were needed
around any access to it, unless you just wanted the seconds part.

Most uses of time.tv_sec now uses the new variable time_second instead.

gettime() changed to getmicrotime(0.

Remove a couple of unneeded splfoo() protections, the new getmicrotime()
is atomic, (until Bruce sets a breakpoint in it).

A couple of places needed random data, so use read_random() instead
of mucking about with time which isn't random.

Add a new nfs_curusec() function.

Mark a couple of bogosities involving the now disappeard time variable.

Update ffs_update() to avoid the weird "== &time" checks, by fixing the
one remaining call that passwd &time as args.

Change profiling in ncr.c to use ticks instead of time. Resolution is
the same.

Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call
hzto() which subtracts time" sequences.

Reviewed by: bde


# 34928 28-Mar-1998 bde

Removed unused #includes.


# 34926 28-Mar-1998 bde

Don't depend on <sys/mount.h> including <sys/socket.h>.


# 34694 19-Mar-1998 dyson

In kern_physio.c fix tsleep priority messup.

In vfs_bio.c, remove b_generation count usage,
remove redundant reassignbuf,
remove redundant spl(s),
manage page PG_ZERO flags more correctly,
utilize in invalid value for b_offset until it
is properly initialized. Add asserts
for #ifdef DIAGNOSTIC, when b_offset is
improperly used.
when a process is not performing I/O, and just waiting
on a buffer generally, make the sleep priority
low.
only check page validity in getblk for B_VMIO buffers.

In vfs_cluster, add b_offset asserts, correct pointer calculation
for clustered reads. Improve readability of certain parts of
the code. Remove redundant spl(s).

In vfs_subr, correct usage of vfs_bio_awrite (From Andrew Gallatin
<gallatin@cs.duke.edu>). More vtruncbuf problems fixed.


# 34690 19-Mar-1998 dyson

Fix an embarassing problem in vtruncbuf.


# 34639 17-Mar-1998 dyson

Correct a severely evil bug in the vtruncbuf code. It didn't cause
me any problems until after the previous commit. This problem then
caused a severe case of creeping crud on my diskdrive, and hosed
my system so bad, that I needed to do a complete reinstall. Sorry!!!

I assume that others have manifest this bug.


# 34612 16-Mar-1998 dyson

Allow vfs_ioopt to be enabled with a (temporary) config option.


# 34611 16-Mar-1998 dyson

Some VM improvements, including elimination of alot of Sig-11
problems. Tor Egge and others have helped with various VM bugs
lately, but don't blame him -- blame me!!!

pmap.c:
1) Create an object for kernel page table allocations. This
fixes a bogus allocation method previously used for such, by
grabbing pages from the kernel object, using bogus pindexes.
(This was a code cleanup, and perhaps a minor system stability
issue.)

pmap.c:
2) Pre-set the modify and accessed bits when prudent. This will
decrease bus traffic under certain circumstances.

vfs_bio.c, vfs_cluster.c:
3) Rather than calculating the beginning virtual byte offset
multiple times, stick the offset into the buffer header, so
that the calculated offset can be reused. (Long long multiplies
are often expensive, and this is a probably unmeasurable performance
improvement, and code cleanup.)

vfs_bio.c:
4) Handle write recursion more intelligently (but not perfectly) so
that it is less likely to cause a system panic, and is also
much more robust.

vfs_bio.c:
5) getblk incorrectly wrote out blocks that are incorrectly sized.
The problem is fixed, and writes blocks out ONLY when B_DELWRI
is true.

vfs_bio.c:
6) Check that already constituted buffers have fully valid pages. If
not, then make sure that the B_CACHE bit is not set. (This was
a major source of Sig-11 type problems.)

vfs_bio.c:
7) Fix a potential system deadlock due to an incorrectly specified
sleep priority while waiting for a buffer write operation. The
change that I made opens the system up to serious problems, and
we need to examine the issue of process sleep priorities.

vfs_cluster.c, vfs_bio.c:
8) Make clustered reads work more correctly (and more completely)
when buffers are already constituted, but not fully valid.
(This was another system reliability issue.)

vfs_subr.c, ffs_inode.c:
9) Create a vtruncbuf function, which is used by filesystems that
can truncate files. The vinvalbuf forced a file sync type operation,
while vtruncbuf only invalidates the buffers past the new end of file,
and also invalidates the appropriate pages. (This was a system reliabiliy
and performance issue.)

10) Modify FFS to use vtruncbuf.

vm_object.c:
11) Make the object rundown mechanism for OBJT_VNODE type objects work
more correctly. Included in that fix, create pager entries for
the OBJT_DEAD pager type, so that paging requests that might slip
in during race conditions are properly handled. (This was a system
reliability issue.)

vm_page.c:
12) Make some of the page validation routines be a little less picky
about arguments passed to them. Also, support page invalidation
change the object generation count so that we handle generation
counts a little more robustly.

vm_pageout.c:
13) Further reduce pageout daemon activity when the system doesn't
need help from it. There should be no additional performance
decrease even when the pageout daemon is running. (This was
a significant performance issue.)

vnode_pager.c:
14) Teach the vnode pager to handle race conditions during vnode
deallocations.


# 34577 14-Mar-1998 dyson

Disable the vfs.ioopt option for now, so that we don't get gratuitious
bugreports. I might not be able to fix the problems before 3.0, due
to other, more important things.


# 34568 14-Mar-1998 tegge

Don't misuse vnode interlocks in routines that can be called from interrupts.
PR: 5893


# 34266 08-Mar-1998 julian

Reviewed by: dyson@freebsd.org (john Dyson), dg@root.com (david greenman)
Submitted by: Kirk McKusick (mcKusick@mckusick.com)
Obtained from: WHistle development tree


# 34206 07-Mar-1998 dyson

This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifest themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistancy, and as a side-effect broke
the vfs.ioopt code. This code might have been committed seperately, but
almost everything is interrelated.

1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitious buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups assocated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistancy problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify it's status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and it's descendents to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.


# 33967 01-Mar-1998 dyson

Change vfs.ioopt default back to '0'.


# 33936 01-Mar-1998 dyson

1) Use a more consistent page wait methodology.
2) Do not unnecessarily force page blocking when paging
pages out.
3) Further improve swap pager performance and correctness,
including fixing the paging in progress deadlock (except
in severe I/O error conditions.)
4) Enable vfs_ioopt=1 as a default.
5) Fix and enable the page prezeroing in SMP mode.

All in all, SMP systems especially should show a significant
improvement in "snappyness."


# 33755 23-Feb-1998 dyson

Clean-up the vget mechanism by permanently attaching VM objects to
vnodes, therefore vget doesn't need to do so anymore. Other minor
improvements include the temp free vnode queue obeying the VAGE
flag and a printf that warns of to-be-removed code being executed.


# 33205 10-Feb-1998 kato

Fixed vnode interlock handling.

Reviewed by: Bruce Evans <bde@zeta.org.au>
Tor Egge <Tor.Egge@idi.ntnu.no>


# 33181 09-Feb-1998 eivind

Staticize.


# 33152 07-Feb-1998 kato

When the vp is lcoked, vget() calls vfs_object_create() with
waslocked = TRUE. This change may fix lockmgr panic in umapfs/nullfs.

PR: 5634
Reviewed by: "John S. Dyson" <toor@dyson.iquest.net>
Suggested by: Bruce Evans <bde@zeta.org.au>


# 33134 06-Feb-1998 eivind

Back out DIAGNOSTIC changes.


# 33109 05-Feb-1998 dyson

1) Start using a cleaner and more consistant page allocator instead
of the various ad-hoc schemes.
2) When bringing in UPAGES, the pmap code needs to do another vm_page_lookup.
3) When appropriate, set the PG_A or PG_M bits a-priori to both avoid some
processor errata, and to minimize redundant processor updating of page
tables.
4) Modify pmap_protect so that it can only remove permissions (as it
originally supported.) The additional capability is not needed.
5) Streamline read-only to read-write page mappings.
6) For pmap_copy_page, don't enable write mapping for source page.
7) Correct and clean-up pmap_incore.
8) Cluster initial kern_exec pagin.
9) Removal of some minor lint from kern_malloc.
10) Correct some ioopt code.
11) Remove some dead code from the MI swapout routine.
12) Correct vm_object_deallocate (to remove backing_object ref.)
13) Fix dead object handling, that had problems under heavy memory load.
14) Add minor vm_page_lookup improvements.
15) Some pages are not in objects, and make sure that the vm_page.c can
properly support such pages.
16) Add some more page deficit handling.
17) Some minor code readability improvements.


# 33108 04-Feb-1998 eivind

Turn DIAGNOSTIC into a new-style option.


# 32910 31-Jan-1998 tegge

Update freevnodes when adding a vnode to the head of the free list.


# 32724 24-Jan-1998 dyson

Add better support for larger I/O clusters, including larger physical
I/O. The support is not mature yet, and some of the underlying implementation
needs help. However, support does exist for IDE devices now.


# 32702 22-Jan-1998 dyson

VM level code cleanups.

1) Start using TSM.
Struct procs continue to point to upages structure, after being freed.
Struct vmspace continues to point to pte object and kva space for kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where
it allows us to avoid unneeded restart overhead during traversals, where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c
code. Use generation counts, get rid of unneded collpase operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.

This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)

This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)


# 32585 17-Jan-1998 dyson

Tie up some loose ends in vnode/object management. Remove an unneeded
config option in pmap. Fix a problem with faulting in pages. Clean-up
some loose ends in swap pager memory management.

The system should be much more stable, but all subtile bugs aren't fixed yet.


# 32456 12-Jan-1998 dyson

Fix another vnode leak.


# 32454 12-Jan-1998 dyson

Fix some vnode management problems, and better mgmt of vnode free list.
Fix the UIO optimization code.
Fix an assumption in vm_map_insert regarding allocation of swap pagers.
Fix an spl problem in the collapse handling in vm_object_deallocate.
When pages are freed from vnode objects, and the criteria for putting
the associated vnode onto the free list is reached, either put the
vnode onto the list, or put it onto an interrupt safe version of the
list, for further transfer onto the actual free list.
Some minor syntax changes changing pre-decs, pre-incs to post versions.
Remove a bogus timeout (that I added for debugging) from vn_lock.

PHK will likely still have problems with the vnode list management, and
so do I, but it is better than it was.


# 32320 07-Jan-1998 dyson

Disable io optimizations again, minor bug found, and will be fixed in
a few days.


# 32286 06-Jan-1998 dyson

Make our v_usecount vnode reference count work identically to the
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_object's are no longer freed gratuitiously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.

When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.

When vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.

A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.

Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon a reasonably possible.


# 32094 29-Dec-1997 dyson

Add the vnode interlock back around vref.


# 32072 29-Dec-1997 dyson

Fix the decl of vfs_ioopt, allow LFS to compile again, fix a minor problem
with the object cache removal.


# 32071 29-Dec-1997 dyson

Lots of improvements, including restructring the caching and management
of vnodes and objects. There are some metadata performance improvements
that come along with this. There are also a few prototypes added when
the need is noticed. Changes include:

1) Cleaning up vref, vget.
2) Removal of the object cache.
3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore.
4) Correct some missing LK_RETRY's in vn_lock.
5) Correct the page range in the code for msync.

Be gentle, and please give me feedback asap.


# 31853 19-Dec-1997 dyson

Some performance improvements, and code cleanups (including changing our
expensive OFF_TO_IDX to btoc whenever possible.)


# 31727 15-Dec-1997 wollman

Add support for poll(2) on files. vop_nopoll() now returns POLLNVAL
if one of the new poll types is requested; hopefully this will not break
any existing code. (This is done so that programs have a dependable
way of determining whether a filesystem supports the extended poll types
or not.)

The new poll types added are:

POLLWRITE - file contents may have been modified
POLLNLINK - file was linked, unlinked, or renamed
POLLATTRIB - file's attributes may have been changed
POLLEXTEND - file was extended

Note that the internal operation of poll() means that it is impossible
for two processes to reliably poll for the same event (this could
be fixed but may not be worth it), so it is not possible to rewrite
`tail -f' to use poll at this time.


# 31352 22-Nov-1997 bde

Staticized.


# 31132 12-Nov-1997 julian

Reviewed by: various.

Ever since I first say the way the mount flags were used I've hated the
fact that modes, and events, internal and exported, and short-term
and long term flags are all thrown together. Finally it's annoyed me enough..
This patch to the entire FreeBSD tree adds a second mount flag word
to the mount struct. it is not exported to userspace. I have moved
some of the non exported flags over to this word. this means that we now
have 8 free bits in the mount flags. There are another two that might
well move over, but which I'm not sure about.
The only user visible change would have been in pstat -v, except
that davidg has disabled it anyhow.
I'd still like to move the state flags and the 'command' flags
apart from each other.. e.g. MNT_FORCE really doesn't have the
same semantics as MNT_RDONLY, but that's left for another day.


# 31016 07-Nov-1997 phk

Remove a bunch of variables which were unused both in GENERIC and LINT.

Found by: -Wunused


# 30743 26-Oct-1997 phk

VFS interior redecoration.

Rename vn_default_error to vop_defaultop all over the place.
Move vn_bwrite from vfs_bio.c to vfs_default.c and call it vop_stdbwrite.
Use vop_null instead of nullop.
Move vop_nopoll from vfs_subr.c to vfs_default.c
Move vop_sharedlock from vfs_subr.c to vfs_default.c
Move vop_nolock from vfs_subr.c to vfs_default.c
Move vop_nounlock from vfs_subr.c to vfs_default.c
Move vop_noislocked from vfs_subr.c to vfs_default.c
Use vop_ebadf instead of *_ebadf.
Add vop_defaultop for getpages on master vnode in MFS.


# 30354 12-Oct-1997 phk

Last major round (Unless Bruce thinks of somthing :-) of malloc changes.

Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.

A couple of finer points by: bde


# 30309 11-Oct-1997 phk

Distribute and statizice a lot of the malloc M_* types.

Substantial input from: bde


# 30293 11-Oct-1997 phk

Dike out a weird warning.


# 29869 26-Sep-1997 phk

I lost a bit of my change in the last commit, this is more like it.
Noticed by: bde


# 29853 25-Sep-1997 phk

Reduce the target number of vnodes on the freelist from desiredvnodes
(usually a couple of thousand) to 25. The measured impact on cache-hits
doesn't justify spending memory this way:

Target number of free vnodes versus namecache hit rate in % during a
make world:
10 98.5316
200 98.5479
500 98.5546
1000 98.5709
3000 98.6006
4000 98.6126


# 29788 24-Sep-1997 phk

A couple of handles to tweak, more statistics.


# 29506 16-Sep-1997 bde

Fixed gratuitous ANSIisms.


# 29358 14-Sep-1997 peter

Provide a 'return true' poll vnode op rather than duplicating the
'do nothing' case all over the various filesystems.


# 29323 13-Sep-1997 peter

print correct function name in a panic (vop_nolock -> vop_sharedlock)


# 29208 07-Sep-1997 bde

Removed yet more vestiges of config-time swap configuration and/or
cleaned up nearby cruft.


# 29203 07-Sep-1997 bde

Removed vestiges of config-time "argument processing" configuration.


# 29076 03-Sep-1997 phk

Hmm, this is hopefully better.


# 29070 03-Sep-1997 phk

Revert the v_usecount handling in relation to VOP_INACTIVE.


# 29041 02-Sep-1997 bde

Removed unused #includes.


# 28954 31-Aug-1997 phk

Change the 0xdeadb hack to a flag called VDOOMED.
Introduce VFREE which indicates that vnode is on freelist.
Rename vholdrele() to vdrop().
Create vfree() and vbusy() to add/delete vnode from freelist.
Add vfree()/vbusy() to keep (v_holdcnt != 0 || v_usecount != 0)
vnodes off the freelist.
Generalize vhold()/v_holdcnt to mean "do not recycle".
Fix reassignbuf()s lack of use of vhold().
Use vhold() instead of checking v_cache_src list.
Remove vtouch(), the vnodes are always vget'ed soon enough
after for it to have any measuable effect.
Add sysctl debug.freevnodes to keep track of things.
Move cache_purge() up in getnewvnodes to avoid race.
Decrement v_usecount after VOP_INACTIVE(), put a vhold() on
it during VOP_INACTIVE()
Unmacroize vhold()/vdrop()
Print out VDOOMED and VFREE flags (XXX: should use %b)

Reviewed by: dyson


# 28795 26-Aug-1997 bde

Restored rev.1.92 which was clobbered by the previous commit.


# 28774 26-Aug-1997 dyson

Back out some incorrect changes that was worse than the original bug.


# 28558 22-Aug-1997 dyson

This is a trial improvement for the vnode reference count while on the vnode
free list problem. Also, the vnode age flag is no longer used by the
vnode pager. (It is actually incorrect to use then.) Constructive
feedback welcome -- just be kind.


# 28551 21-Aug-1997 bde

#include <machine/limits.h> explicitly in the few places that it is required.


# 28270 16-Aug-1997 wollman

Fix all areas of the system (or at least all those in LINT) to avoid storing
socket addresses in mbufs. (Socket buffers are the one exception.) A number
of kernel APIs needed to get fixed in order to make this happen. Also,
fix three protocol families which kept PCBs in mbufs to not malloc them
instead. Delete some old compatibility cruft while we're at it, and add
some new routines in the in_cksum family.


# 27892 04-Aug-1997 dyson

Fix a problem with the vfs vnode caching that it doesn't grow quickly
enough and can cause some strange performance problems. Specifically, at
or near startup time is when the problem is worst. To reproduce
the problem, run "lat_syscall stat" from the alpha lmbench code right
after bootup. A positive side effect of this mod is that the name
cache can be set to grow again by sysctl. A noticable positive
performance impact is realized due to a larger namecache being available
as needed (or tuned.)


# 27473 17-Jul-1997 dfr

Merge WebNFS support from NetBSD

Obtained from: NetBSD


# 26780 22-Jun-1997 dyson

Remove a window during running down a file vnode. Also, the OBJ_DEAD
flag wasn't being respected during vref(), et. al. Note that this
isn't the eventual fix for the locking problem. Fine grained SMP
in the VM and VFS code will require (lots) more work.


# 26533 10-Jun-1997 dg

Disabled the kern.vnode sysctl variable. It's causing system crashes on
large systems and needs to be re-thinked or removed wholesale.


# 25509 06-May-1997 phk

Fix a race condition that did, after all, exist.

Reviewed by: phk
Submitted by: dfr


# 25453 04-May-1997 phk

1. Add a {pointer, v_id} pair to the vnode to store the reference to the
".." vnode. This is cheaper storagewise than keeping it in the
namecache, and it makes more sense since it's a 1:1 mapping.

2. Also handle the case of "." more intelligently rather than stuff
the namecache with pointless entries.

3. Add two lists to the vnode and hang namecache entries which go from
or to this vnode. When cleaning a vnode, delete all namecache
entries it invalidates.

4. Never reuse namecache enties, malloc new ones when we need it, free
old ones when they die. No longer a hard limit on how many we can
have.

5. Remove the upper limit on namelength of namecache entries.

6. Make a global list for negative namecache entries, limit their number
to a sysctl'able (debug.ncnegfactor) fraction of the total namecache.
Currently the default fraction is 1/16th. (Suggestions for better
default wanted!)

7. Assign v_id correctly in the face of 32bit rollover.

8. Remove the LRU list for namecache entries, not needed. Remove the
#ifdef NCH_STATISTICS stuff, it's not needed either.

9. Use the vnode freelist as a true LRU list, also for namecache accesses.

10. Reuse vnodes more aggresively but also more selectively, if we can't
reuse, malloc a new one. There is no longer a hard limit on their
number, they grow to the point where we don't reuse potentially
usable vnodes. A vnode will not get recycled if still has pages in
core or if it is the source of namecache entries (Yes, this does
indeed work :-) "." and ".." are not namecache entries any longer...)

11. Do not overload the v_id field in namecache entries with whiteout
information, use a char sized flags field instead, so we can get
rid of the vpid and v_id fields from the namecache struct. Since
we're linked to the vnodes and purged when they're cleaned, we don't
have to check the v_id any more.

12. NFS knew about the limitation on name length in the namecache, it
shouldn't and doesn't now.

Bugs:
The namecache statistics no longer includes the hits for ".."
and "." hits.

Performance impact:
Generally in the +/- 0.5% for "normal" workstations, but
I hope this will allow the system to be selftuning over a
bigger range of "special" applications. The case where
RAM is available but unused for cache because we don't have
any vnodes should be gone.

Future work:
Straighten out the namecache statistics.

"desiredvnodes" is still used to (bogusly ?) size hash
tables in the filesystems.

I have still to find a way to safely free unused vnodes
back so their number can shrink when not needed.

There is a few uses of the v_id field left in the filesystems,
scheduled for demolition at a later time.

Maybe a one slot cache for unused namecache entries should
be implemented to decrease the malloc/free frequency.


# 25294 30-Apr-1997 dyson

Staticize an unnecessarily global function: vputrele.
Submitted by: Michael Hancock <michaelh@cet.co.jp>


# 25129 25-Apr-1997 peter

copyin the export network mask to the correct variable.

Submitted by: Mike Hibler <mike@marker.cs.utah.edu>, PR#3380


# 24624 04-Apr-1997 dfr

Add a function vop_sharedlock which a copy of vop_nolock without the
implementation #ifdef out. This can be used for now by NFS. As soon
as all the other filesystems' locking is fixed, this can go away.

Print the vnode address in vprint for easier debugging.


# 24487 01-Apr-1997 bde

Use OID_AUTO instead of magic number for the Lite2 sysctl debug.busyprt.

Removed declaration of vfs_unmountroot() again.

Staticized vgonel().


# 23389 05-Mar-1997 dg

Fixed splbio problems in vinvalbuf. Closes PR#2875, although fixed
differently by me.


# 23382 04-Mar-1997 bde

Attach vfs_sysctl() one level lower so that only the levels below
VFS_GENERIC aren't done in the FreeBSD way. The previous commit
broke the nfs sysctls.


# 23333 03-Mar-1997 bde

Merged Lite2's vfs_sysctl(). It doesn't fit very well into FreeBSD's
(phk's) sysctl framework, and I needed special code to disambiguate
the VFS_GENERIC node from the VFS_VFSCONF leaf, so I only converted
the leaves to the FreeBSD framework. The error handling isn't quite
right. CSRGS's sysctls seem to return ENOTDIR too much and FreeBSD's
sysctls don't agree with the man page.


# 23289 02-Mar-1997 bde

Restored some pre-Lite2-merge source-level compatibility to the mount()
and getvfsbyname() interfaces. The new interfaces are now hidden from
applications unless _NEW_VFSCONF is defined. The new vfsconf interfaces
don't work yet.


# 23254 02-Mar-1997 bde

Moved vfs sysctls to where Lite2 put them. No code changes yet.


# 23159 27-Feb-1997 bde

Fixed Lite2 merge of spechash simplelocking. It was misplaced in
checkalias() and missing in vfinddev() and vcount().


# 23149 27-Feb-1997 dyson

Fix the previous simple_lock fix breakage in the combined
vput/vrele routine. Fix a panic message. Fix the vop_nounlock
routine so that "special" filesystems that use it work correctly.


# 23145 27-Feb-1997 dyson

Fix the simple_lock problem with the physical I/O buffer code, and
also fix the missing simple_unlock in vrele, and improve vrele/vput
by merging them into one routine. BDE pointed these problems out.


# 23135 26-Feb-1997 bde

Fixed unmounting of the root fs. vfs_unmountroot() wasn't fully updated
to do Lite2 locking and vfs_unmountall() wasn't as simple as the Lite2
version.


# 23118 25-Feb-1997 bde

Merged some missing locking from Lite2:
- getnewvnode() and vref() were missing one simple_unlock() each.
- the Lite2 locking changes weren't merged at all in
printlockedvnodes() or sysctl_vnode(). Merging these undid
some KNF style regressions.


# 22975 22-Feb-1997 peter

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 22521 10-Feb-1997 dyson

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# 21770 16-Jan-1997 bde

Removed option EXTRAVNODES. All versions of FreeBSD-2.x have a sysctl
variable `kern.maxvnodes' which gives much better control over vnode
allocation than EXTRAVNODES (except in -current between 1995/10/28 and
1996/11/12, kern.maxvnodes was read-only and thus useless).


# 21673 14-Jan-1997 jkh

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 21002 29-Dec-1996 dyson

This commit is the embodiment of some VFS read clustering improvements.
Firstly, now our read-ahead clustering is on a file descriptor basis and not
on a per-vnode basis. This will allow multiple processes reading the
same file to take advantage of read-ahead clustering. Secondly, there
previously was a problem with large reads still using the ramp-up
algorithm. Of course, that was bogus, and now we read the entire
"chunk" off of the disk in one operation. The read-ahead clustering
algorithm should use less CPU than the previous also (I hope :-)).

NOTE: THAT LKMS MUST BE REBUILT!!!


# 19667 12-Nov-1996 bde

Restored writability of kern.maxvnodes. It was broken a year ago in
rev.1.29 of kern_sysctl.c.

Should be in 2.2.


# 19229 28-Oct-1996 phk

init_main.c: pass -d to init if DEVFS_ROOT
kern_conf.c: gd driver is a disk.
vfs_subr.c: include opt_devfs.h


# 18996 17-Oct-1996 jkh

I'm not sure why, but Netcon's TFS filesystem code doesn't want to
add free vnodes back to the freelist. They must do their own vnode
management. Anyway, this change is *only* activated with their filesystem
and doesn't affect anyone else. Whoops, forgot the submitted-by lines
in my previous commits too.. :-(
Submitted-By: Tony Ardolino <tony@netcon.com>


# 18973 17-Oct-1996 dyson

Clean up the rundown of the object backing a vnode. This should fix
NFS problems associated with forcible dismounts.


# 18527 28-Sep-1996 dyson

Correct vget by removing a window where a vnode can potentially go away.


# 18397 19-Sep-1996 nate

In sys/time.h, struct timespec is defined as:

/*
* Structure defined by POSIX.4 to be like a timeval.
*/
struct timespec {
time_t ts_sec; /* seconds */
long ts_nsec; /* and nanoseconds */
};

The correct names of the fields are tv_sec and tv_nsec.

Reminded by: James Drobina <jdrobina@infinet.com>


# 17761 21-Aug-1996 dyson

Even though this looks like it, this is not a complex code change.
The interface into the "VMIO" system has changed to be more consistant
and robust. Essentially, it is now no longer necessary to call vn_open
to get merged VM/Buffer cache operation, and exceptional conditions
such as merged operation of VBLK devices is simpler and more correct.

This code corrects a potentially large set of problems including the
problems with ktrace output and loaded systems, file create/deletes,
etc.

Most of the changes to NFS are cosmetic and name changes, eliminating
a layer of subroutine calls. The direct calls to vput/vrele have
been re-instituted for better cross platform compatibility.

Reviewed by: davidg


# 17605 15-Aug-1996 dyson

Certain vnode buffer list operations were not being spl protected,
and they needed to be. Brelse for example can be called at interrupt
level, and the buffer list operations were not being protected from it.


# 17349 30-Jul-1996 bde

Only use the special bdevvp() for DEVFS if DEVFS_ROOT is defined. This
makes option DEVFS safe to use again (although mounting devfs is unsafe).


# 17272 24-Jul-1996 phk

DEVFS needs a special bdevvp().


# 17122 12-Jul-1996 bde

Staticized a few variables.

Fixed warnings about unused variables.


# 16025 31-May-1996 peter

Add an option "EXTRA_VNODES" to cause an extra number of vnode structures
to be allocated at boot time. This is an expensive option, as they
consume physical ram and are not pageable etc. In certain situations,
this kind of option is quite useful, especially for news servers that
access a large number of directories at random and torture the name cache.
Defining 5000 or 10000 extra vnodes should cut down the amount of vnode
recycling somewhat, which should allow better name and directory caching
etc.

This is a "your mileage may vary" option, with no real indication of
what works best for your machine except trial and error. Too many will
cost you ram that you could otherwise use for disk buffers etc.

This is based on something John Dyson mentioned to me a while ago.


# 14425 09-Mar-1996 dyson

Put the "free vnode isn't" check back in the right place.


# 13490 19-Jan-1996 dyson

Eliminated many redundant vm_map_lookup operations for vm_mmap.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do alot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.


# 13228 04-Jan-1996 wollman

Convert DDB to new-style option.


# 13168 02-Jan-1996 dg

Moved the #ifdef DIAGNOSTIC in vrele() so that the check for negative
v_usecount is always performed and only the call to vprint is conditional.


# 12913 17-Dec-1995 phk

Staticize.
Unstaticize a function in scsi/scsi_base that was used, with an undocumented
option.
My last count on the LINT kernel shows:
Total symbols: 3647
unref symbols: 463
undef symbols: 4
1 ref symbols: 1751
2 ref symbols: 485
Approaching the pain threshold now.


# 12767 11-Dec-1995 dyson

Changes to support 1Tb filesizes. Pages are now named by an
(object,index) pair instead of (object,offset) pair.


# 12662 07-Dec-1995 dg

Untangled the vm.h include file spaghetti.


# 12650 06-Dec-1995 phk

A couple of minor tweaks to the sysctl stuff.


# 12577 02-Dec-1995 bde

Completed function declarations and/or added prototypes.


# 12519 29-Nov-1995 phk

A test was backwards.
Noticed by: Cheng, Hsiao-Yang <sycheng@cis.ufl.edu>


# 12429 20-Nov-1995 phk

Mega commit for sysctl.
Convert the remaining sysctl stuff to the new way of doing things.
the devconf stuff is the reason for the large number of files.
Cleaned up some compiler warnings while I were there.


# 12324 16-Nov-1995 bde

Fixed support for DIAGNOSTIC option. SYSCTL_INT() depends on kernel.h.


# 12283 14-Nov-1995 phk

Change some of the debug sysctl vars. The semantics of these will change.


# 12199 11-Nov-1995 bde

Fixed type of vfs_free_netcred().
Removed redundant declaration of insmntque().


# 12158 09-Nov-1995 bde

Introduced a type `vop_t' for vnode operation functions and used
it 1138 times (:-() in casts and a few more times in declarations.
This change is null for the i386.

The type has to be `typedef int vop_t(void *)' and not `typedef
int vop_t()' because `gcc -Wstrict-prototypes' warns about the
latter. Since vnode op functions are called with args of different
(struct pointer) types, neither of these function types is any use
for type checking of the arg, so it would be preferable not to use
the complete function type, especially since using the complete
type requires adding 1138 casts to avoid compiler warnings and
another 40+ casts to reverse the function pointer conversions before
calling the functions.


# 12136 07-Nov-1995 dyson

This is a modification missed by me in the msync fixes a few days ago.


# 11852 28-Oct-1995 bde

Call vfs_unbusy() before error returns from sysctl_vnode(). This fixes
PR 795.

Set the size before one error return from sysctl_vnode() the same as before
the other. The caller might want to know about the amount successfully
read although the current caller doesn't.


# 10275 25-Aug-1995 bde

Don't compile the diagnostic functions vhold() and holdrele() unless
DIAGNOSTIC is defined.


# 10027 11-Aug-1995 dg

Converted mountlist to a CIRCLEQ.

Partially obtained from: 4.4BSD-Lite2


# 9507 13-Jul-1995 dg

NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!

Much needed overhaul of the VM system. Included in this first round of
changes:

1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".

2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.

3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.

4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.

5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.

6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.

7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.

8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.

9) Some almost useless debugging code removed.

10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.

11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.

12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).

13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.

14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)

TODO:

1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.

2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.

3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.

4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.

5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).


# 9436 08-Jul-1995 dg

Improve negative usecount diagnostic a little.


# 9356 28-Jun-1995 dg

1) Converted v_vmdata to v_object.
2) Removed unnecessary vm_object_lookup()/pager_cache(object, TRUE) pairs
after vnode_pager_alloc() calls - the object is already guaranteed to be
persistent.
3) Removed some gratuitous casts.


# 9340 27-Jun-1995 bde

Pass the correct nonblocking flag to VOP_CLOSE() in vclean().
VOP_CLOSE() takes `F' (file) flags, not `IO' flags. At least that's
what close() passes. I previously fixed ttylclose() to check
FNONBLOCK instead of IO_NDELAY. This broke the call from vclean()
and cleaning of ptys sometimes deadlocked.


# 8692 21-May-1995 dg

Changes to fix the following bugs:

1) Files weren't properly synced on filesystems other than UFS. In some
cases, this lead to lost data. Most likely would be noticed on NFS.
The fix is to make the VM page sync/object_clean general rather than
in each filesystem.
2) Mixing regular and mmaped file I/O on NFS was very broken. It caused
chunks of files to end up as zeroes rather than the intended contents.
The fix was to fix several race conditions and to kludge up the
"b_dirtyoff" and "b_dirtyend" that NFS relies upon - paying attention
to page modifications that occurred via the mmapping.

Reviewed by: David Greenman
Submitted by: John Dyson


# 8465 12-May-1995 dg

Increased ratio of allowed vnodes on freelist to 1/4th of the total. This
is more representative of worst case situations of 4 files/directory. (If
that last sentence doesn't make any sense, I'm not surprised. It's rather
compilcated how this all fits together....).
This should fix a problem that Ed Hudson has been complaining about where
directories with lots of symlinks could cause excessive disk I/O.


# 7877 16-Apr-1995 dg

Changed #ifdef around printlockedvnodes() from DEBUG to DDB.


# 7694 09-Apr-1995 dg

Changes from John Dyson and myself:

Fixed remaining known bugs in the buffer IO and VM system.

vfs_bio.c:
Fixed some race conditions and locking bugs. Improved performance
by removing some (now) unnecessary code and fixing some broken
logic.
Fixed process accounting of # of FS outputs.
Properly handle NFS interrupts (B_EINTR).

(various)
Replaced calls to clrbuf() with calls to an optimized routine
call vfs_bio_clrbuf().

(various FS sync)
Sync out modified vnode_pager backed pages.

ffs_vnops.c:
Do two passes: Sync out file data first, then indirect blocks.

vm_fault.c:
Fixed deadly embrace caused by acquiring locks in the wrong order.

vnode_pager.c:
Changed to use buffer I/O system for writing out modified pages. This
should fix the problem with the modification date previous not getting
updated. Also dramatically simplifies the code. Note that this is
going to change in the future and be implemented via VOP_PUTPAGES().

vm_object.c:
Fixed a pile of bugs related to cleaning (vnode) objects. The performance
of vm_object_page_clean() is terrible when dealing with huge objects,
but this will change when we implement a binary tree to keep the object
pages sorted.

vm_pageout.c:
Fixed broken clustering of pageouts. Fixed race conditions and other
lockup style bugs in the scanning of pages. Improved performance.


# 7205 21-Mar-1995 dg

Fixed vinvalbuf() to work like NFS wants it to. The previous code wouldn't
flush pages in the vm object if V_SAVE was true.


# 7186 20-Mar-1995 dg

Don't gain/lose a reference to the object when yanking its pages in
vinvalbuf()...it will cause vnode locking problems in vm_object_terminate,
and isn't necessary anyway.


# 7181 20-Mar-1995 dg

Don't attempt to sync pages in the V_SAVE case of vinvalbuf; doing so can
lead to a deadlock. Just let the VM system deal with it.


# 7090 16-Mar-1995 bde

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# 7010 11-Mar-1995 dg

Added a comment.


# 6991 10-Mar-1995 dg

Reorganized an if() expression for efficiency.


# 6969 09-Mar-1995 phk

Clean up and improve the namecache.

1. We always keep one 16th of the vnodes on the freelist, so that the
namecache doesn't get trashed. It used to be that it wasn't a problem, but
the only vnodes getting released these days are directories and things which
Clean up and improve the namecache.

1. We always keep one 16th of the vnodes on the freelist, so that the
namecache doesn't get trashed. It used to be that it wasn't a problem, but
the only vnodes getting released these days are directories and things which
gets forced out of the VM/cache. The latter is not numerous enough to keep
the pool of vnodes needed for the namecache sufficiently big.

2. Purge invalid entries in the namecache as soon as we notice them. This
avoids a stale entry pushing out a valid entry on the LRU list.

3. Speed up the lookup in the namecache by avoid a special case branch.

4. Make the cache purge routines do the thing they're supposed to, and in
a decently efficient manner.

5. Make the size of the namecache follow the number of vnodes, so that we
can always point to all the vnodes we have in core.

6. Readability has gone way up.

7. Added a "options NCH_STATISTICS" feature that will gather more
detailed statistics on the performance of the namecache.

Reviewed by: davidg

(cvs is dumping core on me :-( )


# 6945 07-Mar-1995 dg

Put VAGE vnodes at the head of the free list.


# 6762 27-Feb-1995 dg

Backed out previous change. I forgot (for about the fourth time) that
v_rdev is a #define which is dereferenced through v_specinfo->si_rdev,
and that isn't initialized until later in checkalias().


# 6758 27-Feb-1995 dg

Initialize v_rdev in getnewvnode() - it appears that some filesystems
may not properly initialize this field in all cases, and this would
result in very anti-social behavior (overwriting on some other random
device/location).

Submitted by: John Dyson


# 6621 22-Feb-1995 dg

vfs_cluster.c:
Various more tweaks from John Dyson to improve read ahead calculations.

vfs_subr.c:
Only wakeup if numoutput is 0 in vwakeup().

Submitted by: John Dyson


# 5464 10-Jan-1995 dg

Fixed some formatting weirdness that I overlooked in the previous commit.


# 5455 09-Jan-1995 dg

These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.

The majority of the merged VM/cache work is by John Dyson.

The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.

vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.

vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.

vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.

vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.

vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.

pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.

vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.

proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.

swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.

machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.

machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.

ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.

Submitted by: John Dyson and David Greenman


# 5201 23-Dec-1994 dg

Protect vnode buffer chain manipulation with splbio to prevent list
corruption..


# 3396 06-Oct-1994 dg

Use tsleep() rather than sleep so that 'ps' is more informative about
the wait.


# 3374 05-Oct-1994 dg

Stuff object into v_vmdata rather than pager. Not important which at
the moment, but will be in the future. Other changes mostly cosmetic,
but are made for future VMIO considerations.

Submitted by: John Dyson


# 3308 02-Oct-1994 phk

All of this is cosmetic. prototypes, #includes, printfs and so on. Makes
GCC a lot more silent.


# 3098 25-Sep-1994 phk

While in the real world, I had a bad case of being swapped out for a lot of
cycles. While waiting there I added a lot of the extra ()'s I have, (I have
never used LISP to any extent). So I compiled the kernel with -Wall and
shut up a lot of "suggest you add ()'s", removed a bunch of unused var's
and added a couple of declarations here and there. Having a lap-top is
highly recommended. My kernel still runs, yell at me if you kernel breaks.


# 2384 29-Aug-1994 dg

"bogus" fixes from 1.1.5 to work around some cache coherency problems.


# 2250 24-Aug-1994 dg

Initialized v_writecount.


# 2220 22-Aug-1994 dg

print "BUSY" instead of error number if filesystem was busy during
vfs_unmountall() - this is the most common case. If it was a different
error, then print the error number.


# 2152 20-Aug-1994 dg

Implemented filesystem clean bit via:

machdep.c:
Changed printf's a little and call vfs_unmountall() if the sync was
successful.

cd9660_vfsops.c, ffs_vfsops.c, nfs_vfsops.c, lfs_vfsops.c:
Allow dismount of root FS. It is now disallowed at a higher level.

vfs_conf.c:
Removed unused rootfs global.

vfs_subr.c:
Added new routines vfs_unmountall and vfs_unmountroot. Filesystems
are now dismounted if the machine is properly rebooted.

ffs_vfsops.c:
Toggle clean bit at the appropriate places. Print warning if an
unclean FS is mounted.

ffs_vfsops.c, lfs_vfsops.c:
Fix bug in selecting proper flags for VOP_CLOSE().

vfs_syscalls.c:
Disallow dismounting root FS via umount syscall.


# 2112 18-Aug-1994 wollman

Fix up some sloppy coding practices:

- Delete redundant declarations.
- Add -Wredundant-declarations to Makefile.i386 so they don't come back.
- Delete sloppy COMMON-style declarations of uninitialized data in
header files.
- Add a few prototypes.
- Clean up warnings resulting from the above.

NB: ioconf.c will still generate a redundant-declaration warning, which
is unavoidable unless somebody volunteers to make `config' smarter.


# 1817 02-Aug-1994 dg

Added $Id$


# 1549 25-May-1994 rgrimes

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# 1542 24-May-1994 rgrimes

This commit was generated by cvs2svn to compensate for changes in r1541,
which included commits to RCS files with non-trunk default branches.


# 1541 24-May-1994 rgrimes

BSD 4.4 Lite Kernel Sources