History log of /freebsd-9.3-release/sys/ufs/ffs/ffs_vnops.c
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# 267654 19-Jun-2014 gjb

Copy stable/9 to releng/9.3 as part of the 9.3-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation

# 262780 05-Mar-2014 pfg

MFC r262678;
ufs: small formatting fixes.

Cleanup some extra space.
Use of tabs vs. spaces.
No functional change.

Reviewed by: mckusick


# 259224 11-Dec-2013 pfg

MFC r256448, r257029;

Make di_blocks unsigned in UFS1 as is the case already for UFS2.
Most of the code between UFS1 and UFS2 is shared so this change
is pretty safe. Not only this makes UFS1 and 2 consistent but it
also matches what NetBSD and MacOS X have for some years now.

UFS2: make di_extsize unsigned.
di_extsize is the EA size and as such it should be unsigned.
Adjust related types for consistency.

Reviewed by: mckusick


# 251897 18-Jun-2013 scottl

Merge the second part of the unmapped I/O changes. This enables the
infrastructure in the block layer and UFS filesystem as well as a few
drivers. The list of MFC revisions is long, so I won't quote changelogs.

r248508,248510,248511,248512,248514,248515,248516,248517,248518,
248519,248520,248521,248550,248568,248789,248790,249032,250936

Submitted by: kib
Approved by: kib
Obtained from: Netflix


# 251144 30-May-2013 scottl

MFC r248282, in a modified fashion. From the original changelog:

Add currently unused flag argument to the cluster_read(),
cluster_write() and cluster_wbuild() functions. The flags to be
allowed are a subset of the GB_* flags for getblk().

This merge adds a cluster_*_gb() API variant instead of changing the ABI
with an added argument to the existing API. Most API consumers that were
changed in the original rev have been left un-changed in this merge to
reduce churn. The mergeinfo is recorded though for future merging
convenience. This is effectively a no-op for the moment.

Submitted by: kib, FF
Obtained from: Netflix


# 240238 08-Sep-2012 kib

MFC r239065:
Stop including vm_param.h into vm_page.h. Include vm_param.h
explicitely for the kernel code which needs it.


# 239850 29-Aug-2012 kib

MFC r236322:
Enable vn_io_fault() lock avoidance for UFS.


# 237713 28-Jun-2012 kib

Fix unbounded-length malloc, controlled from usermode. The added check
is performed before exact size of the buffer is calculated, but the
buffer cannot have size greater then the total space allocated for
extended attributes. The existing check is executing with precise
size, but it is too late, since buffer needs to be allocated in
advance.

Also, adapt to uio_resid being of ssize_t type. Use lblktosize instead of
multiplying by fs block size by hand as well.


# 233654 29-Mar-2012 kib

MFC r232948:
Supply boolean as the second argument to ffs_update(), and not a
MNT_[NO]WAIT constants, which in fact always caused sync operation.


# 233630 28-Mar-2012 mckusick

MFC of 232351, 233438, and 233629

MFC reviewed by: kib

MFC 232351:

This change avoids a kernel deadlock on "snaplk" when using
snapshots on UFS filesystems running with journaled soft updates.
This is the first of several bugs that need to be fixed before
removing the restriction added in -r230250 to prevent the use
of snapshots on filesystems running with journaled soft updates.

The deadlock occurs when holding the snapshot lock (snaplk)
and then trying to flush an inode via ffs_update(). We become
blocked by another process trying to flush a different inode
contained in the same inode block that we need. It holds the
inode block for which we are waiting locked. When it tries to
write the inode block, it gets blocked waiting for the our
snaplk when it calls ffs_copyonwrite() to see if the inode
block needs to be copied in our snapshot.

The most obvious place that this deadlock arises is in the
ffs_copyonwrite() routine when it updates critical metadata
in a snapshot and tries to write it out before proceeding.
The fix here is to write the data and indirect block pointer
for the snapshot, but to skip the call to ffs_update() to
write the snapshot inode. To ensure that we will never have
to update a pointer in the inode itself, the ffs_snapshot()
routine that creates the snapshot has to ensure that all the
direct blocks are allocated as part of the creation of the
snapshot.

A less obvious place that this deadlock occurs is when we hold
the snaplk because we are deleting a snapshot. In the course of
doing the deletion, we need to allocate various soft update
dependency structures and allocate some journal space. If we
hit a resource limit while doing this we decrease the resources
in use by flushing out an existing dirty file to get it to give
up the soft dependency resources that it holds. The flush can
cause an ffs_update() to be done on the inode for the file that
we have selected to flush resulting in the same deadlock as
described above when the inode that we have chosen to flush
resides in the same inode block as the snapshot inode that we hold.
The fix is to defer cleaning up any time that the inode on which
we are operating is a snapshot.

Help and review by: Jeff Roberson
Tested by: Peter Holm

MFC 233438:

Add a third flags argument to ffs_syncvnode to avoid a possible conflict
with MNT_WAIT flags that passed in its second argument.

Discussed with: kib

MFC 233629:

A refinement of change 232351 to avoid a race with a forcible unmount.
While we have a snapshot vnode unlocked to avoid a deadlock with another
inode in the same inode block being updated, the filesystem containing
it may be forcibly unmounted. When that happens the snapshot vnode is
revoked. We need to check for that condition and fail appropriately.

Spotted by: kib
Reviewed by: kib


# 233439 24-Mar-2012 kib

MFC r232834:
In ffs_syncvnode(), pass boolean false as second argument of ffs_update().
Synchronous inode block update is not needed for MNT_LAZY callers (syncer),
and since waitfor values are not zero, code did unneccessary synchronous
update.


# 233353 23-Mar-2012 kib

MFC r231949:
Fix found places where uio_resid is truncated to int.

Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the
sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from
the usermode.

MFC r232493:
Remove unneeded cast to u_int. The values as small enough to fit into
int, beside the use of MIN macro which performs type promotions.

MFC r232494:
Instead of incomplete handling of read(2)/write(2) return values that
does not fit into registers, declare that we do not support this case
using CTASSERT(), and remove endianess-unsafe code to split return value
into td_retval.

While there, change the style of the sysctl debug.iosize_max_clamp
definition.

MFC r232495:
pipe_read(): change the type of size to int, and remove signed clamp.
pipe_write(): change the type of desiredsize back to int, its value fits.


# 232956 14-Mar-2012 kib

MFC r232833:
Remove not needed ARGSUSED lint command.


# 231953 20-Feb-2012 kib

MFC r231313 (by mckusick):
First attempt the uiomove() to the newly allocated (and dirty) buffer
and only zeros it if the uiomove() fails. The effect is to eliminate the
gratuitous zeroing of the buffer in the usual case where the uiomove()
successfully fills it.


# 225736 22-Sep-2011 kensmith

Copy head to stable/9 as part of 9.0-RELEASE release cycle.

Approved by: re (implicit)


# 224503 29-Jul-2011 mckusick

Update to -r224294 to ensure that only one of MNT_SUJ or MNT_SOFTDEP
is set so that mount can revert back to using MNT_NOWAIT when doing
getmntinfo.

Approved by: re (kib)


# 222958 10-Jun-2011 jeff

Implement fully asynchronous partial truncation with softupdates journaling
to resolve errors which can cause corruption on recovery with the old
synchronous mechanism.

- Append partial truncation freework structures to indirdeps while
truncation is proceeding. These prevent new block pointers from
becoming valid until truncation completes and serialize truncations.
- On completion of a partial truncate journal work waits for zeroed
pointers to hit indirects.
- softdep_journal_freeblocks() handles last frag allocation and last
block zeroing.
- vtruncbuf/ffs_page_remove moved into softdep_*_freeblocks() so it
is only implemented in one place.
- Block allocation failure handling moved up one level so it does not
proceed with buf locks held. This permits us to do more extensive
reclaims when filesystem space is exhausted.
- softdep_sync_metadata() is broken into two parts, the first executes
once at the start of ffs_syncvnode() and flushes truncations and
inode dependencies. The second is called on each locked buf. This
eliminates excessive looping and rollbacks.
- Improve the mechanism in process_worklist_item() that handles
acquiring vnode locks for handle_workitem_remove() so that it works
more generally and does not loop excessively over the same worklist
items on each call.
- Don't corrupt directories by zeroing the tail in fsck. This is only
done for regular files.
- Push a fsync complete record for files that need it so the checker
knows a truncation in the journal is no longer valid.

Discussed with: mckusick, kib (ffs_pages_remove and ffs_truncate parts)
Tested by: pho


# 221281 30-Apr-2011 kib

Fix typos.

Noted by: Fabian Keil <freebsd-listen fabiankeil de>
Pointy hat to: kib
MFC after: 1 week


# 221261 30-Apr-2011 kib

Clarify the comment.

MFC after: 1 week


# 209717 06-Jul-2010 jeff

- Handle the truncation of an inode with an effective link count of 0 in
the context of the process that reduced the effective count. Previously
all truncation as a result of unlink happened in the softdep flush
thread. This had the effect of being impossible to rate limit properly
with the journal code. Now the process issuing unlinks is suspended
when the journal files. This has a side-effect of improving rm
performance by allowing more concurrent work.
- Handle two cases in inactive, one for effnlink == 0 and another when
nlink finally reaches 0.
- Eliminate the SPACECOUNTED related code since the truncation is no
longer delayed.

Discussed with: mckusick


# 207728 06-May-2010 alc

Eliminate page queues locking around most calls to vm_page_free().


# 207669 05-May-2010 alc

Acquire the page lock around all remaining calls to vm_page_free() on
managed pages that didn't already have that lock held. (Freeing an
unmanaged page, such as the various pmaps use, doesn't require the page
lock.)

This allows a change in vm_page_remove()'s locking requirements. It now
expects the page lock to be held instead of the page queues lock.
Consequently, the page queues lock is no longer required at all by callers
to vm_page_rename().

Discussed with: kib


# 207662 05-May-2010 trasz

Move checking against RLIMIT_FSIZE into one place, vn_rlimit_fsize().

Reviewed by: kib


# 207141 24-Apr-2010 jeff

- Merge soft-updates journaling from projects/suj/head into head. This
brings in support for an optional intent log which eliminates the need
for background fsck on unclean shutdown.

Sponsored by: iXsystems, Yahoo!, and Juniper.
With help from: McKusick and Peter Holm


# 195265 01-Jul-2009 trasz

Don't panic on attempt to set ACL on a block device file.
This is just a part of kern/125613.

PR: kern/125613
Submitted by: Jaakko Heinonen <jh at saunalahti dot fi>
Reviewed by: rwatson
Approved by: re (kib)


# 195187 30-Jun-2009 kib

For SU mounts, softdep_fsync() might drop vnode lock, allowing other
threads to put dirty buffers on the vnode bufobj list. For regular files
and synchronous fsync requests, check for the condition and restart the
fsync vop if a new dirty buffer arrived.

Tested by: pho
Approved by: re (kensmith)
MFC after: 1 month


# 190469 27-Mar-2009 kib

Correct typo.

Noted by: kensmith


# 189737 12-Mar-2009 kib

The non-modifying EA VOPs are executed with only shared vnode lock taken.
Provide a custom lock around initializing and tearing down EA area,
to prevent both memory leaks and double-free of it. Count the number
of EA area accessors.

Lock protocol requires either holding exclusive vnode lock to modify
i_ea_area, or shared vnode lock and owning IN_EA_LOCKED flag in i_flag.

Noted by: YAMAMOTO, Taku <taku tackymt homeip net>
Tested by: pho (previous version)
MFC after: 2 weeks


# 187790 27-Jan-2009 rwatson

Following a fair amount of real world experience with ACLs and
extended attributes since FreeBSD 5, make the following semantic
changes:

- Don't update the inode modification time (mtime) when extended
attributes (and hence also ACLs) are added, modified, or removed.
- Don't update the inode access tie (atime) when extended attributes
(and hence also ACLs) are queried.

This means that rsync (and related tools) won't improperly think
that the data in the file has changed when only the ACL has changed.

Note that ffs_reallocblks() has not been changed to not update on an
IO_EXT transaction, but currently EAs don't use the cluster write
routines so this shouldn't be a problem. If EAs grow support for
clustering, then VOP_REALLOCBLKS() will need to grow a flag argument
to carry down IO_EXT to UFS.

MFC after: 1 week
PR: ports/125739
Reported by: Alexander Zagrebin <alexz@visp.ru>
Tested by: pluknet <pluknet@gmail.com>,
Greg Byshenk <freebsd@byshenk.net>
Discussed with: kib, kientzle, timur, Alexander Bokovoy <ab@samba.org>


# 187468 20-Jan-2009 kib

When extending inode size, we call vnode_pager_setsize(), to have a
address space where to put vnode pages, and then call UFS_BALLOC(),
to actually allocate new block and map it. When UFS_BALLOC() returns
error, sometimes we forget to revert the vm object size increase,
allowing for the pages that are not backed by the logical disk blocks.

Revert vnode_pager_setsize() back when UFS_BALLOC() failed, for
ffs_truncate() and ffs_write().

PR: 129956
Reviewed by: ups
MFC after: 3 weeks


# 184074 20-Oct-2008 kib

Assert that v_holdcnt is non-zero before entering lockmgr in vn_lock
and ffs_lock. This cannot catch situations where holdcnt is incremented
not by curthread, but I think it is useful.

Reviewed by: tegge, attilio
Tested by: pho
MFC after: 2 weeks


# 182721 03-Sep-2008 trasz

When calling extattr_check_cred, use V{READ,WRITE}, not I{READ,WRITE}.

Approved by: rwatson (mentor)


# 177779 31-Mar-2008 jeff

- Since rev 1.142 of ffs_snapshot.c the interlock has not been required
to protect the v_lock pointer. Removing the interlock acquisition
here allows vn_lock() to proceed without requiring the interlock
at all.
- If the lock mutated while we were sleeping on it the interlock has
been dropped. It is conceivable that the upper layer code was
relying on the interlock and LK_NOWAIT to protect the identity or
state of the vnode while acquiring the lock. In this case return
EBUSY rather than trying the new lock to prevent potential races.

Reviewed by: tegge


# 177493 22-Mar-2008 jeff

- Complete part of the unfinished bufobj work by consistently using
BO_LOCK/UNLOCK/MTX when manipulating the bufobj.
- Create a new lock in the bufobj to lock bufobj fields independently.
This leaves the vnode interlock as an 'identity' lock while the bufobj
is an io lock. The bufobj lock is ordered before the vnode interlock
and also before the mnt ilock.
- Exploit this new lock order to simplify softdep_check_suspend().
- A few sync related functions are marked with a new XXX to note that
we may not properly interlock against a non-zero bv_cnt when
attempting to sync all vnodes on a mountlist. I do not believe this
race is important. If I'm wrong this will make these locations easier
to find.

Reviewed by: kib (earlier diff)
Tested by: kris, pho (earlier diff)


# 177474 21-Mar-2008 kib

Reduce the acquisition of the vnode interlock in the ffs_read() and
ffs_extread() when setting the IN_ACCESS flag by checking whether the
IN_ACCESS is already set. The possible race there is admissible.

Tested by: pho
Submitted by: jeff


# 176564 25-Feb-2008 keramida

Minor typo nit.


# 176320 15-Feb-2008 attilio

- Introduce lockmgr_args() in the lockmgr space. This function performs
the same operation of lockmgr() but accepting a custom wmesg, prio and
timo for the particular lock instance, overriding default values
lkp->lk_wmesg, lkp->lk_prio and lkp->lk_timo.
- Use lockmgr_args() in order to implement BUF_TIMELOCK()
- Cleanup BUF_LOCK()
- Remove LK_INTERNAL as it is nomore used in the lockmgr namespace

Tested by: Andrea Barberio <insomniac at slackware dot it>


# 175635 24-Jan-2008 attilio

Cleanup lockmgr interface and exported KPI:
- Remove the "thread" argument from the lockmgr() function as it is
always curthread now
- Axe lockcount() function as it is no longer used
- Axe LOCKMGR_ASSERT() as it is bogus really and no currently used.
Hopefully this will be soonly replaced by something suitable for it.
- Remove the prototype for dumplockinfo() as the function is no longer
present

Addictionally:
- Introduce a KASSERT() in lockstatus() in order to let it accept only
curthread or NULL as they should only be passed
- Do a little bit of style(9) cleanup on lockmgr.h

KPI results heavilly broken by this change, so manpages and
FreeBSD_version will be modified accordingly by further commits.

Tested by: matteo


# 175294 13-Jan-2008 attilio

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


# 175053 01-Jan-2008 obrien

style(9)


# 173464 08-Nov-2007 obrien

Turn most ffs 'DIAGNOSTIC's into INVARIANTS.


# 171437 13-Jul-2007 rodrigc

Perform range check before allocating memory when reading
extended attributes.

Reviewed by: kib
Approved by: re (hrs)
PR: 114389


# 170587 11-Jun-2007 rwatson

Eliminate now-unused SUSER_ALLOWJAIL arguments to priv_check_cred(); in
some cases, move to priv_check() if it was an operation on a thread and
no other flags were present.

Eliminate caller-side jail exception checking (also now-unused); jail
privilege exception code now goes solely in kern_jail.c.

We can't yet eliminate suser() due to some cases in the KAME code where
a privilege check is performed and then used in many different deferred
paths. Do, however, move those prototypes to priv.h.

Reviewed by: csjp
Obtained from: TrustedBSD Project


# 169671 18-May-2007 kib

Since renaming of vop_lock to _vop_lock, pre- and post-condition
function calls are no more generated for vop_lock.
Rename _vop_lock to vop_lock1 to satisfy tools/vnode_if.awk assumption
about vop naming conventions. This restores pre/post-condition calls.


# 168353 04-Apr-2007 delphij

Use *_EMPTY macros when appropriate.


# 167719 19-Mar-2007 brian

When we write extended attributes, assert that the inode hasn't
already been deleted. The assertion is important to show that
we won't end up accounting for extended attribute blocks (using
fs_pendingblocks) in our subsequent call to fs_alloc().

Agreed verbally by: mckusick

MFC after: 3 weeks


# 167155 01-Mar-2007 pjd

Fix build breakage.


# 167152 01-Mar-2007 pjd

Rename PRIV_VFS_CLEARSUGID to PRIV_VFS_RETAINSUGID, which seems to better
describe the privilege.

OK'ed by: rwatson


# 167151 01-Mar-2007 pjd

Avoid checking for privileges if there is no need to.

Discussed with: rwatson


# 166864 21-Feb-2007 mckusick

The functions that set and delete external attributes must check
that the filesystem is not mounted read-only before proceeding.

Reported by: Ryan Beasley <ryanb@FreeBSD.org>
MFC after: 1 week


# 166774 15-Feb-2007 pjd

Move vnode-to-file-handle translation from vfs_vptofh to vop_vptofh method.
This way we may support multiple structures in v_data vnode field within
one file system without using black magic.

Vnode-to-file-handle should be VOP in the first place, but was made VFS
operation to keep interface as compatible as possible with SUN's VFS.
BTW. Now Solaris also implements vnode-to-file-handle as VOP operation.

VFS_VPTOFH() was left for API backward compatibility, but is marked for
removal before 8.0-RELEASE.

Approved by: mckusick
Discussed with: many (on IRC)
Tested with: ufs, msdosfs, cd9660, nullfs and zfs


# 164248 13-Nov-2006 kmacy

change vop_lock handling to allowing tracking of callers' file and line for
acquisition of lockmgr locks

Approved by: scottl (standing in for mentor rwatson)


# 164033 06-Nov-2006 rwatson

Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges. These may
require some future tweaking.

Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>


# 163194 10-Oct-2006 kib

Do not translate the IN_ACCESS inode flag into the IN_MODIFIED while filesystem
is suspending/suspended. Doing so may result in deadlock. Instead, set the
(new) IN_LAZYACCESS flag, that becomes IN_MODIFIED when suspend is lifted.

Change the locking protocol in order to set the IN_ACCESS and timestamps
without upgrading shared vnode lock to exclusive (see comments in the
inode.h). Before that, inode was modified while holding only shared
lock.

Tested by: Peter Holm
Reviewed by: tegge, bde
Approved by: pjd (mentor)
MFC after: 3 weeks


# 158321 05-May-2006 tegge

Avoid locking overhead when snapshots are disabled.


# 158259 02-May-2006 tegge

Close a race when VOP_LOCK() on a snapshot file is attempted at the
same time as it is changed back into a normal file. The locker would
get the shared "snaplk" lock which would no longer be the correct lock
for the vnode.


# 151184 09-Oct-2005 tegge

Adjust totread argument passed to cluster_read() to account for offset not
being block aligned.


# 147198 09-Jun-2005 ssouhlal

Allow EVFILT_VNODE events to work on every filesystem type, not just
UFS by:
- Making the pre and post hooks for the VOP functions work even when
DEBUG_VFS_LOCKS is not defined.
- Moving the KNOTE activations into the corresponding VOP hooks.
- Creating a MNTK_NOKNOTE flag for the mnt_kern_flag field of struct
mount that permits filesystems to disable the new behavior.
- Creating a default VOP_KQFILTER function: vfs_kqfilter()

My benchmarks have not revealed any performance degradation.

Reviewed by: jeff, bde
Approved by: rwatson, jmg (kqueue changes), grehan (mentor)


# 144373 31-Mar-2005 jeff

- Set LK_NOSHARE for snapshot locks. snapshots require exclusive only
access.
- Remove the hack from ffs_lock() to implement LK_NOSHARE in a ffs
specific way.

Sponsored by: Isilon Systems, Inc.


# 143504 13-Mar-2005 jeff

- It is not legal to access v_data without the vnode lock or interlock
held. Grab the vnode interlock if LK_INTERLOCK has not been passed in
so that we can inspect v_data in ffs_lock().

Sponsored by: Isilon Systems, Inc.


# 141542 08-Feb-2005 phk

Split the vop_vector for ffs1 and ffs2, this is mostly for the different
EXTATTR support.


# 141533 08-Feb-2005 phk

Drag another softupdates tentacle back into FFS: Now that FFS's
vop_fsync is separate from the internal use we can do the full job
there.


# 141526 08-Feb-2005 phk

Don't use the UFS_* and VFS_* functions where a direct call is possble.

The UFS_ functions are for UFS to call back into VFS. The VFS functions
are external entry points into the filesystem.


# 141525 08-Feb-2005 phk

(forced commit to record correct commit message)

Split ffs_fsync() into a VOP_FSYNC() component and an internal part
called ffs_syncvnode().

Eliminate unnecessary thread argument and XXX'ed curthread passes
for same. Reduce softdep_sync_metadata() from a struct vop_fsync_args
to just the vnode argument it needs.

Convert internal VOP_FSYNC() calls to use ffs_syncvnode().


# 141522 08-Feb-2005 phk

For snapshots we need all VOP_LOCKs to be exclusive.

The "business class upgrade" was implemented in UFS's VOP_LOCK
implementation ufs_lock() which is the wrong layer, so move it to
ffs_lock().

Also, as long as we have not abandonned advanced vfs-stacking we
should not preclude it from happening: instead of implementing a
copy locally, use the VOP_LOCK_APV(&ufs) to correctly arrive at
vop_stdlock() at the bottom.


# 141521 08-Feb-2005 phk

For snapshots we need all VOP_LOCKs to be exclusive.

The "business class upgrade" was implemented in UFS's VOP_LOCK
implementation ufs_lock() which is the wrong layer, so move it to
ffs_lock().

Also, as long as we have not abandonned advanced vfs-stacking we
should not preclude it from happening: instead of implementing a
copy locally, use the VOP_LOCK_APV(&ufs) to correctly arrive at
vop_stdlock() at the bottom.


# 141520 08-Feb-2005 phk

Use VOP_STRATEGY_APV() instead of direct dereference, this is more
correct.


# 140707 24-Jan-2005 jeff

- Remove GIANT_REQUIRED where giant is no longer required.

Sponsored By: Isilon Systems, Inc.


# 139825 07-Jan-2005 imp

/* -> /*- for license, minor formatting changes


# 138869 14-Dec-2004 phk

white space


# 138290 01-Dec-2004 phk

Back when VOP_* was introduced, we did not have new-style struct
initializations but we did have lofty goals and big ideals.

Adjust to more contemporary circumstances and gain type checking.

Replace the entire vop_t frobbing thing with properly typed
structures. The only casualty is that we can not add a new
VOP_ method with a loadable module. History has not given
us reason to belive this would ever be feasible in the the
first place.

Eliminate in toto VOCALL(), vop_t, VNODEOP_SET() etc.

Give coda correct prototypes and function definitions for
all vop_()s.

Generate a bit more data from the vnode_if.src file: a
struct vop_vector and protype typedefs for all vop methods.

Add a new vop_bypass() and make vop_default be a pointer
to another struct vop_vector.

Remove a lot of vfs_init since vop_vector is ready to use
from the compiler.

Cast various vop_mumble() to void * with uppercase name,
for instance VOP_PANIC, VOP_NULL etc.

Implement VCALL() by making vdesc_offset the offsetof() the
relevant function pointer in vop_vector. This is disgusting
but since the code is generated by a script comparatively
safe. The alternative for nullfs etc. would be much worse.

Fix up all vnode method vectors to remove casts so they
become typesafe. (The bulk of this is generated by scripts)


# 138270 01-Dec-2004 phk

Mechanically change prototypes for vnode operations to use the new typedefs.


# 137846 18-Nov-2004 jeff

- Eliminate the acquisition and release of the bqlock in bremfree() by
setting the B_REMFREE flag in the buf. This is done to prevent lock order
reversals with code that must call bremfree() with a local lock held.
This also reduces overhead by removing two lock operations per buf for
fsync() and similar.
- Check for the B_REMFREE flag in brelse() and bqrelse() after the bqlock
has been acquired so that we may remove ourself from the free-list.
- Provide a bremfreef() function to immediately remove a buf from a
free-list for use only by NFS. This is done because the nfsclient code
overloads the b_freelist queue for its own async. io queue.
- Simplify the numfreebuffers accounting by removing a switch statement
that executed the same code in every possible case.
- getnewbuf() can encounter locked bufs on free-lists once Giant is removed.
Remove a panic associated with this condition and delay asserts that
inspect the buf until after it is locked.

Reviewed by: phk
Sponsored by: Isilon Systems, Inc.


# 137035 29-Oct-2004 phk

Move UFS from DEVFS backing to GEOM backing.

This eliminates a bunch of vnode overhead (approx 1-2 % speed
improvement) and gives us more control over the access to the storage
device.

Access counts on the underlying device are not correctly tracked and
therefore it is possible to read-only mount the same disk device multiple
times:
syv# mount -p
/dev/md0 /var ufs rw 2 2
/dev/ad0 /mnt ufs ro 1 1
/dev/ad0 /mnt2 ufs ro 1 1
/dev/ad0 /mnt3 ufs ro 1 1

Since UFS/FFS is not a synchrousely consistent filesystem (ie: it caches
things in RAM) this is not possible with read-write mounts, and the system
will correctly reject this.

Details:

Add a geom consumer and a bufobj pointer to ufsmount.

Eliminate the vnode argument from softdep_disk_prewrite().
Pick the vnode out of bp->b_vp for now. Eventually we
should find it through bp->b_bufobj->b_private.

In the mountcode, use g_vfs_open() once we have used
VOP_ACCESS() to check permissions.

When upgrading and downgrading between r/o and r/w do the
right thing with GEOM access counts. Remove all the
workarounds for not being able to do this with VOP_OPEN().

If we are the root mount, drop the exclusive access count
until we upgrade to r/w. This allows fsck of the root
filesystem and the MNT_RELOAD to work correctly.

Set bo_private to the GEOM consumer on the device bufobj.

Change the ffs_ops->strategy function to call g_vfs_strategy()

In ufs_strategy() directly call the strategy on the disk
bufobj. Same in rawread.

In ffs_fsync() we will no longer see VCHR device nodes, so
remove code which synced the filesystem mounted on it, in
case we came there. I'm not sure this code made sense in
the first place since we would have taken the specfs route
on such a vnode.

Redo the highly bogus readblock() function in the snapshot
code to something slightly less bogus: Constructing an uio
and using physio was really quite a detour. Instead just
fill in a bio and ship it down.


# 136988 27-Oct-2004 phk

Eliminate unnecessary KASSERTS.


# 136943 25-Oct-2004 phk

Loose the v_dirty* and v_clean* alias macros.

Check the count field where we just want to know the full/empty state,
rather than using TAILQ_EMPTY() or TAILQ_FIRST().


# 136751 21-Oct-2004 phk

Move the VI_BWAIT flag into no bo_flag element of bufobj and call it BO_WWAIT

Add bufobj_wref(), bufobj_wdrop() and bufobj_wwait() to handle the write
count on a bufobj. Bufobj_wdrop() replaces vwakeup().

Use these functions all relevant places except in ffs_softdep.c where
the use if interlocked_sleep() makes this impossible.

Rename b_vnbufs to b_bobufs now that we touch all the relevant files anyway.


# 136721 20-Oct-2004 rwatson

Explicitly break out NETA license from Berkeley license to clearly
indicate license grant, as well as to indicate that NETA is asserting
only two clauses, not four clauses.

Requested by: imp


# 135877 28-Sep-2004 phk

Remove support for accessing device nodes in UFS/FFS.

Device nodes can still be created and exported with NFS.


# 135858 27-Sep-2004 phk

Give cluster_write() an explicit vnode argument.

In the future a struct buf will not automatically point out a vnode for us.


# 135459 19-Sep-2004 phk

The getpages VOP was a good stab at getting scatter/gather I/O without
too much kernel copying, but it is not the right way to do it, and it is
in the way for straightening out the buffer cache.

The right way is to pass the VM page array down through the struct
bio to the disk device driver and DMA directly in to/out off the
physical memory. Once the VM/buf thing is sorted out it is next on
the list.

Retire most of vnode method. ffs_getpages(). It is not clear if what is
left shouldn't be in the default implementation which we now fall back to.

Retire specfs_getpages() as well, as it has no users now.


# 133741 15-Aug-2004 jmg

Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers. Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.

Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks. Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by: green, rwatson (both earlier versions)


# 132775 28-Jul-2004 kan

Avoid using casts as lvalues. Introduce DIP_SET macro which sets proper
inode field based on UFS version. Use DIP ro read values and DIP_SET
to modify them throughout FFS code base.


# 132653 26-Jul-2004 cperciva

Rename suser_cred()'s PRISON_ROOT flag to SUSER_ALLOWJAIL. This is
somewhat clearer, but more importantly allows for a consistent naming
scheme for suser_cred flags.

The old name is still defined, but will be removed in a few days (unless I
hear any complaints...)

Discussed with: rwatson, scottl
Requested by: jhb


# 129545 21-May-2004 kensmith

Upon further review it was decided this piece of the msync(2)
fixes was applicable to HEAD, originally it was thought this
should only be done in RELENG_4. Implement IO_INVAL in the vnode
op for writing by marking the buffer as "no cache". This fix
has already been applied to RELENG_4 as Rev. 1.65.2.15 of
ufs/ufs/ufs_readwrite.c.

Reviewed by: alc, tegge


# 128006 07-Apr-2004 bde

Record where half the bits in this file came from (from ufs_readwrite.c).
Damage to history from moving bits was especially large since a repo copy
is not feasible for partial files.


# 127975 07-Apr-2004 imp

Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and irc message from Robert
Watson saying that clause 3 can be removed from those files with an
NAI copyright that also have only a University of California
copyrights.

Approved by: core, rwatson


# 125710 11-Feb-2004 bde

Removed more vestiges of vfs_ioopt:
- rev.1.42 of ffs_readwrite.c added a special case in ffs_read() for reads
that are initially at EOF, and rev.1.62 of ufs_readwrite.c fixed
timestamp bugs in it. Removal of most of vfs_ioopt made it just and
optimization, and removal of the vm object reference calls made it less
than an optimization. It was cloned in rev.1.94 of ufs_readwrite.c as
part of cloning ffs_extwrite() although it was always less than an
optimization in ffs_extwrite().
- some comments, compound statements and vertical whitespace were vestiges
of dead code.


# 125454 04-Feb-2004 jhb

Locking for the per-process resource limits structure.
- struct plimit includes a mutex to protect a reference count. The plimit
structure is treated similarly to struct ucred in that is is always copy
on write, so having a reference to a structure is sufficient to read from
it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
limits from a process to keep the limit structure from changing out from
under you while reading from it.
- Various global limits that are ints are not protected by a lock since
int writes are atomic on all the archs we support and thus a lock
wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
either an rlimit, or the current or max individual limit of the specified
resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
(it didn't used the stackgap when it should have) but uses lim_rlimit()
and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits. It
also no longer uses the stackgap for accessing sysctl's for the
ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result,
ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.

Submitted by: mtm (mostly, I only did a few cleanups and catchups)
Tested on: i386
Compiled on: alpha, amd64


# 125259 31-Jan-2004 alc

Remove unnecessary vm object reference and deallocate calls from ffs_read()
and ffs_write(). These calls trace their origins to the dead vfs_ioopt
code, first appearing in revision 1.39 of ufs_readwrite.c.

Observed by: bde
Discussed with: tegge


# 125079 27-Jan-2004 ache

Turn uio_resid/uio_offset comments into KASSERTs

Reviewed by: bde


# 124857 23-Jan-2004 ache

Copy comment about caller check from ffs_read to ffs_extread, don't
check for uio_resid < 0 here too.


# 124856 23-Jan-2004 ache

Fix various panic() strings to reflect true function name to allow
easy grep.
Small code reorganization to look more logic.
Copy ffs_write check from prev. commit to ffs_extwrite.


# 124855 23-Jan-2004 ache

ffs_read:
Replace wrong check returned EFBIG with EOVERFLOW handling from POSIX:

36708 [EOVERFLOW] The file is a regular file, nbyte is greater than 0, the
starting position is before the end-of-file, and the starting position is
greater than or equal to the offset maximum established in the open file
description associated with fildes.

ffs_write:
Replace u_int64_t cast with uoff_t cast which is more natural for types
used.

ffs_write & ffs_read:
Remove uio_offset and uio_resid checks for negative values, the caller
supposed to do it already. Add comments about it.

Reviewed by: bde


# 124728 19-Jan-2004 kan

Spell magic '16' number as IO_SEQSHIFT.


# 120763 04-Oct-2003 alc

Synchronize access to a vm page's valid field using the containing
vm object's lock.


# 118607 07-Aug-2003 jhb

Consistently use the BSD u_int and u_short instead of the SYSV uint and
ushort. In most of these files, there was a mixture of both styles and
this change just makes them self-consistent.

Requested by: bde (kern_ktrace.c)


# 118131 28-Jul-2003 rwatson

Rename VOP_RMEXTATTR() to VOP_DELETEEXTATTR() for consistency with the
kernel ACL interfaces and system call names.

Break out UFS2 and FFS extattr delete and list vnode operations from
setextattr and getextattr to deleteextattr and listextattr, which
cleans up the implementations, and makes the results more readable,
and makes the APIs more clear.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 116423 15-Jun-2003 alc

Lock the vm object when freeing pages.


# 116412 15-Jun-2003 phk

Add the same KASSERT to all VOP_STRATEGY and VOP_SPECSTRATEGY implementations
to check that the buffer points to the correct vnode.


# 116192 11-Jun-2003 obrien

Use __FBSDID().


# 115869 05-Jun-2003 rwatson

Implement ffs_listextattr() by breaking out that logic and special-cased
attribute name of "" from ffs_getextattr(). Invoking VOP_GETETATTR()
with an empty name is now no longer supported; user application
compatibility is provided by a system call level compatibility
wrapper. We make sure to explicitly reject attempts to set an EA
with the name "".

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 115588 01-Jun-2003 rwatson

Return EOPNOTSUPP for attempted EA operations on VCHR vnodes in UFS2;
if we permit them to occur, the kernel panics due to our performing
EA operations using VOP_STRATEGY on the vnode. This went unnoticed
previously because there are very for users of device nodes on UFS2
due to the introduction of devfs. However, this can come up with
the Linux compat directories and its hard-coded dev nodes (which will
need to go away as we move away from hard-coded device numbers).
This can come up if you use EA-intensive features such as ACLs and
MAC.

The proper fix is pretty complicated, but this band-aid would be
an excellent MFC candidate for the release.


# 115474 31-May-2003 phk

Remove unused local variables.

Found by: FlexeLint


# 115456 31-May-2003 phk

The IO_NOWDRAIN and B_NOWDRAIN hacks are no longer needed to prevent
deadlocks with vnode backed md(4) devices because md now uses a
kthread to run the bio requests instead of doing it directly from
the bio down path.


# 114599 03-May-2003 alc

Lock the vm_object on entry to vm_object_vndeallocate().


# 114216 29-Apr-2003 kan

Deprecate machine/limits.h in favor of new sys/limits.h.
Change all in-tree consumers to include <sys/limits.h>

Discussed on: standards@
Partially submitted by: Craig Rodrigues <rodrigc@attbi.com>


# 112694 26-Mar-2003 tegge

Add support for reading directly from file to userland buffer when the
O_DIRECT descriptor status flag is set and both offset and length is a
multiple of the physical media sector size.


# 112181 13-Mar-2003 jeff

- Remove a race between fsync like functions and flushbufqueues() by
requiring locked bufs in vfs_bio_awrite(). Previously the buf could
have been written out by fsync before we acquired the buf lock if it
weren't for giant. The cluster_wbuild() handles this race properly but
the single write at the end of vfs_bio_awrite() would not.
- Modify flushbufqueues() so there is only one copy of the loop. Pass a
parameter in that says whether or not we should sync bufs with deps.
- Call flushbufqueues() a second time and then break if we couldn't find
any bufs without deps.


# 111937 06-Mar-2003 alc

Remove ENABLE_VFS_IOOPT. It is a long unfinished work-in-progress.

Discussed on: arch@


# 111463 25-Feb-2003 jeff

- Add an interlock argument to BUF_LOCK and BUF_TIMELOCK.
- Remove the buftimelock mutex and acquire the buf's interlock to protect
these fields instead.
- Hold the vnode interlock while locking bufs on the clean/dirty queues.
This reduces some cases from one BUF_LOCK with a LK_NOWAIT and another
BUF_LOCK with a LK_TIMEFAIL to a single lock.

Reviewed by: arch, mckusick


# 111119 19-Feb-2003 imp

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 110584 09-Feb-2003 jeff

- Cleanup unlocked accesses to buf flags by introducing a new b_vflag member
that is protected by the vnode lock.
- Move B_SCANNED into b_vflags and call it BV_SCANNED.
- Create a vop_stdfsync() modeled after spec's sync.
- Replace spec_fsync, msdos_fsync, and hpfs_fsync with the stdfsync and some
fs specific processing. This gives all of these filesystems proper
behavior wrt MNT_WAIT/NOWAIT and the use of the B_SCANNED flag.
- Annotate the locking in buf.h


# 109623 21-Jan-2003 alfred

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 105422 18-Oct-2002 dillon

Fix a file-rewrite performance case for UFS[2]. When rewriting portions
of a file in chunks that are less then the filesystem block size, if the
data is not already cached the system will perform a read-before-write.
The problem is that it does this on a block-by-block basis, breaking up the
I/Os and making clustering impossible for the writes. Programs such
as INN using cyclic file buffers suffer greatly. This problem is only going
to get worse as we use larger and larger filesystem block sizes.

The solution is to extend the sequential heuristic so UFS[2] can perform
a far larger read and readahead when dealing with this case.

(note: maximum disk write bandwidth is 27MB/sec thru filesystem)
(note: filesystem blocksize in test is 8K (1K frag))
dd if=/dev/zero of=test.dat bs=1k count=2m conv=notrunc

Before: (note half of these are reads)
tty da0 da1 acd0 cpu
tin tout KB/t tps MB/s KB/t tps MB/s KB/t tps MB/s us ni sy in id
0 76 14.21 598 8.30 0.00 0 0.00 0.00 0 0.00 0 0 7 1 92
0 76 14.09 813 11.19 0.00 0 0.00 0.00 0 0.00 0 0 9 5 86
0 76 14.28 821 11.45 0.00 0 0.00 0.00 0 0.00 0 0 8 1 91

After: (note half of these are reads)
tty da0 da1 acd0 cpu
tin tout KB/t tps MB/s KB/t tps MB/s KB/t tps MB/s us ni sy in id
0 76 63.62 434 26.99 0.00 0 0.00 0.00 0 0.00 0 0 18 1 80
0 76 63.58 424 26.30 0.00 0 0.00 0.00 0 0.00 0 0 17 2 82
0 76 63.82 438 27.32 0.00 0 0.00 0.00 0 0.00 1 0 19 2 79

Reviewed by: mckusick
Approved by: re
X-MFC after: immediately (was heavily tested in -stable for 4 months)


# 105136 14-Oct-2002 mckusick

When reading or writing the extended attributes of a special device
or fifo in UFS2, the normal ufs_strategy routine needs to be used
rather than the spec_strategy or fifo_strategy routine. Thus the
ffsext_strategy routine is interposed in the ffs_vnops vectors for
special devices and fifo's to pick off this special case. Otherwise
it simply falls through to the usual spec_strategy or fifo_strategy
routine.

Submitted by: Robert Watson <rwatson@FreeBSD.org>
Sponsored by: DARPA & NAI Labs.


# 104346 02-Oct-2002 dd

size_t is not a struct (fix mislabelling in a comment).


# 104094 28-Sep-2002 phk

Be consistent about "static" functions: if the function is marked
static in its prototype, mark it static at the definition too.

Inspired by: FlexeLint warning #512


# 104051 27-Sep-2002 phk

Use our mount-credential if we get a NOCRED when we try to write out EA
space back to disk.

This is wrong in many ways, but not as wrong as a panic.

Pancied on: rwatson & jmallet
Sponsored by: DARPA & NAI Labs.


# 103946 25-Sep-2002 jeff

- Convert locks to use standard macros.
- Lock access to the buflists.
- Document broken locking.
- Use vrefcnt().


# 102991 05-Sep-2002 phk

Implement the VOP_OPENEXTATTR() and VOP_CLOSEEXTATTR() methods.

Use extattr_check_cred() to check access to EAs.

This is still a WIP.

Sponsored by: DARPA & NAI Labs.


# 102957 05-Sep-2002 bde

Include <sys/malloc.h> instead of depending on namespace pollution 2
layers deep in <sys/proc.h> or <sys/vnode.h>.

Include <sys/vmmeter.h> instead of depending on namespace pollution in
<sys/pcpu.h>.

Sorted includes as much as possible.


# 102608 30-Aug-2002 phk

Correctly handle setting, getting and deleting EA's with zero length content.

Sponsored by: DARPA & NAI Labs.


# 102382 24-Aug-2002 alc

o Retire vm_page_zero_fill() and vm_page_zero_fill_area(). Ever since
pmap_zero_page() and pmap_zero_page_area() were modified to accept
a struct vm_page * instead of a physical address, vm_page_zero_fill()
and vm_page_zero_fill_area() have served no purpose.


# 102175 20-Aug-2002 phk

Implement list of EA return functionality.
Correctly delete EA's when the content length is set to zero.

Sponsored by: DARPA & NAI Labs.


# 102090 19-Aug-2002 phk

First snapshot of UFS2 EA support.

Sponsored by: DARPA & NAI Labs.


# 101789 13-Aug-2002 phk

Expand the arguments to ffs_ext{read,write}() to their component
parts rather than use vop_{read,write}_args. Access to these
functions will ultimately not be available through the
"vop_{read,write}+IO_EXT" API but this functionality is retained
for debugging purposes for now.

Sponsored by: DARPA & NAI Labs.


# 101780 13-Aug-2002 phk

Unravel the UFS_EXTATTR incest between FFS and UFS: UFS_EXTATTR is an
UFS only thing, and FFS should in principle not know if it is enabled
or not.

This commit cleans ffs_vnops.c for such knowledge, but not ffs_vfsops.c

Sponsored by: DARPA and NAI Labs.


# 101720 12-Aug-2002 phk

Stop pretending that the FFS file ufs_readwrite.c is a UFS file.

Instead of #including it, pull it into ffs_vnops.c and name things
correctly.

Sponsored by: DARPA & NAI Labs.


# 101308 04-Aug-2002 jeff

- Replace v_flag with v_iflag and v_vflag
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
clear.
- Many functions in vfs_subr.c were restructured to provide for stronger
locking.

Idea stolen from: BSD/OS


# 98542 21-Jun-2002 mckusick

This commit adds basic support for the UFS2 filesystem. The UFS2
filesystem expands the inode to 256 bytes to make space for 64-bit
block pointers. It also adds a file-creation time field, an ability
to use jumbo blocks per inode to allow extent like pointer density,
and space for extended attributes (up to twice the filesystem block
size worth of attributes, e.g., on a 16K filesystem, there is space
for 32K of attributes). UFS2 fully supports and runs existing UFS1
filesystems. New filesystems built using newfs can be built in either
UFS1 or UFS2 format using the -O option. In this commit UFS1 is
the default format, so if you want to build UFS2 format filesystems,
you must specify -O 2. This default will be changed to UFS2 when
UFS2 proves itself to be stable. In this commit the boot code for
reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c)
as there is insufficient space in the boot block. Once the size of the
boot block is increased, this code can be defined.

Things to note: the definition of SBSIZE has changed to SBLOCKSIZE.
The header file <ufs/ufs/dinode.h> must be included before
<ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and
ufs_lbn_t.

Still TODO:
Verify that the first level bootstraps work for all the architectures.
Convert the utility ffsinfo to understand UFS2 and test growfs.
Add support for the extended attribute storage. Update soft updates
to ensure integrity of extended attribute storage. Switch the
current extended attribute interfaces to use the extended attribute
storage. Add the extent like functionality (framework is there,
but is currently never used).

Sponsored by: DARPA & NAI Labs.
Reviewed by: Poul-Henning Kamp <phk@freebsd.org>


# 95974 03-May-2002 phk

Name ufs_vop_[gs]etextattr() consistently with the rest of our VOPs and
put then in the ufs_vnops where they belong, rather than in the ffs_vnops.

Ok'ed by: rwatson
Sponsored by: DARPA & NAI Labs.


# 92728 19-Mar-2002 alfred

Remove __P.


# 92250 13-Mar-2002 mckusick

This corrects the first of two known deadlock conditions that
come from the presence of a snapshot file.


# 83366 12-Sep-2001 julian

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 76167 01-May-2001 phk

Implement vop_std{get|put}pages() and add them to the default vop[].

Un-copy&paste all the VOP_{GET|PUT}PAGES() functions which do nothing but
the default.


# 76132 29-Apr-2001 phk

VOP_BALLOC was never really a VOP in the first place, so convert it
to UFS_BALLOC like the other "between UFS and FFS function interfaces".


# 76117 29-Apr-2001 grog

Revert consequences of changes to mount.h, part 2.

Requested by: bde


# 75858 23-Apr-2001 grog

Correct #includes to work with fixed sys/mount.h.


# 74433 19-Mar-2001 rwatson

o Change options FFS_EXTATTR and options FFS_EXTATTR_AUTOSTART to
options UFS_EXTATTR and UFS_EXTATTR_AUTOSTART respectively. This change
reflects the fact that our EA support is implemented entirely at the
UFS layer (modulo FFS start/stop/autostart hooks for mount and unmount
events). This also better reflects the fact that [shortly] MFS will also
support EAs, as well as possibly IFS.

o Consumers of the EA support in FFS are reminded that as a result, they
must change kernel config files to reflect the new option names.

Obtained from: TrustedBSD Project


# 73942 07-Mar-2001 mckusick

Fixes to track snapshot copy-on-write checking in the specinfo
structure rather than assuming that the device vnode would reside
in the FFS filesystem (which is obviously a broken assumption with
the device filesystem).


# 72012 04-Feb-2001 phk

Another round of the <sys/queue.h> FOREACH transmogriffer.

Created with: sed(1)
Reviewed by: md5(1)


# 67106 14-Oct-2000 adrian

Initial commit of IFS - a inode-namespaced FFS. Here is a short
description:

How it works:
--

Basically ifs is a copy of ffs, overriding some vfs/vnops. (Yes, hack.)
I didn't see the need in duplicating all of sys/ufs/ffs to get this
off the ground.

File creation is done through a special file - 'newfile' . When newfile
is called, the system allocates and returns an inode. Note that newfile
is done in a cloning fashion:

fd = open("newfile", O_CREAT|O_RDWR, 0644);
fstat(fd, &st);

printf("new file is %d\n", (int)st.st_ino);

Once you have created a file, you can open() and unlink() it by its returned
inode number retrieved from the stat call, ie:

fd = open("5", O_RDWR);

The creation permissions depend entirely if you have write access to the
root directory of the filesystem.

To get the list of currently allocated inodes, VOP_READDIR has been added
which returns a directory listing of those currently allocated.

--

What this entails:

* patching conf/files and conf/options to include IFS as a new compile
option (and since ifs depends upon FFS, include the FFS routines)

* An entry in i386/conf/NOTES indicating IFS exists and where to go for
an explanation

* Unstaticize a couple of routines in src/sys/ufs/ffs/ which the IFS
routines require (ffs_mount() and ffs_reload())

* a new bunch of routines in src/sys/ufs/ifs/ which implement the IFS
routines. IFS replaces some of the vfsops, and a handful of vnops -
most notably are VFS_VGET(), VOP_LOOKUP(), VOP_UNLINK() and VOP_READDIR().
Any other directory operation is marked as invalid.

What this results in:

* an IFS partition's create permissions are controlled by the perm/ownership of
the root mount point, just like a normal directory

* Each inode has perm and ownership too

* IFS does *NOT* mean an FFS partition can be opened per inode. This is a
completely seperate filesystem here

* Softupdates doesn't work with IFS, and really I don't think it needs it.
Besides, fsck's are FAST. (Try it :-)

* Inodes 0 and 1 aren't allocatable because they are special (dump/swap IIRC).
Inode 2 isn't allocatable since UFS/FFS locks all inodes in the system against
this particular inode, and unravelling THAT code isn't trivial. Therefore,
useful inodes start at 3.

Enjoy, and feedback is definitely appreciated!


# 66886 09-Oct-2000 eivind

Blow away the v_specmountpoint define, replacing it with what it was
defined as (rdev->si_mountpoint)


# 66187 21-Sep-2000 rwatson

o Permit UFS Extended Attributes to be associated with special devices
and FIFOs.

Obtained from: TrustedBSD Project


# 62976 11-Jul-2000 mckusick

Add snapshots to the fast filesystem. Most of the changes support
the gating of system calls that cause modifications to the underlying
filesystem. The gating can be enabled by any filesystem that needs
to consistently suspend operations by adding the vop_stdgetwritemount
to their set of vnops. Once gating is enabled, the function
vfs_write_suspend stops all new write operations to a filesystem,
allows any filesystem modifying system calls already in progress
to complete, then sync's the filesystem to disk and returns. The
function vfs_write_resume allows the suspended write operations to
begin again. Gating is not added by default for all filesystems as
for SMP systems it adds two extra locks to such critical kernel
paths as the write system call. Thus, gating should only be added
as needed.

Details on the use and current status of snapshots in FFS can be
found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness
is not included here. Unless and until you create a snapshot file,
these changes should have no effect on your system (famous last words).


# 61730 16-Jun-2000 phk

Revert part of my bioops change which implemented panic(8).


# 61724 16-Jun-2000 phk

Virtualizes & untangles the bioops operations vector.

Ref: Message-ID: <18317.961014572@critter.freebsd.dk> To: current@


# 60041 05-May-2000 phk

Separate the struct bio related stuff out of <sys/buf.h> into
<sys/bio.h>.

<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bdes teachings on the
subject of nested includes.

Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.> unless they need caching of data.

Still a few bogus uses of struct buf to track down.

Repocopy by: peter


# 59241 15-Apr-2000 rwatson

Introduce extended attribute support for FFS, allowing arbitrary
(name, value) pairs to be associated with inodes. This support is
used for ACLs, MAC labels, and Capabilities in the TrustedBSD
security extensions, which are currently under development.

In this implementation, attributes are backed to data vnodes in the
style of the quota support in FFS. Support for FFS extended
attributes may be enabled using the FFS_EXTATTR kernel option
(disabled by default). Userland utilities and man pages will be
committed in the next batch. VFS interfaces and man pages have
been in the repo since 4.0-RELEASE and are unchanged.

o ufs/ufs/extattr.h: UFS-specific extattr defines
o ufs/ufs/ufs_extattr.c: bulk of support routines
o ufs/{ufs,ffs,mfs}/*.[ch]: hooks and extattr.h includes
o contrib/softupdates/ffs_softdep.c: extattr.h includes
o conf/options, conf/files, i386/conf/LINT: added FFS_EXTATTR

o coda/coda_vfsops.c: XXX required extattr.h due to ufsmount.h
(This should not be the case, and will be fixed in a future commit)

Currently attributes are not supported in MFS. This will be fixed.

Reviewed by: adrian, bp, freebsd-fs, other unthanked souls
Obtained from: TrustedBSD Project


# 55756 10-Jan-2000 phk

Give vn_isdisk() a second argument where it can return a suitable errno.

Suggested by: bde


# 55697 09-Jan-2000 mckusick

Several performance improvements for soft updates have been added:
1) Fastpath deletions. When a file is being deleted, check to see if it
was so recently created that its inode has not yet been written to
disk. If so, the delete can proceed to immediately free the inode.
2) Background writes: No file or block allocations can be done while the
bitmap is being written to disk. To avoid these stalls, the bitmap is
copied to another buffer which is written thus leaving the original
available for futher allocations.
3) Link count tracking. Constantly track the difference in i_effnlink and
i_nlink so that inodes that have had no change other than i_effnlink
need not be written.
4) Identify buffers with rollback dependencies so that the buffer flushing
daemon can choose to skip over them.


# 53577 22-Nov-1999 phk

Convert various pieces of code to use vn_isdisk() rather than checking
for vp->v_type == VBLK.

In ccd: we don't need to call VOP_GETATTR to find the type of a vnode.

Reviewed by: sos


# 52635 29-Oct-1999 phk

useracc() the prequel:

Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.


# 50477 27-Aug-1999 peter

$Id$ -> $FreeBSD$


# 49535 08-Aug-1999 phk

Decommision miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>,
a few lines into <sys/vnode.h>.

Add a few fields to struct specinfo, paving the way for the fun part.


# 48225 26-Jun-1999 mckusick

Convert buffer locking from using the B_BUSY and B_WANTED flags to using
lockmgr locks. This commit should be functionally equivalent to the old
semantics. That is, all buffer locking is done with LK_EXCLUSIVE
requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will
be done in future commits.


# 47995 18-Jun-1999 mckusick

On our final pass through ffs_fsync, do all I/O synchronously so that
we can find out if our flush is failing because of write errors. This
change avoids a "flush failed" panic during unrecoverable disk errors.


# 47131 13-May-1999 mckusick

Add a hook to ffs_fsync to allow soft updates to get first chance at doing
a sync on the block device for the filesystem. That allows it to push the
bitmap blocks before the inode blocks which greatly reduces the number of
inode rollbacks that need to be done.


# 44391 02-Mar-1999 mckusick

When fsync'ing a file on a filesystem using soft updates, we first try
to write all the dirty blocks. If some of those blocks have dependencies,
they will be remarked dirty when the I/O completes. On systems with
really fast I/O systems, it is possible to get in an infinite loop trying
to flush the buffers, because the I/O finishes before we can get all the
dirty buffers off the v_dirtyblkhd list and into the I/O queue. (The
previous algorithm looped over the v_dirtyblkhd list writing out buffers
until the list emptied.) So, now we mark each buffer that we try to
write so that we can distinguish the ones that are being remarked dirty
from those that we have not yet tried to flush. Once we have tried to
push every buffer once, we then push any associated metadata that is
causing the remaining buffers to be redirtied.

Submitted by: Matthew Dillon <dillon@apollo.backplane.com>


# 42374 07-Jan-1999 bde

Don't pass unused unused timestamp args to UFS_UPDATE() or waste
time initializing them. This almost finishes centralizing (in-core)
timestamp updates in ufs_itimes().


# 40790 31-Oct-1998 peter

Use TAILQ macros for clean/dirty block list processing. Set b_xflags
rather than abusing the list next pointer with a magic number.


# 39623 24-Sep-1998 luoqi

Eliminate a race in VOP_FSYNC() when softupdates is enabled.
Submitted by: Kirk McKusick <mckusick@McKusick.COM>
Two minor changes are also included,
1. Remove gratuitious checks for error return from vn_lock with LK_RETRY set,
vn_lock should always succeed in these cases.
2. Back out change rev. 1.36->1.37, which unnecessarily makes async mount
a little more unstable. It also keeps us in sync with other BSDs.
Suggested by: Bruce Evans <bde@zeta.org.au>


# 38907 07-Sep-1998 bde

Put the zombie ffs sysctl node in "notyet" state together with its few
remaining children. Prepare it for MOUNT_UFS going away.


# 36863 10-Jun-1998 julian

Back out John's changes 1.45 -> 1.46
Kirk confirms that the original semantic was what he wanted...
(well, a very slight difference)
May fix "dangling deps" panic with soft updates.


# 35955 11-May-1998 julian

Add missing splx()

Submitted by: Luoqi Chen <luoqi@chen.ml.org>


# 34961 30-Mar-1998 phk

Eradicate the variable "time" from the kernel, using various measures.
"time" wasn't a atomic variable, so splfoo() protection were needed
around any access to it, unless you just wanted the seconds part.

Most uses of time.tv_sec now uses the new variable time_second instead.

gettime() changed to getmicrotime(0.

Remove a couple of unneeded splfoo() protections, the new getmicrotime()
is atomic, (until Bruce sets a breakpoint in it).

A couple of places needed random data, so use read_random() instead
of mucking about with time which isn't random.

Add a new nfs_curusec() function.

Mark a couple of bogosities involving the now disappeard time variable.

Update ffs_update() to avoid the weird "== &time" checks, by fixing the
one remaining call that passwd &time as args.

Change profiling in ncr.c to use ticks instead of time. Resolution is
the same.

Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call
hzto() which subtracts time" sequences.

Reviewed by: bde


# 34924 28-Mar-1998 bde

Moved some #includes from <sys/param.h> nearer to where they are actually
used.


# 34734 21-Mar-1998 dyson

Softdep_sync_metadata appears to expect that it is called at splbio,
so make it so...


# 34696 19-Mar-1998 dyson

Fix vfs_bio_awrite usage, and correct vtruncbuf usage.


# 34266 08-Mar-1998 julian

Reviewed by: dyson@freebsd.org (john Dyson), dg@root.com (david greenman)
Submitted by: Kirk McKusick (mcKusick@mckusick.com)
Obtained from: WHistle development tree


# 33847 26-Feb-1998 msmith

In the author's words:

These diffs implement the first stage of a VOP_{GET|PUT}PAGES pushdown
for local media FS's.

See ffs_putpages in /sys/ufs/ufs/ufs_readwrite.c for implementation
details for generic *_{get|put}pages for local media FS's. Support
is trivial to add for any FS that formerly relied on the default
behaviour of the vnode_pager in in EOPNOTSUPP cases (just copy the
ffs_getpages() code for the FS in question's *_{get|put}pages).

Obviously, it would be better if each local media FS implemented a
more optimal method, instead of calling an exported interface from
the /sys/vm/vnode_pager.c, but this is a necessary first step in
getting the FS's to a point where they can be supplied with better
implementations on a case-by-case basis.

Obviously, the cd9660_putpages() can be rather trivial (since it
is a read-only FS type 8-)).

A slight (temporary) modification is made to print a diagnostic message
in the case where the underlying filesystem attempts to engage in the
previous behaviour. Failure is likely to be ungraceful.

Submitted by: terry@freebsd.org (Terry Lambert)


# 33134 06-Feb-1998 eivind

Back out DIAGNOSTIC changes.


# 33108 04-Feb-1998 eivind

Turn DIAGNOSTIC into a new-style option.


# 32976 01-Feb-1998 dyson

Back out recent laptop sync changes. They had significant errors.


# 32951 31-Jan-1998 dyson

Support more intelligent sync operations for MNT_NOATIME.
PR: kern/5577
Submitted by: Craig Leres <leres@ee.lbl.gov>


# 32286 06-Jan-1998 dyson

Make our v_usecount vnode reference count work identically to the
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_object's are no longer freed gratuitiously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.

When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.

When vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.

A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.

Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon a reasonably possible.


# 30780 27-Oct-1997 bde

Removed unused #includes. The need for most of them went away with
recent changes (docluster* and vfs improvements).


# 30492 16-Oct-1997 phk

Another VFS cleanup "kilo commit"

1. Remove VOP_UPDATE, it is (also) an UFS/{FFS,LFS,EXT2FS,MFS}
intereface function, and now lives in the ufsmount structure.

2. Remove VOP_SEEK, it was unused.

3. Add mode default vops:

VOP_ADVLOCK vop_einval
VOP_CLOSE vop_null
VOP_FSYNC vop_null
VOP_IOCTL vop_enotty
VOP_MMAP vop_einval
VOP_OPEN vop_null
VOP_PATHCONF vop_einval
VOP_READLINK vop_einval
VOP_REALLOCBLKS vop_eopnotsupp

And remove identical functionality from filesystems

4. Add vop_stdpathconf, which returns the canonical stuff. Use
it in the filesystems. (XXX: It's probably wrong that specfs
and fifofs sets this vop, shouldn't it come from the "host"
filesystem, for instance ufs or cd9660 ?)

5. Try to make system wide VOP functions have vop_* names.

6. Initialize the um_* vectors in LFS.

(Recompile your LKMS!!!)


# 30474 16-Oct-1997 phk

VFS mega cleanup commit (x/N)

1. Add new file "sys/kern/vfs_default.c" where default actions for
VOPs go. Implement proper defaults for ABORTOP, BWRITE, LEASE,
POLL, REVOKE and STRATEGY. Various stuff spread over the entire
tree belongs here.

2. Change VOP_BLKATOFF to a normal function in cd9660.

3. Kill VOP_BLKATOFF, VOP_TRUNCATE, VOP_VFREE, VOP_VALLOC. These
are private interface functions between UFS and the underlying
storage manager layer (FFS/LFS/MFS/EXT2FS). The functions now
live in struct ufsmount instead.

4. Remove a kludge of VOP_ functions in all filesystems, that did
nothing but obscure the simplicity and break the expandability.
If a filesystem doesn't implement VOP_FOO, it shouldn't have an
entry for it in its vnops table. The system will try to DTRT
if it is not implemented. There are still some cruft left, but
the bulk of it is done.

5. Fix another VCALL in vfs_cache.c (thanks Bruce!)


# 30439 15-Oct-1997 phk

vnops megacommit

1. Use the default function to access all the specfs operations.
2. Use the default function to access all the fifofs operations.
3. Use the default function to access all the ufs operations.
4. Fix VCALL usage in vfs_cache.c
5. Use VOCALL to access specfs functions in devfs_vnops.c
6. Staticize most of the spec and fifofs vnops functions.
7. Make UFS panic if it lacks bits of the underlying storage handling.


# 30434 15-Oct-1997 phk

Hmm, realign the vnops into two columns.


# 30431 15-Oct-1997 phk

Stylistic overhaul of vnops tables.
1. Remove comment stating the blatantly obvious.
2. Align in two columns.
3. Sort all but the default element alphabetically.
4. Remove XXX comments pointing out entries not needed.


# 30283 10-Oct-1997 phk

Add type arg to ffs_mountfs and avoid examining v_tag to find out
if MFS is getting a free ride.

Use generic ufs_reclaim().


# 29888 27-Sep-1997 kato

Clustered read and write are switched at mount-option level.

1. Clustered I/O is switched by the MNT_NOCLUSTERR and MNT_NOCLUSTERW
bits of the mnt_flag. The sysctl variables, vfs.foo.doclusterread
and vfs.foo.doclusterwrite are deleted. Only mount option can
control clustered I/O from userland.
2. When foofs_mount mounts block device, foofs_mount checks D_CLUSTERR
and D_CLUSTERW bits of the d_flags member in the block device switch
table. If D_NOCLUSTERR / D_NOCLUSTERW are set, MNT_NOCLUSTERR /
MNT_NOCLUSTERW bits will be set. In this case, MNT_NOCLUSTERR and
MNT_NOCLUSTERW cannot be cleared from userland.
3. Vnode driver disables both clustered read and write.
4. Union filesystem disables clutered write.

Reviewed by: bde


# 29362 14-Sep-1997 peter

Convert select -> poll.
Delete 'always succeed' select/poll handlers, replaced with generic call.
Flag missing vnode op table entries.


# 29041 02-Sep-1997 bde

Removed unused #includes.


# 28787 26-Aug-1997 phk

Uncut&paste cache_lookup().

This unifies several times in theory indentical 50 lines of code.

The filesystems have a new method: vop_cachedlookup, which is the
meat of the lookup, and use vfs_cache_lookup() for their vop_lookup
method. vfs_cache_lookup() will check the namecache and pass on
to the vop_cachedlookup method in case of a miss.

It's still the task of the individual filesystems to populate the
namecache with cache_enter().

Filesystems that do not use the namecache will just provide the
vop_lookup method as usual.


# 28701 25-Aug-1997 kato

Renamed doclusterread/write to unique names (ffs_doclusterread/write),
and staticize them. Move the #include of <sys/sysctl.h> to the top of
the file.

Pointed out by: Bruce Evans <bde@zeta.org.au>


# 24101 22-Mar-1997 bde

Fixed some invalid (non-atomic) accesses to `time', mostly ones of the
form `tv = time'. Use a new function gettime(). The current version
just forces atomicicity without fixing precision or efficiency bugs.
Simplified some related valid accesses by using the central function.


# 23383 04-Mar-1997 bde

Fixed connection of vfs.ffs node to the sysctl tree.


# 22975 22-Feb-1997 peter

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 22521 10-Feb-1997 dyson

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# 21673 14-Jan-1997 jkh

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 12911 17-Dec-1995 phk

Staticize.


# 12662 07-Dec-1995 dg

Untangled the vm.h include file spaghetti.


# 12288 14-Nov-1995 phk

Get rid of the last debug sysctl variables of the old style.


# 12158 09-Nov-1995 bde

Introduced a type `vop_t' for vnode operation functions and used
it 1138 times (:-() in casts and a few more times in declarations.
This change is null for the i386.

The type has to be `typedef int vop_t(void *)' and not `typedef
int vop_t()' because `gcc -Wstrict-prototypes' warns about the
latter. Since vnode op functions are called with args of different
(struct pointer) types, neither of these function types is any use
for type checking of the arg, so it would be preferable not to use
the complete function type, especially since using the complete
type requires adding 1138 casts to avoid compiler warnings and
another 40+ casts to reverse the function pointer conversions before
calling the functions.


# 11701 23-Oct-1995 dyson

Finalize GETPAGES layering scheme. Move the device GETPAGES
interface into specfs code. No need at this point to modify the
PUTPAGES stuff except in the layered-type (NULL/UNION) filesystems.


# 10998 25-Sep-1995 dyson

Re-enable read clustering.


# 10949 22-Sep-1995 dg

Shit! I changed the wrong doclusterread! ...Thanks to Steven Wallace and
Poul-Henning for convincing me that I should look at my mistake! :-)


# 10946 21-Sep-1995 dg

Disable file read clustering until the bug(s) in vfs_cluster.c are fixed.
This should temporarily fix the sig 10/11 problems that people have been
having for the past 3 weeks.


# 10578 06-Sep-1995 dyson

Added indirect pointer for ffs_getpages, and added external declaration.


# 8876 30-May-1995 rgrimes

Remove trailing whitespace.


# 8692 21-May-1995 dg

Changes to fix the following bugs:

1) Files weren't properly synced on filesystems other than UFS. In some
cases, this lead to lost data. Most likely would be noticed on NFS.
The fix is to make the VM page sync/object_clean general rather than
in each filesystem.
2) Mixing regular and mmaped file I/O on NFS was very broken. It caused
chunks of files to end up as zeroes rather than the intended contents.
The fix was to fix several race conditions and to kludge up the
"b_dirtyoff" and "b_dirtyend" that NFS relies upon - paying attention
to page modifications that occurred via the mmapping.

Reviewed by: David Greenman
Submitted by: John Dyson


# 7695 09-Apr-1995 dg

Changes from John Dyson and myself:

Fixed remaining known bugs in the buffer IO and VM system.

vfs_bio.c:
Fixed some race conditions and locking bugs. Improved performance
by removing some (now) unnecessary code and fixing some broken
logic.
Fixed process accounting of # of FS outputs.
Properly handle NFS interrupts (B_EINTR).

(various)
Replaced calls to clrbuf() with calls to an optimized routine
called vfs_bio_clrbuf().

(various FS sync)
Sync out modified vnode_pager backed pages.

ffs_vnops.c:
Do two passes: Sync out file data first, then indirect blocks.

vm_fault.c:
Fixed deadly embrace caused by acquiring locks in the wrong order.

vnode_pager.c:
Changed to use buffer I/O system for writing out modified pages. This
should fix the problem with the modification date previous not getting
updated. Also dramatically simplifies the code. Note that this is
going to change in the future and be implemented via VOP_PUTPAGES().

vm_object.c:
Fixed a pile of bugs related to cleaning (vnode) objects. The performance
of vm_object_page_clean() is terrible when dealing with huge objects,
but this will change when we implement a binary tree to keep the object
pages sorted.

vm_pageout.c:
Fixed broken clustering of pageouts. Fixed race conditions and other
lockup style bugs in the scanning of pages. Improved performance.


# 5455 09-Jan-1995 dg

These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.

The majority of the merged VM/cache work is by John Dyson.

The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.

vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.

vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.

vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.

vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.

vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.

pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.

vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.

proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.

swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.

machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.

machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.

ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.

Submitted by: John Dyson and David Greenman


# 3487 09-Oct-1994 phk

Cosmetics. make gcc less noisy. Still some way to go here.


# 3396 06-Oct-1994 dg

Use tsleep() rather than sleep so that 'ps' is more informative about
the wait.


# 2979 22-Sep-1994 wollman

More loadable VFS changes:

- Make a number of filesystems work again when they are statically compiled
(blush)

- FIFOs are no longer optional; ``options FIFO'' removed from distributed
config files.


# 2946 21-Sep-1994 wollman

Implemented loadable VFS modules, and made most existing filesystems
loadable. (NFS is a notable exception.)


# 1960 08-Aug-1994 dg

Made lockf advisory locking code generic (rather than ufs specific), and
use it in NFS. This is required both for diskless support and for POSIX
compliance. Note: the support in NFS is only for the local node.

Submitted by: based on work originally done by Yuval Yurom


# 1817 02-Aug-1994 dg

Added $Id$


# 1542 24-May-1994 rgrimes

This commit was generated by cvs2svn to compensate for changes in r1541,
which included commits to RCS files with non-trunk default branches.


# 1541 24-May-1994 rgrimes

BSD 4.4 Lite Kernel Sources