History log of /freebsd-10.0-release/sys/ufs/ufs/ufs_lookup.c
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# 259065 07-Dec-2013 gjb

- Copy stable/10 (r259064) to releng/10.0 as part of the
10.0-RELEASE cycle.
- Update __FreeBSD_version [1]
- Set branch name to -RC1

[1] 10.0-CURRENT __FreeBSD_version value ended at '55', so
start releng/10.0 at '100' so the branch is started with
a value ending in zero.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation

# 256281 10-Oct-2013 gjb

Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


# 248561 20-Mar-2013 mckusick

When renaming a directory from one parent directory to another,
we need to call ufs_checkpath() to walk from our new location to
the root of the filesystem to ensure that we do not encounter
ourselves along the way. Until now, we accomplished this by reading
the ".." entries of each directory in our path until we reached
the root (or encountered an error). This change tries to avoid the
I/O of reading the ".." entries by first looking them up in the
name cache and only doing the I/O when the name cache lookup fails.

Reviewed by: kib
Tested by: Peter Holm
MFC after: 4 weeks


# 246299 03-Feb-2013 pfg

UFS: Remove dead assignment.

Submitted by: Christoph Mallon
MFC after: 3 days


# 241011 27-Sep-2012 mdf

Fix up kernel sources to be ready for a 64-bit ino_t.

Original code by: Gleb Kurtsou


# 234605 23-Apr-2012 trasz

Remove unused thread argument from vtruncbuf().

Reviewed by: kib


# 231949 20-Feb-2012 kib

Fix found places where uio_resid is truncated to int.

Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the
sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from
the usermode.

Discussed with: bde, das (previous versions)
MFC after: 1 month


# 222954 10-Jun-2011 jeff

- If the fsync in ufs_direnter fails SUJ can later panic because we have
partially added a name. Allow ufs_direnter() to continue in the
hopes that it is a transient error. If it is not, the directory
is corrupted already from IO errors and writing this new block
is not likely to make things worse.


# 219804 20-Mar-2011 kib

Retire opt_ffs_broken_fixme.h.
Instead of directly calling ffs_snapgone(), use UFS_SNAPGONE() with
usual layering.

Requested by: bde
MFC after: 1 week


# 219712 17-Mar-2011 kib

Remove the #if defined(FFS) || defined(IFS) braces around the calls to
ffs_snapgone(). ufs.ko module is not build with FFS define, causing
snapshot inode number slots in superblock never be freed, as well as a
reference on the snapshot vnode.

IFS was removed several years ago, and UFS/FFS separation was not
maintained for real.

Reported, analyzed and tested by: Yamagi Burmeister <lists yamagi org>
MFC after: 3 days


# 209717 06-Jul-2010 jeff

- Handle the truncation of an inode with an effective link count of 0 in
the context of the process that reduced the effective count. Previously
all truncation as a result of unlink happened in the softdep flush
thread. This had the effect of being impossible to rate limit properly
with the journal code. Now the process issuing unlinks is suspended
when the journal files. This has a side-effect of improving rm
performance by allowing more concurrent work.
- Handle two cases in inactive, one for effnlink == 0 and another when
nlink finally reaches 0.
- Eliminate the SPACECOUNTED related code since the truncation is no
longer delayed.

Discussed with: mckusick


# 209367 20-Jun-2010 kib

Ensure that VOP_ACCESSX is called with exclusively locked vnode for
the kernel compiled with QUOTA option. ufs_accessx() upgrades the vdp
vnode lock from shared to exclusive to assign the dquot structure to
the vnode, and ufs_delete_denied() is called when tvp is locked. Since
upgrade drops shared lock when non-blocked upgrade failed, LOR is there.

Reported and tested by: Dmitry Pryanishnikov <lynx.ripe gmail com>
Tested by: pho
PR: kern/147890
MFC after: 1 week


# 207141 24-Apr-2010 jeff

- Merge soft-updates journaling from projects/suj/head into head. This
brings in support for an optional intent log which eliminates the need
for background fsck on unclean shutdown.

Sponsored by: iXsystems, Yahoo!, and Juniper.
With help from: McKusick and Peter Holm


# 206894 20-Apr-2010 kib

The cache_enter(9) function shall not be called for doomed dvp.
Assert this.

In the reported panic, vdestroy() fired the assertion "vp has namecache
for ..", because pseudofs may end up doing cache_enter() with reclaimed
dvp, after dotdot lookup temporary unlocked dvp.
Similar problem exists in ufs_lookup() for "." lookup, when vnode
lock needs to be upgraded.

Verify that dvp is not reclaimed before calling cache_enter().

Reported and tested by: pho
Reviewed by: kan
MFC after: 2 weeks


# 202113 11-Jan-2010 mckusick

Background:

When renaming a directory it passes through several intermediate
states. First its new name will be created causing it to have two
names (from possibly different parents). Next, if it has different
parents, its value of ".." will be changed from pointing to the old
parent to pointing to the new parent. Concurrently, its old name
will be removed bringing it back into a consistent state. When fsck
encounters an extra name for a directory, it offers to remove the
"extraneous hard link"; when it finds that the names have been
changed but the update to ".." has not happened, it offers to rewrite
".." to point at the correct parent. Both of these changes were
considered unexpected so would cause fsck in preen mode or fsck in
background mode to fail with the need to run fsck manually to fix
these problems. Fsck running in preen mode or background mode now
corrects these expected inconsistencies that arise during directory
rename. The functionality added with this update is used by fsck
running in background mode to make these fixes.

Solution:

This update adds three new fsck sysctl commands to support background
fsck in correcting expected inconsistencies that arise from incomplete
directory rename operations. They are:

setcwd(dirinode) - set the current directory to dirinode in the
filesystem associated with the snapshot.
setdotdot(oldvalue, newvalue) - Verify that the inode number for ".."
in the current directory is oldvalue then change it to newvalue.
unlink(nameptr, oldvalue) - Verify that the inode number associated
with nameptr in the current directory is oldvalue then unlink it.

As with all other fsck sysctls, these new ones may only be used by
processes with appropriate priviledge.

Reported by: jeff
Security issues: rwatson


# 200796 21-Dec-2009 trasz

Implement NFSv4 ACL support for UFS.

Reviewed by: rwatson


# 194296 16-Jun-2009 kib

Do not use casts (int *)0 and (struct thread *)0 for the arguments of
vn_rdwr, use NULL.

Reviewed by: jhb
MFC after: 1 week


# 191315 20-Apr-2009 kib

In ufs_checkpath(), recheck that '..' still points to the inode with
the same inode number after VFS_VGET() and relock of the vp. If '..'
changed, redo the lookup. To reduce code duplication, move the code to
read '..' dirent into the static helper function ufs_dir_dd_ino().

Supply the source inode number as an argument to ufs_checkpath() instead
of the source inode itself. The inode is unlocked, thus it might be
reclaimed, causing accesses to the freed memory.

Use vn_vget_ino() to get the '..' vnode by its inode number, instead of
directly code VFS_VGET() and relock, to properly busy the mount point
while vp lock is dropped.

Noted and reviewed by: tegge
Tested by: pho
MFC after: 1 month


# 191260 19-Apr-2009 kib

When verifying '..' after VFS_VGET() in ufs_lookup(), do not return
error if '..' is still there but changed between lookup and check.
Start relookup instead. Rename is supposed to change '..' reference
atomically, so transient failures introduced by r191137 are wrong.

While rearranging the code to allow lookup restart in ufs_lookup(),
remove the comment that only distracts the reader.

Noted and reviewed by: tegge
Also reported by: pho
MFC after: 1 month


# 191137 16-Apr-2009 kib

Verify that '..' still exists with the same inode number after
VFS_VGET() has returned in ufs_lookup(). If the '..' lookup started
immediately before the parent directory was removed, we might return
either cleared or unrelated inode otherwise.

Ufs_lookup() is split into new function ufs_lookup_() that either does
lookup, or verifies that directory entry exists and references supplied
inode number.

Reviewed by: tegge
Tested by: pho,
Andreas Tobler <andreast-list fgznet ch> (previous version)
MFC after: 1 month


# 187528 21-Jan-2009 kib

Move the code from ufs_lookup.c used to do dotdot lookup, into
the helper function. It is supposed to be useful for any filesystem
that has to unlock dvp to walk to the ".." entry in lookup routine.

Requested by: jhb
Tested by: pho
MFC after: 1 month


# 185556 02-Dec-2008 kib

Do not lock vnode interlock around reading of v_iflag to check VI_DOOMED.
Read of the pointer is atomic, and flag cannot be set while vnode lock
is held.

Requested by: jhb
MFC after: 1 month


# 185170 22-Nov-2008 kib

Busy ufs filesystem around block of code that does ".." lookup. Since
mnt_lock is before lock of any vnode on the mp, it uses LK_NOWAIT. Since
MNTK_UNMOUNT may be transient, pdp lock is dropped when vfs_busy()
failed, and operation is retried after some time. This way, ffs_vget()
is not called on the mp that may be in the process of being destroyed by
unmount.

Check for the VI_DOOMED flag on pdp after its lock is reacquired, to
better detect some situations where directory containing ".."
entry is removed during the lookup.

Reviewed by: tegge, attilio (previous version)
Tested by: pho
MFC after: 1 month


# 183093 16-Sep-2008 jhb

Retire the 'i_reclen' field from the in-memory i-node. Previously,
during a DELETE lookup operation, lookup would cache the length of the
directory entry to be deleted in 'i_reclen'. Later, the actual VOP to
remove the directory entry (ufs_remove, ufs_rename, etc.) would call
ufs_dirremove() which extended the length of the previous directory
entry to "remove" the deleted entry.

However, we always read the entire block containing the directory
entry when doing the removal, so we always have the directory entry to
be deleted in-memory when doing the update to the directory block.
Also, we already have to figure out where the directory entry that is
being removed is in the block so that we can pass the component name
to the dirhash code to update the dirhash. So, instead of passing
'i_reclen' from ufs_lookup() to the ufs_dirremove() routine, just read
the 'd_reclen' field directly out of the entry being removed when
updating the length of the previous entry in the block.

This avoids a cosmetic issue of writing to 'i_reclen' while holding a
shared vnode lock. It also slightly reduces the amount of side-band
data passed from ufs_lookup() to operations updating a directory via
the directory's i-node.

Reviewed by: jeff


# 183079 16-Sep-2008 jhb

- Only set i_offset in the parent directory's i-node during a lookup for
non-LOOKUP operations.
- Relax a VOP assertion for a DELETE lookup. rename() uses WANTPARENT
instead of LOCKPARENT when looking up the source pathname. ufs_rename()
uses a relookup() to lock the parent directory when it decides to finally
remove the source path. Thus, it is ok for a DELETE with WANTPARENT set
instead of LOCKPARENT to use a shared vnode lock rather than an exclusive
vnode lock.

Reported by: kris (2)
Reviewed by: jeff


# 181018 30-Jul-2008 jhb

Whitespace tweak.


# 179159 20-May-2008 ups

Allow VM object creation in ufs_lookup. (If vfs.vmiodirenable is set)
Directory IO without a VM object will store data in 'malloced' buffers
severely limiting caching of the data. Without this change VM objects for
directories are only created on an open() of the directory.
TODO: Inline test if VM object already exists to avoid locking/function call
overhead.

Tested by: kris@
Reviewed by: jeff@
Reported by: David Filo


# 178420 22-Apr-2008 jeff

- Use a local variable for i_ino in ufs_lookup. It is only used to
communicate between two parts of this one function. This was causing
problems with shared lookups as each would trash the ino value in the
inode.
- Remove the unused i_ino field from the inode structure.


# 178109 11-Apr-2008 jeff

- cache dp->i_offset in the local 'i_offset' variable for use in loop
indexes so directory lookup becomes shared lock safe. In the modifying
cases an exclusive lock is held here so the commit routine may
rely on the state of i_offset.
- Similarly handle i_diroff by fetching at the start and setting only once
the operation is complete. Without the exclusive lock these are only
considered hints.
- Assert that an exclusive lock is held when we're preparing for a commit
routine.
- Honor the lock type request from lookup instead of always using exclusive
locking.

Tested by: pho, kris


# 175294 13-Jan-2008 attilio

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


# 175202 09-Jan-2008 attilio

vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


# 173464 08-Nov-2007 obrien

Turn most ffs 'DIAGNOSTIC's into INVARIANTS.


# 167542 14-Mar-2007 kib

Call getinoquota() before allocating new block for the directory to properly
account for block allocation.

Tested by: Peter Holm
Reviewed by: tegge
Approved by: re (kensmith)


# 160859 31-Jul-2006 obrien

Rather than print out a nice error message giving details sufficent to fix
a 'ufs_dirbad' and then panicing (making it very hard to see the details),
put them in the panic message itself.


# 160269 11-Jul-2006 daichi

The ufs_lookup.c has a critical bug around the whiteout
process. UFS must check a whiteout name when it uses the
whiteout, but the current implementation does not check
the whileout name, so sometimes UFS writes over a wrong
whtieout. UFS *MUST* check the whiteout name to use a
corrent whiteout. This bug leads unionfs. panic.
This commit fixes this trouble.

Submitted by: Masanori Ozawa <ozawa@ongs.co.jp> (unionfs developer)
Reviewed by: tegge & rodrigc (mentor)
Approved by: rodrigc (mentor)
MFC after: 2 weeks


# 156418 08-Mar-2006 tegge

Don't set IN_CHANGE and IN_UPDATE on inodes for potentially suspended
file systems. This could cause deadlocks when creating snapshots.

Reviewed by: jeff


# 151390 16-Oct-2005 truckman

Correct the type of the temporary variable used by ufs_lookup.c:1.78
to fix the race condition in the ufs_lookup() ISDOTDOT code.

Noticed by: bde
MFC after: 12 days


# 151347 14-Oct-2005 truckman

Close a race in the ufs_lookup() code that handles the ISDOTDOT
case by saving the value of dp->i_ino before unlocking the vnode
for the current directory and passing the saved value to VFS_VGET().

Without this change, another thread can overwrite dp->i_ino after
the current directory is unlocked, causing ufs_lookup() to lock
and return the wrong vnode in place of the vnode for its parent
directory. A deadlock can occur if dp->i_ino was changed to a
subdirectory of the current directory because the root to leaf vnode
lock ordering will be violated. A vnode lock can be leaked if
dp->i_ino was changed to point to the current directory, which
causes the current vnode lock for the current directory to be
recursed, which confuses lookup() into calling vrele() when it
should be calling vput().

The probability of this bug being triggered seems to be quite low
unless the sysctl variable debug.vfscache is set to 0.

Reviewed by: jhb
MFC after: 2 weeks


# 145006 13-Apr-2005 jeff

- Change all filesystems and vfs_cache to relock the dvp once the child is
locked in the ISDOTDOT case. Se vfs_lookup.c r1.79 for details.

Sponsored by: Isilon Systems, Inc.


# 144300 29-Mar-2005 jeff

- Remove wantparent, it is no longer necessary. An assert in vfs_lookup.c
prevents any callers from doing a modifying op without
LOCKPARENT or WANTPARENT. It wasn't even properly used in the CREATE
or DELETE cases.


# 144288 29-Mar-2005 jeff

- Honor the cn_lkflags passed from namei() when locking the leaf.

Sponsored by: Isilon Systems, Inc.


# 144208 28-Mar-2005 jeff

- We no longer have to bother with PDIRUNLOCK, lookup() handles it for us.

Sponsored by: Isilon Systems, Inc.


# 140048 11-Jan-2005 phk

Remove the unused credential argument from VOP_FSYNC() and VFS_SYNC().

I'm not sure why a credential was added to these in the first place, it is
not used anywhere and it doesn't make much sense:

The credentials for syncing a file (ability to write to the
file) should be checked at the system call level.

Credentials for syncing one or more filesystems ("none")
should be checked at the system call level as well.

If the filesystem implementation needs a particular credential
to carry out the syncing it would logically have to the
cached mount credential, or a credential cached along with
any delayed write data.

Discussed with: rwatson


# 139825 07-Jan-2005 imp

/* -> /*- for license, minor formatting changes


# 132775 28-Jul-2004 kan

Avoid using casts as lvalues. Introduce DIP_SET macro which sets proper
inode field based on UFS version. Use DIP ro read values and DIP_SET
to modify them throughout FFS code base.


# 127975 07-Apr-2004 imp

Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and irc message from Robert
Watson saying that clause 3 can be removed from those files with an
NAI copyright that also have only a University of California
copyrights.

Approved by: core, rwatson


# 126853 11-Mar-2004 phk

Properly vector all bwrite() and BUF_WRITE() calls through the same path
and s/BUF_WRITE()/bwrite()/ since it now does the same as bwrite().


# 116192 11-Jun-2003 obrien

Use __FBSDID().


# 114293 30-Apr-2003 markm

Fix some easy, global, lint warnings. In most cases, this means
making some local variables static. In a couple of cases, this means
removing an unused variable.


# 104302 01-Oct-2002 phk

Fix some harmless mis-indents.

Spotted by: FlexeLint


# 101941 15-Aug-2002 rwatson

In order to better support flexible and extensible access control,
make a series of modifications to the credential arguments relating
to file read and write operations to cliarfy which credential is
used for what:

- Change fo_read() and fo_write() to accept "active_cred" instead of
"cred", and change the semantics of consumers of fo_read() and
fo_write() to pass the active credential of the thread requesting
an operation rather than the cached file cred. The cached file
cred is still available in fo_read() and fo_write() consumers
via fp->f_cred. These changes largely in sys_generic.c.

For each implementation of fo_read() and fo_write(), update cred
usage to reflect this change and maintain current semantics:

- badfo_readwrite() unchanged
- kqueue_read/write() unchanged
pipe_read/write() now authorize MAC using active_cred rather
than td->td_ucred
- soo_read/write() unchanged
- vn_read/write() now authorize MAC using active_cred but
VOP_READ/WRITE() with fp->f_cred

Modify vn_rdwr() to accept two credential arguments instead of a
single credential: active_cred and file_cred. Use active_cred
for MAC authorization, and select a credential for use in
VOP_READ/WRITE() based on whether file_cred is NULL or not. If
file_cred is provided, authorize the VOP using that cred,
otherwise the active credential, matching current semantics.

Modify current vn_rdwr() consumers to pass a file_cred if used
in the context of a struct file, and to always pass active_cred.
When vn_rdwr() is used without a file_cred, pass NOCRED.

These changes should maintain current semantics for read/write,
but avoid a redundant passing of fp->f_cred, as well as making
it more clear what the origin of each credential is in file
descriptor read/write operations.

Follow-up commits will make similar changes to other file descriptor
operations, and modify the MAC framework to pass both credentials
to MAC policy modules so they can implement either semantic for
revocation.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 101744 12-Aug-2002 rwatson

Pass IO_NOMACCHECK to vn_rdwr() in the following checks to prevent
enforcement of MAC policy on the read or write operations:

- In ext2fs, don't enforce MAC on loop-back reads and writes supporting
directory read operations in lookup(), directory modifications in
rename(), directory write operations in mkdir(), symlink write
operations in symlink().

- In the NFS client locking code, perform vn_rdwr() on the NFS locking
socket without enforcing MAC, since the write is done on behalf of
the kernel NFS implementation rather than the user process.

- In UFS, don't enforce MAC on loop-back reads and writes supporting
directory read operations in lookup(), and symlink write operations
in symlink().

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 100344 19-Jul-2002 mckusick

Add support to UFS2 to provide storage for extended attributes.
As this code is not actually used by any of the existing
interfaces, it seems unlikely to break anything (famous
last words).

The internal kernel interface to manipulate these attributes
is invoked using two new IO_ flags: IO_NORMAL and IO_EXT.
These flags may be specified in the ioflags word of VOP_READ,
VOP_WRITE, and VOP_TRUNCATE. Specifying IO_NORMAL means that
you want to do I/O to the normal data part of the file and
IO_EXT means that you want to do I/O to the extended attributes
part of the file. IO_NORMAL and IO_EXT are mutually exclusive
for VOP_READ and VOP_WRITE, but may be specified individually
or together in the case of VOP_TRUNCATE. For example, when
removing a file, VOP_TRUNCATE is called with both IO_NORMAL
and IO_EXT set. For backward compatibility, if neither IO_NORMAL
nor IO_EXT is set, then IO_NORMAL is assumed.

Note that the BA_ and IO_ flags have been `merged' so that they
may both be used in the same flags word. This merger is possible
by assigning the IO_ flags to the low sixteen bits and the BA_
flags the high sixteen bits. This works because the high sixteen
bits of the IO_ word is reserved for read-ahead and help with
write clustering so will never be used for flags. This merge
lets us get away from code of the form:

if (ioflags & IO_SYNC)
flags |= BA_SYNC;

For the future, I have considered adding a new field to the
vattr structure, va_extsize. This addition could then be
exported through the stat structure to allow applications to
find out the size of the extended attribute storage and also
would provide a more standard interface for truncating them
(via VOP_SETATTR rather than VOP_TRUNCATE).

I am also contemplating adding a pathconf parameter (for
concreteness, lets call it _PC_MAX_EXTSIZE) which would
let an application determine the maximum size of the extended
atribute storage.

Sponsored by: DARPA & NAI Labs.


# 98658 23-Jun-2002 dillon

Rename the BALLOC flags from B_* to BA_* to avoid confusion with the
struct buf B_ flags.

Approved by: mckusick


# 98542 21-Jun-2002 mckusick

This commit adds basic support for the UFS2 filesystem. The UFS2
filesystem expands the inode to 256 bytes to make space for 64-bit
block pointers. It also adds a file-creation time field, an ability
to use jumbo blocks per inode to allow extent like pointer density,
and space for extended attributes (up to twice the filesystem block
size worth of attributes, e.g., on a 16K filesystem, there is space
for 32K of attributes). UFS2 fully supports and runs existing UFS1
filesystems. New filesystems built using newfs can be built in either
UFS1 or UFS2 format using the -O option. In this commit UFS1 is
the default format, so if you want to build UFS2 format filesystems,
you must specify -O 2. This default will be changed to UFS2 when
UFS2 proves itself to be stable. In this commit the boot code for
reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c)
as there is insufficient space in the boot block. Once the size of the
boot block is increased, this code can be defined.

Things to note: the definition of SBSIZE has changed to SBLOCKSIZE.
The header file <ufs/ufs/dinode.h> must be included before
<ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and
ufs_lbn_t.

Still TODO:
Verify that the first level bootstraps work for all the architectures.
Convert the utility ffsinfo to understand UFS2 and test growfs.
Add support for the extended attribute storage. Update soft updates
to ensure integrity of extended attribute storage. Switch the
current extended attribute interfaces to use the extended attribute
storage. Add the extent like functionality (framework is there,
but is currently never used).

Sponsored by: DARPA & NAI Labs.
Reviewed by: Poul-Henning Kamp <phk@freebsd.org>


# 96755 16-May-2002 trhodes

More s/file system/filesystem/g


# 96506 13-May-2002 phk

Remove register keyword.

Sponsored by: DARPA & NAI Labs.
Submitted by: mckusick


# 92462 16-Mar-2002 mckusick

Add a flags parameter to VFS_VGET to pass through the desired
locking flags when acquiring a vnode. The immediate purpose is
to allow polling lock requests (LK_NOWAIT) needed by soft updates
to avoid deadlock when enlisting other processes to help with
the background cleanup. For the future it will allow the use of
shared locks for read access to vnodes. This change touches a
lot of files as it affects most filesystems within the system.
It has been well tested on FFS, loopback, and CD-ROM filesystems.
only lightly on the others, so if you find a problem there, please
let me (mckusick@mckusick.com) know.


# 91406 27-Feb-2002 jhb

Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.


# 91060 22-Feb-2002 phk

Replace bowrite() with BUF_WRITE in ufs.

Remove bowrite(), it is now unused.

This is the first step in getting entirely rid of BIO_ORDERED which is
a generally accepted evil thing.

Approved by: mckusick


# 83366 12-Sep-2001 julian

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 82334 25-Aug-2001 iedowse

When compacting directories, ufs_direnter() always trusted DIRSIZ()
to supply the number of bytes to be bcopy()'d to move an entry. If
d_ino == 0 however, DIRSIZ() is not guaranteed to return a sensible
length, so ufs_direnter could end up corrupting a directory during
compaction. In practice I believe this can only happen after fsck_ffs
has fixed a previously-corrupted directory.

We now deal with any mid-block unused entries specially to avoid
using DIRSIZ() or bcopy() on such entries. We also ensure that the
variables 'dsize' and 'spacefree' contain meaningful values at all
times. Add a few comments to describe better this intricate piece
of code.

The special handling of mid-block unused entries makes the dirhash-
specific bugfix in the previous revision (1.53) now uncecessary,
so this change removes it.

Reviewed by: mckusick


# 82124 21-Aug-2001 iedowse

When compressing directory blocks, the dirhash code didn't check
that the directory entry was in use before attempting to find it
in the hash structures to change its offset. Normally, unused
entries do not need to be moved, but fsck can leave behind some
unused entries that do. A dirhash sanity panic resulted when the
entry to be moved was not found. Add a check that stops entries
with d_ino == 0 from being passed to ufsdirhash_move().


# 81877 18-Aug-2001 peter

Sigh. ufs_lookup() calls ffs_snapgone(), meaning that 'options EXT2FS'
without 'options FFS' would fail to link.


# 79690 13-Jul-2001 iedowse

Return a locked struct buf from ufsdirhash_lookup() to avoid one
extra getblk/brelse sequence for each lookup. We already had this
buf in ufsdirhash_lookup(), so there was no point in brelse'ing it
only to have the caller immediately reaquire the same buffer.

This should make the case of sequential lookups marginally faster;
in my tests, sequential lookups with dirhash enabled are now only
around 1% slower than without dirhash.


# 79561 10-Jul-2001 iedowse

Bring in dirhash, a simple hash-based lookup optimisation for large
directories. When enabled via "options UFS_DIRHASH", in-core hash
arrays are maintained for large directories. These allow all
directory operations to take place quickly instead of requiring
long linear searches. For now anyway, dirhash is not enabled by
default.

The in-core hash arrays have a memory requirement that is approximately
half the size of the size of the on-disk directory file. A number
of new sysctl variables allow control over which directories get
hashed and over the maximum amount of memory that dirhash will use:

vfs.ufs.dirhash_minsize
The minimum on-disk directory size for which hashing should be
used. The default is 2560 (2.5k).

vfs.ufs.dirhash_maxmem
The system-wide maximum total memory to be used by dirhash data
structures. The default is 2097152 (2MB).

The current amount of memory being used by dirhash is visible
through the read-only sysctl variable vfs.ufs.dirhash_maxmem.
Finally, some extra sanity checks that are enabled by default, but
which may have an impact on performance, can be disabled by setting
vfs.ufs.dirhash_docheck to 0.

Discussed on: -fs, -hackers


# 76724 17-May-2001 mckusick

When a new block is allocated to a directory, an fsync of a file
whose name is within that block must ensure not only that the block
containing the file name has been written, but also that the on-disk
directory inode references that block. When a new directory block
is created, we allocate a newdirblk structure which is linked to
the associated allocdirect (on its ad_newdirblk list). When the
allocdirect has been satisfied, the newdirblk structure is moved
to the inodedep id_bufwait list of its directory to await the inode
being written. When the inode is written, the directory entries
are fully committed and can be deleted from their pagedep->id_pendinghd
and inodedep->id_pendinghd lists.


# 76132 29-Apr-2001 phk

VOP_BALLOC was never really a VOP in the first place, so convert it
to UFS_BALLOC like the other "between UFS and FFS function interfaces".


# 76117 29-Apr-2001 grog

Revert consequences of changes to mount.h, part 2.

Requested by: bde


# 75858 23-Apr-2001 grog

Correct #includes to work with fixed sys/mount.h.


# 71976 03-Feb-2001 iedowse

Extend the sanity checks in ufs_lookup to ensure that each directory
entry fits within its DIRBLKSIZ block. The surrounding code is
extremely fragile with respect to corruption of the directory entry
'd_reclen' field; if directory corruption occurs, it can blindly
scan forward beyond the end of the filesystem block. Usually this
results in a 'fault on nofault entry' panic.

Directory corruption is now much more likely to be detected, resulting
in a 'ufs_dirbad' panic. If the filesystem is read-only, it will
simply print a warning message, and skip the corrupted block.

Reviewed by: mckusick


# 71968 03-Feb-2001 iedowse

Use the correct flags field when checking for a read-only filesystem
in ufs_dirbad(). The mnt_stat.f_flags field is only updated by the
syscalls *statfs and getfsstat, so mnt_flag should be used instead.

This only affects whether or not a panic is generated on detection of
certain types of directory corruption.

Reviewed by: mckusick


# 70183 19-Dec-2000 mckusick

Several small but important fixes for snapshots:

1) Be more tolerant of missing snapshot files by only trying to decrement
their reference count if they are registered as active.

2) Fix for snapshots of filesystems with block sizes larger than 8K
(from Ollivier Robert <roberto@eurocontrol.fr>).

3) Fix to avoid losing last block in snapshot file when calculating blocks
that need to be copied (from Don Coleman <coleman@coleman.org>).


# 69967 13-Dec-2000 mckusick

Preventing runaway kernel soft updates memory, take three.
Previously, the syncer process was the only process in the
system that could process the soft updates background work
list. If enough other processes were adding requests to that
list, it would eventually grow without bound. Because some of
the work list requests require vnodes to be locked, it was
not generally safe to let random processes process the work
list while they already held vnodes locked. By adding a flag
to the work list queue processing function to indicate whether
the calling process could safely lock vnodes, it becomes possible
to co-opt other processes into helping out with the work list.
Now when the worklist gets too large, other processes can safely
help out by picking off those work requests that can be handled
without locking a vnode, leaving only the small number of
requests requiring a vnode lock for the syncer process. With
this change, it appears possible to keep even the nastiest
workloads under control.

Submitted by: Paul Saab <ps@yahoo-inc.com>


# 67309 19-Oct-2000 rwatson

o Introduce new VOP_ACCESS() flag VADMIN, allowing file systems to perform
"administrative" authorization checks. In most cases, the VADMIN test
checks to make sure the credential effective uid is the same as the file
owner.
o Modify vaccess() to set VADMIN as an available right if the uid is
appropriate.
o Modify references to uid-based access control operations such that they
now always invoke VOP_ACCESS() instead of using hard-coded policy checks.
o This allows alternative UFS policies to be implemented by replacing only
ufs_access() (such as mandatory system policies).
o VOP_ACCESS() requires the caller to hold an exclusive vnode lock on the
vnode: I believe that new invocations of VOP_ACCESS() are always called
with the lock held.
o Some direct checks of the uid remain, largely associated with the QUOTA
and SUIDDIR code.

Reviewed by: eivind
Obtained from: TrustedBSD Project


# 66033 18-Sep-2000 rwatson

o Substitute suser() calls for direct credential checks, which is now
safe as suser() no longer sets ASU.
o Note that in some cases, the PRISON_ROOT flag is used even though no
process structure is passed, to indicate that if a process structure
(and hence jail) was available, it would be ok. In the long run,
the jail identifier should probably be moved to ucred, as the uidinfo
information was.
o Some uid 0 checks remain relating to the quota code, which I'll leave
for another day.

Reviewed by: phk, eivind
Obtained from: TrustedBSD Project


# 65973 17-Sep-2000 bp

Add new flag PDIRUNLOCK to the component.cn_flags which should be set by
filesystem lookup() routine if it unlocks parent directory. This flag should
be carefully tracked by filesystems if they want to work properly with nullfs
and other stacked filesystems.

VFS takes advantage of this flag to perform symantically correct usage
of vrele() instead of vput() if parent directory already unlocked.

If filesystem fails to track this flag then previous codepath in VFS left
unchanged.

Convert UFS code to set PDIRUNLOCK flag if necessary. Other filesystmes will
be changed after some period of testing.

Reviewed in general by: mckusick, dillon, adrian
Obtained from: NetBSD


# 63897 26-Jul-2000 mckusick

Clean up the snapshot code so that it no longer depends on the use of
the SF_IMMUTABLE flag to prevent writing. Instead put in explicit
checking for the SF_SNAPSHOT flag in the appropriate places. With
this change, it is now possible to rename and link to snapshot files.
It is also possible to set or clear any of the owner, group, or
other read bits on the file, though none of the write or execute
bits can be set. There is also an explicit test to prevent the
setting or clearing of the SF_SNAPSHOT flag via chflags() or
fchflags(). Note also that the modify time cannot be changed as
it needs to accurately reflect the time that the snapshot was taken.

Submitted by: Robert Watson <rwatson@FreeBSD.org>


# 60041 05-May-2000 phk

Separate the struct bio related stuff out of <sys/buf.h> into
<sys/bio.h>.

<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bdes teachings on the
subject of nested includes.

Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.> unless they need caching of data.

Still a few bogus uses of struct buf to track down.

Repocopy by: peter


# 59241 15-Apr-2000 rwatson

Introduce extended attribute support for FFS, allowing arbitrary
(name, value) pairs to be associated with inodes. This support is
used for ACLs, MAC labels, and Capabilities in the TrustedBSD
security extensions, which are currently under development.

In this implementation, attributes are backed to data vnodes in the
style of the quota support in FFS. Support for FFS extended
attributes may be enabled using the FFS_EXTATTR kernel option
(disabled by default). Userland utilities and man pages will be
committed in the next batch. VFS interfaces and man pages have
been in the repo since 4.0-RELEASE and are unchanged.

o ufs/ufs/extattr.h: UFS-specific extattr defines
o ufs/ufs/ufs_extattr.c: bulk of support routines
o ufs/{ufs,ffs,mfs}/*.[ch]: hooks and extattr.h includes
o contrib/softupdates/ffs_softdep.c: extattr.h includes
o conf/options, conf/files, i386/conf/LINT: added FFS_EXTATTR

o coda/coda_vfsops.c: XXX required extattr.h due to ufsmount.h
(This should not be the case, and will be fixed in a future commit)

Currently attributes are not supported in MFS. This will be fixed.

Reviewed by: adrian, bp, freebsd-fs, other unthanked souls
Obtained from: TrustedBSD Project


# 58349 20-Mar-2000 phk

Rename the existing BUF_STRATEGY() to DEV_STRATEGY()

substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo)

substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo)

This patch is machine generated except for the ccd.c and buf.h parts.


# 58088 15-Mar-2000 mckusick

Bug fixes for currently harmless bugs that could rise to bite
the unwary if the code were called in slightly different ways.

1) In ufs_bmaparray() the code for calculating 'runb' will stop one block
short of the first entry in an indirect block. i.e. if an indirect block
contains N block numbers b[0]..b[N-1] then the code will never check if
b[0] and b[1] are sequential. For reference, compare with the equivalent
code that deals with direct blocks.

2) In ufs_lookup() there is an off-by-one error in the test that checks
if dp->i_diroff is outside the range of the the current directory size.
This is completely harmless, since the following while-loop condition
'dp->i_offset < endsearch' is never met, so the code immediately
does a second pass starting at dp->i_offset = 0.

3) Again in ufs_lookup(), the condition in a sanity check is wrong
for directories that are longer than one block. This bug means that
the sanity check is only effective for small directories.

Submitted by: Ian Dowse <iedowse@maths.tcd.ie>


# 57869 09-Mar-2000 dillon

In the 'found' case for ufs_lookup() the underlying bp's data was
being accessed after the bp had been releaed. A simple move of the
brelse() solves the problem.

Approved by: jkh
Submitted by: Ian Dowse <iedowse@maths.tcd.ie>


# 55697 09-Jan-2000 mckusick

Several performance improvements for soft updates have been added:
1) Fastpath deletions. When a file is being deleted, check to see if it
was so recently created that its inode has not yet been written to
disk. If so, the delete can proceed to immediately free the inode.
2) Background writes: No file or block allocations can be done while the
bitmap is being written to disk. To avoid these stalls, the bitmap is
copied to another buffer which is written thus leaving the original
available for futher allocations.
3) Link count tracking. Constantly track the difference in i_effnlink and
i_nlink so that inodes that have had no change other than i_effnlink
need not be written.
4) Identify buffers with rollback dependencies so that the buffer flushing
daemon can choose to skip over them.


# 52641 29-Oct-1999 dillon

Add sysctl debug.dircheck to allow directory sanity checking to be turned
on with a sysctl.

Fix two bugs in ufs_lookup that can cause deadlocks due to out-of-order
locking. This fix was tested for a few days prior to commit.


# 50477 27-Aug-1999 peter

$Id$ -> $FreeBSD$


# 48801 13-Jul-1999 mckusick

Create the macro DOINGASYNC to check whether the MNT_ASYNC flag has
been set for a mount point. Insert missing checks to ensure that all
write operations are done asynchronously when the MNT_ASYNC option
has been requested.

Submitted by: Craig A Soules <soules+@andrew.cmu.edu>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>


# 47964 16-Jun-1999 mckusick

Add a vnode argument to VOP_BWRITE to get rid of the last vnode
operator special case. Delete special case code from vnode_if.sh,
vnode_if.src, umap_vnops.c, and null_vnops.c.


# 43311 27-Jan-1999 dillon

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# 42374 07-Jan-1999 bde

Don't pass unused unused timestamp args to UFS_UPDATE() or waste
time initializing them. This almost finishes centralizing (in-core)
timestamp updates in ufs_itimes().


# 37555 11-Jul-1998 bde

Fixed printf format errors.


# 35205 15-Apr-1998 bde

Fixed bitrot in the non-softdep case of ufs_dirremove():
- restored async mount support. The first entry in a block is still
always written synchronously, although it probably shouldn't be in
the async case.
- restored use of BWRITE() instead of bowrite() for the DOWHITEOUT
case, although bowrite() is probably better.

Broken by: merge of softdep changes (rev.1.22).
Found by: lmbench2 delete-file benchmarks.


# 34961 30-Mar-1998 phk

Eradicate the variable "time" from the kernel, using various measures.
"time" wasn't a atomic variable, so splfoo() protection were needed
around any access to it, unless you just wanted the seconds part.

Most uses of time.tv_sec now uses the new variable time_second instead.

gettime() changed to getmicrotime(0.

Remove a couple of unneeded splfoo() protections, the new getmicrotime()
is atomic, (until Bruce sets a breakpoint in it).

A couple of places needed random data, so use read_random() instead
of mucking about with time which isn't random.

Add a new nfs_curusec() function.

Mark a couple of bogosities involving the now disappeard time variable.

Update ffs_update() to avoid the weird "== &time" checks, by fixing the
one remaining call that passwd &time as args.

Change profiling in ncr.c to use ticks instead of time. Resolution is
the same.

Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call
hzto() which subtracts time" sequences.

Reviewed by: bde


# 34266 08-Mar-1998 julian

Reviewed by: dyson@freebsd.org (john Dyson), dg@root.com (david greenman)
Submitted by: Kirk McKusick (mcKusick@mckusick.com)
Obtained from: WHistle development tree


# 33134 06-Feb-1998 eivind

Back out DIAGNOSTIC changes.


# 33108 04-Feb-1998 eivind

Turn DIAGNOSTIC into a new-style option.


# 32702 22-Jan-1998 dyson

VM level code cleanups.

1) Start using TSM.
Struct procs continue to point to upages structure, after being freed.
Struct vmspace continues to point to pte object and kva space for kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where
it allows us to avoid unneeded restart overhead during traversals, where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c
code. Use generation counts, get rid of unneded collpase operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.

This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)

This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)


# 30474 16-Oct-1997 phk

VFS mega cleanup commit (x/N)

1. Add new file "sys/kern/vfs_default.c" where default actions for
VOPs go. Implement proper defaults for ABORTOP, BWRITE, LEASE,
POLL, REVOKE and STRATEGY. Various stuff spread over the entire
tree belongs here.

2. Change VOP_BLKATOFF to a normal function in cd9660.

3. Kill VOP_BLKATOFF, VOP_TRUNCATE, VOP_VFREE, VOP_VALLOC. These
are private interface functions between UFS and the underlying
storage manager layer (FFS/LFS/MFS/EXT2FS). The functions now
live in struct ufsmount instead.

4. Remove a kludge of VOP_ functions in all filesystems, that did
nothing but obscure the simplicity and break the expandability.
If a filesystem doesn't implement VOP_FOO, it shouldn't have an
entry for it in its vnops table. The system will try to DTRT
if it is not implemented. There are still some cruft left, but
the bulk of it is done.

5. Fix another VCALL in vfs_cache.c (thanks Bruce!)


# 29287 10-Sep-1997 phk

Update the comment and remove checks now done centrally.


# 29041 02-Sep-1997 bde

Removed unused #includes.


# 28787 26-Aug-1997 phk

Uncut&paste cache_lookup().

This unifies several times in theory indentical 50 lines of code.

The filesystems have a new method: vop_cachedlookup, which is the
meat of the lookup, and use vfs_cache_lookup() for their vop_lookup
method. vfs_cache_lookup() will check the namecache and pass on
to the vop_cachedlookup method in case of a miss.

It's still the task of the individual filesystems to populate the
namecache with cache_enter().

Filesystems that do not use the namecache will just provide the
vop_lookup method as usual.


# 24438 31-Mar-1997 peter

Treat symlinks as first class citizens with their own uid/gid rather than
as shadows of their containing directory. This should solve the problem
of users not being able to delete their symlinks from /tmp once and for
all.

Symlinks do not have modes though, they are accessable to everything that
can read the directory (as before). They are made to show this fact at
lstat time (they appear as mode 0777 always, since that's how the the
lookup routines in the kernel treat them).

More commits will follow, eg: add a real lchown() syscall and man pages.


# 23562 09-Mar-1997 mpp

Update a number of routines to reflect the actual name
of the routine that caused the panic.


# 22975 22-Feb-1997 peter

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 22521 10-Feb-1997 dyson

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# 21673 14-Jan-1997 jkh

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 18069 06-Sep-1996 gibbs

Use bowrite instead of VOP_BWRITE in a few cases. This can probably be taken
further.


# 12120 06-Nov-1995 dyson

This commit causes UFS to perform at Linux EXT2FS metadata rates. After
earlier discussions with DG, and a recent email exchange with SEF, I
decided to allow UFS to run wide-open on an experimental basis. We
will probably support eventually multiple async modes, and this is
the fastest the we can expect. Just use the -o async flag on the
UFS mount. Good luck...


# 11644 22-Oct-1995 dg

Moved the filesystem read-only check out of the syscalls and into the
filesystem layer, as was done in lite-2. Merged in some other cosmetic
changes while I was at it. Rewrote most of msdosfs_access() to be more
like ufs_access() and to include the FS read-only check.

Obtained from: partially from 4.4BSD-lite2


# 11264 06-Oct-1995 phk

use roundup2 to avoid a bunch of 64bit divides.


# 10358 28-Aug-1995 julian

Reviewed by: julian with quick glances by bruce and others
Submitted by: terry (terry lambert)
This is a composite of 3 patch sets submitted by terry.
they are:
New low-level init code that supports loadbal modules better
some cleanups in the namei code to help terry in 16-bit character support
some changes to the mount-root code to make it a little more
modular..

NOTE: mounting root off cdrom or NFS MIGHT be broken as I haven't been able
to test those cases..

certainly mounting root of disk still works just fine..
mfs should work but is untested. (tomorrows task)

The low level init stuff includes a total rewrite of init_main.c
to make it possible for new modules to have an init phase by simply
adding an entry to a TEXT_SET (or is it DATA_SET) list. thus a new module can
be added to the kernel without editing any other files other than the
'files' file.


# 9759 29-Jul-1995 bde

Eliminate sloppy common-style declarations. There should be none left for
the LINT configuation.


# 3427 08-Oct-1994 phk

POSSIBLE BOGUS CODE found, (related to dos-partitions) in ufs_disksubr.c,
look for CC_WALL.
Cosmetics, a couple of unused vars.


# 1817 02-Aug-1994 dg

Added $Id$


# 1542 24-May-1994 rgrimes

This commit was generated by cvs2svn to compensate for changes in r1541,
which included commits to RCS files with non-trunk default branches.


# 1541 24-May-1994 rgrimes

BSD 4.4 Lite Kernel Sources